Reuters-21578 is a collection of about 20,000 newswire documents (see the references for more information, downloads, and the copyright notice), structured using SGML and categorized with 672 labels. The labels are divided into five main categories:
- Topics
- Places
- People
- Organizations
- Exchanges
However, most of the labels are unused, and the distribution is far from homogeneous. These are the top 20 categories (the prefix consists of the first two letters of the main category) with the number of associated newslines:
ID | Name | Category | Newslines |
---|---|---|---|
161 | pl_usa | Places | 12542 |
533 | to_earn | Topics | 3987 |
498 | to_acq | Topics | 2448 |
158 | pl_uk | Places | 1489 |
84 | pl_japan | Places | 1138 |
31 | pl_canada | Places | 1104 |
571 | to_money-fx | Topics | 801 |
526 | to_crude | Topics | 634 |
543 | to_grain | Topics | 628 |
167 | pl_west-germany | Places | 567 |
624 | to_trade | Topics | 552 |
553 | to_interest | Topics | 513 |
56 | pl_france | Places | 469 |
185 | or_ec | Organizations | 349 |
23 | pl_brazil | Places | 332 |
628 | to_wheat | Topics | 306 |
606 | to_ship | Topics | 305 |
10 | pl_australia | Places | 270 |
517 | to_corn | Topics | 254 |
37 | pl_china | Places | 223 |
In the experiment (a Jupyter notebook) that you can find in this GitHub repository, I've defined a pipeline for a One-vs-Rest categorization method, using Word2Vec (implemented by Gensim), which is much more effective than a standard bag-of-words or TF-IDF approach, and LSTM neural networks (modeled with Keras, with Theano/GPU support). The pipeline is based on the following steps (just like a sentiment analysis approach):
- Category and document acquisition (I suggest looking at the complete code on GitHub). In short, I've used BeautifulSoup to parse all SGML files, removing all unwanted tags, and a simple regex to strip the ending signature (a minimal sketch of this step is shown right after the tokenization code below).
- Tokenizing (with WordNet lemmatization and stop-word filtering, both implemented by the NLTK framework):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, sent_tokenize

# Load stop-words
stop_words = set(stopwords.words('english'))

# Initialize tokenizer
# It's also possible to try with a stemmer or to mix a stemmer and a lemmatizer
tokenizer = RegexpTokenizer("['a-zA-Z]+")

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenized document collection
newsline_documents = []

def tokenize(document):
    words = []

    for sentence in sent_tokenize(document):
        tokens = [lemmatizer.lemmatize(t.lower()) for t in tokenizer.tokenize(sentence)
                  if t.lower() not in stop_words]
        words += tokens

    return words
```
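For reference, here is a minimal sketch of the acquisition step described in the first bullet. It is not the exact code from the repository: the `reut2-*.sgm` file pattern, the latin-1 encoding, the signature regex, and the availability of `selected_categories` (the chosen labels, named as in the table above) are all assumptions.

```python
import glob
import re
from bs4 import BeautifulSoup
from numpy import zeros, float32

# Assumptions: data_folder contains the extracted reut2-*.sgm files and
# selected_categories is the list of chosen label names (prefixed as in the
# table above, e.g. 'to_earn', 'pl_usa'), both defined earlier in the notebook
signature_regex = re.compile(r'\s*reuter\s*$', re.IGNORECASE)

document_X = {}  # newsline id -> raw text
document_Y = {}  # newsline id -> binary category vector

for sgml_file in glob.glob(data_folder + 'reut2-*.sgm'):
    # latin-1 avoids decoding errors on the few non-ASCII bytes in the corpus
    with open(sgml_file, 'r', encoding='latin-1') as f:
        # html.parser is lenient enough to cope with the SGML markup
        soup = BeautifulSoup(f.read(), 'html.parser')

    for newsline in soup.find_all('reuters'):
        body = newsline.find('body')
        if body is None:
            continue

        # Drop all remaining tags and strip the ending signature
        text = signature_regex.sub('', body.get_text())

        # Multi-hot encoding of the categories attached to this newsline
        labels = zeros(len(selected_categories)).astype(float32)
        for prefix, tag in [('to_', 'topics'), ('pl_', 'places'), ('pe_', 'people'),
                            ('or_', 'orgs'), ('ex_', 'exchanges')]:
            container = newsline.find(tag)
            if container is None:
                continue
            for d in container.find_all('d'):
                name = prefix + d.get_text()
                if name in selected_categories:
                    labels[selected_categories.index(name)] = 1.0

        document_X[newsline['newid']] = text
        document_Y[newsline['newid']] = labels

# Tokenize every document with the function defined above
newsline_documents = [tokenize(document_X[key]) for key in document_X.keys()]
number_of_documents = len(newsline_documents)
```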
- Word2Vec training:

```python
from multiprocessing import cpu_count
from gensim.models import Word2Vec

# Create new Gensim Word2Vec model
# num_features (the word-vector dimensionality) is defined earlier in the notebook
w2v_model = Word2Vec(newsline_documents, size=num_features, min_count=1, window=10, workers=cpu_count())
w2v_model.init_sims(replace=True)
w2v_model.save(data_folder + 'reuters.word2vec')
```
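As a quick sanity check (not part of the original pipeline), the trained embeddings can be queried for nearest neighbours with the same, older Gensim API used above; `'oil'` is just an example query term:

```python
# Reload the persisted model and inspect a few nearest neighbours
w2v_model = Word2Vec.load(data_folder + 'reuters.word2vec')
print(w2v_model.most_similar('oil', topn=5))
```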
- Word2Vec conversion (each document becomes a `document_max_num_words` × `num_features` matrix of word vectors):

```python
from numpy import zeros, float32

num_categories = len(selected_categories)

# document_max_num_words and num_features are defined earlier in the notebook
X = zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(float32)
Y = zeros(shape=(number_of_documents, num_categories)).astype(float32)

empty_word = zeros(num_features).astype(float32)

for idx, document in enumerate(newsline_documents):
    for jdx, word in enumerate(document):
        if jdx == document_max_num_words:
            break
        else:
            if word in w2v_model:
                X[idx, jdx, :] = w2v_model[word]
            else:
                X[idx, jdx, :] = empty_word

for idx, key in enumerate(document_Y.keys()):
    Y[idx, :] = document_Y[key]
```
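The `X_train`/`X_test` split used in the training step is not shown in this excerpt; a 70/30 split with scikit-learn's `train_test_split` matches the sample counts reported in the training log (15104 train, 6474 validation), so the following sketch is a reasonable reconstruction rather than the original code:

```python
from sklearn.model_selection import train_test_split

# 70/30 split, consistent with the 15104/6474 samples shown in the log below
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
```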
- LSTM network training and evaluation:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM

model = Sequential()

model.add(LSTM(int(document_max_num_words*1.5), input_shape=(document_max_num_words, num_features)))
model.add(Dropout(0.3))
model.add(Dense(num_categories))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(X_train, Y_train, batch_size=128, nb_epoch=5, validation_data=(X_test, Y_test))

# Evaluate model
score, acc = model.evaluate(X_test, Y_test, batch_size=128)

print('Score: %1.4f' % score)
print('Accuracy: %1.4f' % acc)
```
I've tested different neural models and got the best results with an LSTM layer of 150 cells (1.5 × the maximum number of words per document), a dropout of 0.3-0.4, and the Adam optimizer. A batch size below 64 noticeably slows down learning; in my experiments, 128 is an excellent value. After five epochs (which should be increased in a real-world scenario), I obtained these results:
```
Train on 15104 samples, validate on 6474 samples
Epoch 1/5
15104/15104 [==============================] - 38s - loss: 0.5610 - acc: 0.7168 - val_loss: 0.6814 - val_acc: 0.5828
Epoch 2/5
15104/15104 [==============================] - 40s - loss: 0.5994 - acc: 0.6597 - val_loss: 0.4435 - val_acc: 0.8230
Epoch 3/5
15104/15104 [==============================] - 37s - loss: 0.3949 - acc: 0.8477 - val_loss: 0.3765 - val_acc: 0.8557
Epoch 4/5
15104/15104 [==============================] - 38s - loss: 0.3567 - acc: 0.8707 - val_loss: 0.3415 - val_acc: 0.8735
Epoch 5/5
15104/15104 [==============================] - 37s - loss: 0.3408 - acc: 0.8761 - val_loss: 0.3269 - val_acc: 0.8837
6474/6474 [==============================] - 5s
Score: 0.3269
Accuracy: 0.8837
```
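Since the output layer is a sigmoid over `num_categories` independent units, the predicted labels are obtained by thresholding the probabilities; the 0.5 threshold below is a common default, not a value taken from the original experiment:

```python
# Predict category probabilities for the test set and threshold them
Y_prob = model.predict(X_test, batch_size=128)
Y_pred = (Y_prob >= 0.5).astype('float32')

# Names of the categories assigned to the first test document
predicted_labels = [selected_categories[i] for i in range(num_categories) if Y_pred[0, i] == 1.0]
print(predicted_labels)
```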
References
- Dataset: https://kdd.ics.uci.edu/databases/reuters21578/
- Dataset info: http://kdd.ics.uci.edu/databases/reuters21578/README.txt
- Gensim Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html
- Word2Vec Tutorial: http://radimrehurek.com/2014/02/word2vec-tutorial/
- Word2Vec and Doc2Vec models: http://arxiv.org/pdf/1405.4053v2.pdf
- LSTM: http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf