Reuters-21578 text classification with Gensim and Keras

Reuters-21578 is a collection of 21,578 newslines (see the reference for more information, downloads, and the copyright notice), structured using SGML and categorized with 672 labels. The labels are grouped into five main category sets:

    • Topics
    • Places
    • People
    • Organizations
    • Exchanges

However, most of the labels are unused, and the distribution is far from homogeneous. These are the top 20 categories (the prefix is the first two letters of the corresponding category set) with the number of related newslines:

ID Name Category Newslines
161 pl_usa Places 12542
533 to_earn Topics 3987
498 to_acq Topics 2448
158 pl_uk Places 1489
84 pl_japan Places 1138
31 pl_canada Places 1104
571 to_money-fx Topics 801
526 to_crude Topics 634
543 to_grain Topics 628
167 pl_west-germany Places 567
624 to_trade Topics 552
553 to_interest Topics 513
56 pl_france Places 469
185 or_ec Organizations 349
23 pl_brazil Places 332
628 to_wheat Topics 306
606 to_ship Topics 305
10 pl_australia Places 270
517 to_corn Topics 254
37 pl_china Places 223

In the “experiment” (a Jupyter notebook) you can find in this Github repository, I’ve defined a pipeline for a One-vs-Rest categorization method based on Word2Vec embeddings (implemented by Gensim), which are much more effective than a standard bag-of-words or Tf-Idf representation, and an LSTM neural network (modeled with Keras with Theano/GPU support). The pipeline is based on the following steps (just like a sentiment analysis approach):

    1. Category and document acquisition (I suggest seeing the complete code on Github). In short, I’ve used BeautifulSoup to parse all SGML files, removing the unwanted tags, and a simple regex to strip the ending signature; a minimal sketch follows right after this list.
    2. Tokenizing (with WordNet lemmatization and stop-word filtering, both implemented by the NLTK framework)
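
The following is only a minimal sketch of step 1, not the notebook’s actual code: it assumes the corpus has been unpacked as the standard reut2-*.sgm files inside data_folder, it only collects the TOPICS labels (the notebook handles all five category sets), and document_X / document_Y_raw are hypothetical names for the raw texts and their labels.

# Sketch of step 1 (assumption: reut2-*.sgm files live in data_folder)
import re
from glob import glob
from bs4 import BeautifulSoup

document_X = []      # raw newsline texts
document_Y_raw = []  # raw topic labels per newsline

for filename in glob(data_folder + 'reut2-*.sgm'):
    with open(filename, 'rb') as sgml_file:
        soup = BeautifulSoup(sgml_file.read(), 'html.parser')

    for newsline in soup.find_all('reuters'):
        text_tag = newsline.find('text')
        if text_tag is None:
            continue

        # Strip the ending "Reuter" signature with a simple regex
        content = re.sub(r'\s*reuters?\W*$', '', text_tag.get_text(), flags=re.IGNORECASE)

        # Only TOPICS labels are collected in this sketch
        topics_tag = newsline.find('topics')
        topics = [d.get_text() for d in topics_tag.find_all('d')] if topics_tag else []

        document_X.append(content)
        document_Y_raw.append(topics)

The tokenization code for step 2 follows.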

 

# Imports needed by this step (NLTK)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, sent_tokenize

# Load stop-words
stop_words = set(stopwords.words('english'))

# Initialize tokenizer
# It's also possible to try with a stemmer or to mix a stemmer and a lemmatizer
tokenizer = RegexpTokenizer('[a-zA-Z]+')

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenized document collection
newsline_documents = []

def tokenize(document):
    words = []

    for sentence in sent_tokenize(document):
        tokens = [lemmatizer.lemmatize(t.lower()) for t in tokenizer.tokenize(sentence) if t.lower() not in stop_words]
        words += tokens

    return words
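
For completeness, here is how the collection can then be filled; a sketch only, where document_X is the hypothetical list of raw texts from the step 1 sketch above and number_of_documents is the count used in the conversion step.

# Tokenize every raw newsline (document_X is the hypothetical list from the step 1 sketch)
newsline_documents = [tokenize(document) for document in document_X]
number_of_documents = len(newsline_documents)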

    3. Word2Vec training:

# Create new Gensim Word2Vec model
# (size= and init_sims() follow the older Gensim API used in the original notebook;
#  num_features and data_folder are defined earlier in the notebook)
from multiprocessing import cpu_count
from gensim.models import Word2Vec

w2v_model = Word2Vec(newsline_documents, size=num_features, min_count=1, window=10, workers=cpu_count())
w2v_model.init_sims(replace=True)
w2v_model.save(data_folder + 'reuters.word2vec')
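
As a quick sanity check (not part of the original pipeline), the trained embeddings can be queried with the same older Gensim API; 'wheat' is just an arbitrary example token from the corpus:

# Inspect the learned embedding space (older Gensim API: queries on the model object)
print(w2v_model.most_similar('wheat', topn=5))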

    4. Word2Vec conversion:

# Imports needed by this step
from numpy import zeros, float32

# selected_categories, document_max_num_words and document_Y are built earlier in the notebook;
# document_Y maps each newsline to its binary category vector
num_categories = len(selected_categories)
X = zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(float32)
Y = zeros(shape=(number_of_documents, num_categories)).astype(float32)

empty_word = zeros(num_features).astype(float32)

for idx, document in enumerate(newsline_documents):
    for jdx, word in enumerate(document):
        if jdx == document_max_num_words:
            break
        else:
            # 'word in w2v_model' and 'w2v_model[word]' follow the older Gensim API
            if word in w2v_model:
                X[idx, jdx, :] = w2v_model[word]
            else:
                X[idx, jdx, :] = empty_word

for idx, key in enumerate(document_Y.keys()):
    Y[idx, :] = document_Y[key]
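
X and Y are then split into training and test sets before fitting the network; a minimal sketch, assuming scikit-learn's train_test_split with a 30% hold-out (which matches the 15104/6474 sample counts reported below):

# Hold out 30% of the samples for validation (assumption: scikit-learn is used for the split)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)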

    5. LSTM network training:

# Imports needed by this step (Keras 1.x with the Theano backend; nb_epoch is the Keras 1.x argument)
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM

model = Sequential()

model.add(LSTM(int(document_max_num_words*1.5), input_shape=(document_max_num_words, num_features)))
model.add(Dropout(0.3))
# One sigmoid output per category (One-vs-Rest), hence the binary cross-entropy loss
model.add(Dense(num_categories))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(X_train, Y_train, batch_size=128, nb_epoch=5, validation_data=(X_test, Y_test))

# Evaluate model
score, acc = model.evaluate(X_test, Y_test, batch_size=128)

print('Score: %1.4f' % score)
print('Accuracy: %1.4f' % acc)

I’ve tested different neural models and got the best results with an LSTM layer of 150 cells (1.5 x the maximum number of words per document), a dropout of 0.3-0.4, and the Adam optimizer. A batch size smaller than 64 can slow down learning; in my experiments, 128 is an excellent value. After five epochs (more would be needed in a real-world scenario), I got these results:

Train on 15104 samples, validate on 6474 samples
Epoch 1/5
15104/15104 [==============================] - 38s - loss: 0.5610 - acc: 0.7168 - val_loss: 0.6814 - val_acc: 0.5828
Epoch 2/5
15104/15104 [==============================] - 40s - loss: 0.5994 - acc: 0.6597 - val_loss: 0.4435 - val_acc: 0.8230
Epoch 3/5
15104/15104 [==============================] - 37s - loss: 0.3949 - acc: 0.8477 - val_loss: 0.3765 - val_acc: 0.8557
Epoch 4/5
15104/15104 [==============================] - 38s - loss: 0.3567 - acc: 0.8707 - val_loss: 0.3415 - val_acc: 0.8735
Epoch 5/5
15104/15104 [==============================] - 37s - loss: 0.3408 - acc: 0.8761 - val_loss: 0.3269 - val_acc: 0.8837
6474/6474 [==============================] - 5s     
Score: 0.3269
Accuracy: 0.8837
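
Since the network emits one independent sigmoid per category, turning its outputs into label sets only requires thresholding; a minimal sketch (the 0.5 threshold is my assumption, not taken from the notebook):

# Predict per-category probabilities and threshold them at 0.5 (assumed threshold)
Y_pred = model.predict(X_test, batch_size=128)

for probabilities in Y_pred[:3]:
    labels = [category for category, p in zip(selected_categories, probabilities) if p >= 0.5]
    print(labels)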

Reference

