Reuters-21578 text classification with Gensim and Keras


Reuters-21578 is a collection of about 20,000 newslines (see the reference below for more information, downloads and copyright notice), structured using SGML and categorized with 672 labels. They are divided into five main categories:

  • Topics
  • Places
  • People
  • Organizations
  • Exchanges

However, most of them are unused and, looking at the distribution, it’s clear that the labels are far from evenly represented. These are the top 20 categories (each prefix is built from the first two letters of its main category), with the number of related newslines:

ID   Name             Category       Newslines
161  pl_usa           Places         12542
533  to_earn          Topics         3987
498  to_acq           Topics         2448
158  pl_uk            Places         1489
84   pl_japan         Places         1138
31   pl_canada        Places         1104
571  to_money-fx      Topics         801
526  to_crude         Topics         634
543  to_grain         Topics         628
167  pl_west-germany  Places         567
624  to_trade         Topics         552
553  to_interest      Topics         513
56   pl_france        Places         469
185  or_ec            Organizations  349
23   pl_brazil        Places         332
628  to_wheat         Topics         306
606  to_ship          Topics         305
10   pl_australia     Places         270
517  to_corn          Topics         254
37   pl_china         Places         223
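
These frequencies can be recomputed directly from the SGML files. The snippet below is a minimal sketch (not the notebook’s exact code) that parses the .sgm files with BeautifulSoup and counts how many newslines carry each label; the data_folder path and the pe_/ex_ prefixes (which don’t appear in the table above) are my assumptions.

    # Sketch: count how many newslines carry each label, parsing the SGML files with BeautifulSoup
    from collections import Counter
    from glob import glob
    
    from bs4 import BeautifulSoup
    
    data_folder = 'reuters21578/'  # assumed location of the reut2-*.sgm files
    
    prefixes = {'topics': 'to_', 'places': 'pl_', 'people': 'pe_',
                'orgs': 'or_', 'exchanges': 'ex_'}
    
    label_counts = Counter()
    
    for filename in glob(data_folder + '*.sgm'):
        with open(filename, 'rb') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
    
        # Each <REUTERS> element is a single newsline
        for newsline in soup.find_all('reuters'):
            for tag, prefix in prefixes.items():
                container = newsline.find(tag)
                if container is None:
                    continue
                # Individual labels are wrapped in <D> tags
                for d in container.find_all('d'):
                    label_counts[prefix + d.get_text()] += 1
    
    # Top 20 labels by number of newslines
    for label, count in label_counts.most_common(20):
        print('%-20s %d' % (label, count))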

 

In the “experiment” (a Jupyter notebook you can find in this GitHub repository), I’ve defined a pipeline for a One-vs-Rest categorization method, using Word2Vec (implemented by Gensim), which in my tests is far more effective than a standard bag-of-words or TF-IDF approach, and LSTM neural networks (modeled with Keras with Theano/GPU support – see https://goo.gl/YWn4Xj for an example written by its author, which uses an Embedding layer instead of Word2Vec). The pipeline is based on the following steps (much like a sentiment analysis approach):

  1. Category and document acquisition (I suggest looking at the full code on GitHub): I’ve used BeautifulSoup to parse all the SGML files, removing the unwanted tags, and a simple regex to strip the ending signature. A minimal parsing sketch is shown after the category table above.
  2. Tokenizing (with WordNet lemmatization and stop-words filtering, both implemented by the NLTK framework)
    # NLTK imports (the 'punkt', 'stopwords' and 'wordnet' data must be downloaded first)
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import RegexpTokenizer, sent_tokenize
    
    # Load stop-words
    stop_words = set(stopwords.words('english'))
    
    # Initialize tokenizer
    # It's also possible to try with a stemmer or to mix a stemmer and a lemmatizer
    tokenizer = RegexpTokenizer("['a-zA-Z]+")
    
    # Initialize lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Tokenized document collection
    newsline_documents = []
    
    def tokenize(document):
        words = []
    
        for sentence in sent_tokenize(document):
            tokens = [lemmatizer.lemmatize(t.lower()) for t in tokenizer.tokenize(sentence)
                      if t.lower() not in stop_words]
            words += tokens
    
        return words
    
    
  3. Word2Vec training
    from multiprocessing import cpu_count
    from gensim.models import Word2Vec
    
    # Create new Gensim Word2Vec model ('size' has been renamed to 'vector_size' in Gensim >= 4.0)
    w2v_model = Word2Vec(newsline_documents, size=num_features, min_count=1, window=10, workers=cpu_count())
    w2v_model.init_sims(replace=True)
    w2v_model.save(data_folder + 'reuters.word2vec')
  4. Word2Vec conversion
    from numpy import zeros, float32
    
    num_categories = len(selected_categories)
    
    # X: one sequence of word vectors per document (cut at document_max_num_words)
    # Y: one multi-hot category vector per document
    X = zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(float32)
    Y = zeros(shape=(number_of_documents, num_categories)).astype(float32)
    
    empty_word = zeros(num_features).astype(float32)
    
    for idx, document in enumerate(newsline_documents):
        for jdx, word in enumerate(document):
            if jdx == document_max_num_words:
                break
            else:
                # In Gensim >= 4.0, use w2v_model.wv for both the membership check and the lookup
                if word in w2v_model:
                    X[idx, jdx, :] = w2v_model[word]
                else:
                    X[idx, jdx, :] = empty_word
    
    # document_Y maps each document key to its multi-hot category vector
    for idx, key in enumerate(document_Y.keys()):
        Y[idx, :] = document_Y[key]
  5. LSTM network training
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation, LSTM
    
    model = Sequential()
    
    # LSTM layer with 1.5 x document_max_num_words cells, one sigmoid output per category
    model.add(LSTM(int(document_max_num_words*1.5), input_shape=(document_max_num_words, num_features)))
    model.add(Dropout(0.3))
    model.add(Dense(num_categories))
    model.add(Activation('sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    # Train model ('nb_epoch' is the Keras 1.x name; newer versions use 'epochs')
    model.fit(X_train, Y_train, batch_size=128, nb_epoch=5, validation_data=(X_test, Y_test))
    
    # Evaluate model
    score, acc = model.evaluate(X_test, Y_test, batch_size=128)
    
    print('Score: %1.4f' % score)
    print('Accuracy: %1.4f' % acc)
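
The training call above uses X_train, X_test, Y_train and Y_test, which are not built in the snippets shown here (see the full code on GitHub). As a minimal sketch, a plain random split with scikit-learn would look like this; the 70/30 proportion is an assumption suggested by the sample counts in the log below (15104 train / 6474 validation):

    from sklearn.model_selection import train_test_split
    
    # Hypothetical split (the notebook may build it differently)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)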

I’ve tested different neural models, and I got the best results with an LSTM layer of 150 cells (1.5 x the number of words per document), a dropout of 0.3-0.4 and the “Adam” optimizer. A batch size smaller than 64 can slow down learning; in my experiments, 128 is a very good value. After 5 epochs (the number should be increased in a real-world scenario), I got these results:

Train on 15104 samples, validate on 6474 samples
Epoch 1/5
15104/15104 [==============================] - 38s - loss: 0.5610 - acc: 0.7168 - val_loss: 0.6814 - val_acc: 0.5828
Epoch 2/5
15104/15104 [==============================] - 40s - loss: 0.5994 - acc: 0.6597 - val_loss: 0.4435 - val_acc: 0.8230
Epoch 3/5
15104/15104 [==============================] - 37s - loss: 0.3949 - acc: 0.8477 - val_loss: 0.3765 - val_acc: 0.8557
Epoch 4/5
15104/15104 [==============================] - 38s - loss: 0.3567 - acc: 0.8707 - val_loss: 0.3415 - val_acc: 0.8735
Epoch 5/5
15104/15104 [==============================] - 37s - loss: 0.3408 - acc: 0.8761 - val_loss: 0.3269 - val_acc: 0.8837
6474/6474 [==============================] - 5s     
Score: 0.3269
Accuracy: 0.8837
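
Since the output layer is made of per-category sigmoid units trained with binary cross-entropy, each output can be read as an independent probability that a newsline belongs to that category (and the accuracy above is computed over all document/category pairs). As a sketch, predictions could be mapped back to category names like this, assuming selected_categories is ordered like the columns of Y (that code is not shown here):

    # Sketch: turn the sigmoid outputs into predicted category names
    # Assumes 'selected_categories' is ordered like the columns of Y
    predictions = model.predict(X_test, batch_size=128)
    
    threshold = 0.5
    for doc_idx, probabilities in enumerate(predictions[:5]):
        predicted = [selected_categories[cat_idx]
                     for cat_idx, p in enumerate(probabilities) if p >= threshold]
        print('Document %d -> %s' % (doc_idx, predicted))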

 

Reference:

  • Lewis, D. D., Reuters-21578 Text Categorization Collection, Distribution 1.0 – http://www.daviddlewis.com/resources/testcollections/reuters21578/


See also:

Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Networks – Giuseppe Bonaccorso
