Reuters-21578 text classification with Gensim and Keras

Reuters-21578 is a collection of about 20K news-lines (see the reference at the end of the post for more information, downloads and the copyright notice), structured using SGML and categorized with 672 labels. The labels are divided into five main categories:

  • Topics
  • Places
  • People
  • Organizations
  • Exchanges

 

However, most of them are unused and, looking at the distribution, the lack of homogeneity is evident. These are the top 20 categories (the prefix is built with the first two letters of each main category) with the number of related news-lines:

ID   Name             Category       Newslines
161  pl_usa           Places         12542
533  to_earn          Topics         3987
498  to_acq           Topics         2448
158  pl_uk            Places         1489
84   pl_japan         Places         1138
31   pl_canada        Places         1104
571  to_money-fx      Topics         801
526  to_crude         Topics         634
543  to_grain         Topics         628
167  pl_west-germany  Places         567
624  to_trade         Topics         552
553  to_interest      Topics         513
56   pl_france        Places         469
185  or_ec            Organizations  349
23   pl_brazil        Places         332
628  to_wheat         Topics         306
606  to_ship          Topics         305
10   pl_australia     Places         270
517  to_corn          Topics         254
37   pl_china         Places         223

 

In the "experiment" (as Jupyter notebook) you can find on this Github repository, I've defined a pipeline for a One-Vs-Rest categorization method, using Word2Vec  (implemented by Gensim), which is much more effective than a standard bag-of-words or Tf-Idf approach, and LSTM neural networks (modeled with Keras with Theano/GPU support - See https://goo.gl/YWn4Xj for an example written by its author, using an Embedding layer instead of Word2Vec). The pipeline is based on the following steps (just like a sentiment analysis approach):

  1. Category and document acquisition (I suggest looking at the full code on Github): I've used BeautifulSoup to parse all the SGML files, removing all unwanted tags, and a simple regex to strip the ending signature.
  2. Tokenizing (with WordNet lemmatization and stop-word filtering, both implemented by the NLTK framework)
  3. Word2Vec training
  4. Word2Vec conversion
  5. LSTM network training
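
As a rough illustration of steps 1 and 2 (not the full notebook code), here is a minimal sketch of how the SGML files can be parsed with BeautifulSoup and tokenized with NLTK. The directory layout, the signature regex and the token pattern are assumptions, the category extraction is omitted, and the NLTK stop-words and WordNet corpora must be downloaded beforehand:

    import re
    from glob import glob

    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import RegexpTokenizer

    # Load stop-words
    stop_words = set(stopwords.words('english'))

    # Regexp tokenizer (alphabetic tokens only) and WordNet lemmatizer
    tokenizer = RegexpTokenizer('[a-zA-Z]+')
    lemmatizer = WordNetLemmatizer()

    def tokenize(text):
        # Lower-case, filter stop-words and lemmatize every token
        return [lemmatizer.lemmatize(token.lower())
                for token in tokenizer.tokenize(text)
                if token.lower() not in stop_words]

    documents = []

    # Step 1: parse every SGML file and collect the news bodies
    for filename in glob('reuters21578/reut2-*.sgm'):
        with open(filename, 'r', encoding='latin-1') as sgml_file:
            soup = BeautifulSoup(sgml_file.read(), 'html.parser')

        for news in soup.find_all('reuters'):
            body = news.find('body')
            if body is None:
                continue
            text = body.get_text()
            # Strip the ending signature (a trailing "Reuter"/"REUTER");
            # the exact pattern may need tuning
            text = re.sub(r'\s*reuters?\s*\x03?\s*$', '', text, flags=re.IGNORECASE)
            documents.append(text)

    # Step 2: tokenize every document
    tokenized_docs = [tokenize(document) for document in documents]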

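Steps 3 and 4 could be sketched as follows: Gensim's Word2Vec is trained on the tokenized corpus, then every document is converted into a fixed-size matrix of word vectors to be fed to the LSTM. The vector size, window and maximum document length used here are assumptions (and note that the vector_size parameter is called size in Gensim releases before 4.0):

    import numpy as np
    from gensim.models import Word2Vec

    num_features = 100     # dimensionality of every word vector (assumed)
    max_doc_length = 100   # documents are truncated/zero-padded to this many tokens (assumed)

    # Step 3: Word2Vec training on the tokenized corpus
    w2v = Word2Vec(sentences=tokenized_docs,
                   vector_size=num_features,   # "size" in Gensim < 4.0
                   window=10,
                   min_count=1,
                   workers=4)

    # Step 4: Word2Vec conversion - every document becomes a
    # (max_doc_length, num_features) matrix of word vectors, zero-padded at the end
    X = np.zeros((len(tokenized_docs), max_doc_length, num_features), dtype=np.float32)

    for i, doc in enumerate(tokenized_docs):
        for j, token in enumerate(doc[:max_doc_length]):
            if token in w2v.wv:
                X[i, j, :] = w2v.wv[token]
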
I've tested different neural models and I got the best results with an LSTM layer of 150 cells (1.5 x the number of words), a dropout of 0.3-0.4 and the "Adam" optimizer. A batch size smaller than 64 can slow down the learning process; in my experiments, 128 is a very good value. After 5 epochs, I obtained the results reported in the notebook.
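
Finally, a sketch of the Keras model with the settings quoted above (150 LSTM cells, a dropout in the 0.3-0.4 range, a sigmoid output so that every unit behaves as an independent one-vs-rest classifier, Adam optimizer, batch size 128, 5 epochs). X, max_doc_length and num_features come from the previous sketch, while the number of categories and the label matrix Y are assumptions (Y is only a placeholder here; in the real pipeline it is built from the category tags during acquisition):

    import numpy as np

    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense

    num_categories = 20   # e.g. the top categories listed above (assumed)

    # Placeholder for the real (num_documents, num_categories) multi-hot label
    # matrix, built in the actual pipeline from the <TOPICS>/<PLACES>/... tags
    Y = np.zeros((X.shape[0], num_categories), dtype=np.float32)

    model = Sequential()
    model.add(LSTM(150, input_shape=(max_doc_length, num_features)))
    model.add(Dropout(0.35))
    model.add(Dense(num_categories, activation='sigmoid'))

    # A sigmoid output with binary cross-entropy treats every unit as an
    # independent one-vs-rest classifier for its category
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    model.fit(X, Y, batch_size=128, epochs=5, validation_split=0.1)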

 

Reference:

  • Reuters-21578 Text Categorization Collection (distributed by David D. Lewis): http://www.daviddlewis.com/resources/testcollections/reuters21578/
