Reuters-21578 is a collection of about 20,000 newswire documents (see the references for more information, downloads, and the copyright notice), structured using SGML and categorized with 672 labels. The labels are divided into five main categories:
- Topics
- Places
- People
- Organizations
- Exchanges
However, most of the labels are unused, and the distribution is far from homogeneous. These are the top 20 categories (the prefix consists of the first two letters of the main category) with the number of associated newslines:
ID | Name | Category | Newslines |
---|---|---|---|
161 | pl_usa | Places | 12542 |
533 | to_earn | Topics | 3987 |
498 | to_acq | Topics | 2448 |
158 | pl_uk | Places | 1489 |
84 | pl_japan | Places | 1138 |
31 | pl_canada | Places | 1104 |
571 | to_money-fx | Topics | 801 |
526 | to_crude | Topics | 634 |
543 | to_grain | Topics | 628 |
167 | pl_west-germany | Places | 567 |
624 | to_trade | Topics | 552 |
553 | to_interest | Topics | 513 |
56 | pl_france | Places | 469 |
185 | or_ec | Organizations | 349 |
23 | pl_brazil | Places | 332 |
628 | to_wheat | Topics | 306 |
606 | to_ship | Topics | 305 |
10 | pl_australia | Places | 270 |
517 | to_corn | Topics | 254 |
37 | pl_china | Places | 223 |
In the experiment (a Jupyter notebook) that you can find in this GitHub repository, I've defined a pipeline for a One-vs-Rest categorization method, using Word2Vec (implemented by Gensim), which is much more effective than a standard bag-of-words or TF-IDF approach, and LSTM neural networks (modeled with Keras, with Theano/GPU support). The pipeline is based on the following steps (just like a sentiment analysis approach):
- Category and document acquisition (I suggest looking at the complete code on GitHub). In short, I've used BeautifulSoup to parse all SGML files, removing all unwanted tags, and a simple regex to strip the ending signature (a minimal sketch of this step is shown right after the tokenization code below).
- Tokenizing (with WordNet lemmatization and stop-word filtering, both implemented by the NLTK framework):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer, sent_tokenize

# Load stop-words
stop_words = set(stopwords.words('english'))

# Initialize tokenizer
# It's also possible to try with a stemmer or to mix a stemmer and a lemmatizer
tokenizer = RegexpTokenizer("['a-zA-Z]+")

# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Tokenized document collection
newsline_documents = []

def tokenize(document):
    words = []

    for sentence in sent_tokenize(document):
        tokens = [lemmatizer.lemmatize(t.lower()) for t in tokenizer.tokenize(sentence)
                  if t.lower() not in stop_words]
        words += tokens

    return words
```
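For reference, here is a minimal sketch of the acquisition step described in the first bullet. It is not the exact code from the repository: the `reut2-*.sgm` file pattern, the latin-1 encoding, the signature regex, and the availability of `selected_categories` (the chosen labels, named as in the table above) are all assumptions.

```python
import glob
import re
from bs4 import BeautifulSoup
from numpy import zeros, float32

# Assumptions: data_folder contains the extracted reut2-*.sgm files and
# selected_categories is the list of chosen label names (prefixed as in the
# table above, e.g. 'to_earn', 'pl_usa'), both defined earlier in the notebook
signature_regex = re.compile(r'\s*reuter\s*$', re.IGNORECASE)

document_X = {}  # newsline id -> raw text
document_Y = {}  # newsline id -> binary category vector

for sgml_file in glob.glob(data_folder + 'reut2-*.sgm'):
    # latin-1 avoids decoding errors on the few non-ASCII bytes in the corpus
    with open(sgml_file, 'r', encoding='latin-1') as f:
        # html.parser is lenient enough to cope with the SGML markup
        soup = BeautifulSoup(f.read(), 'html.parser')

    for newsline in soup.find_all('reuters'):
        body = newsline.find('body')
        if body is None:
            continue

        # Drop all remaining tags and strip the ending signature
        text = signature_regex.sub('', body.get_text())

        # Multi-hot encoding of the categories attached to this newsline
        labels = zeros(len(selected_categories)).astype(float32)
        for prefix, tag in [('to_', 'topics'), ('pl_', 'places'), ('pe_', 'people'),
                            ('or_', 'orgs'), ('ex_', 'exchanges')]:
            container = newsline.find(tag)
            if container is None:
                continue
            for d in container.find_all('d'):
                name = prefix + d.get_text()
                if name in selected_categories:
                    labels[selected_categories.index(name)] = 1.0

        document_X[newsline['newid']] = text
        document_Y[newsline['newid']] = labels

# Tokenize every document with the function defined above
newsline_documents = [tokenize(document_X[key]) for key in document_X.keys()]
number_of_documents = len(newsline_documents)
```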
- Word2Vec training:

```python
from multiprocessing import cpu_count
from gensim.models import Word2Vec

# Create new Gensim Word2Vec model
# num_features (the word-vector dimensionality) is defined earlier in the notebook
w2v_model = Word2Vec(newsline_documents, size=num_features, min_count=1, window=10, workers=cpu_count())
w2v_model.init_sims(replace=True)
w2v_model.save(data_folder + 'reuters.word2vec')
```
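As a quick sanity check (not part of the original pipeline), the trained embeddings can be queried for nearest neighbours with the same, older Gensim API used above; `'oil'` is just an example query term:

```python
# Reload the persisted model and inspect a few nearest neighbours
w2v_model = Word2Vec.load(data_folder + 'reuters.word2vec')
print(w2v_model.most_similar('oil', topn=5))
```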
- Word2Vec conversion (each document becomes a `document_max_num_words` × `num_features` matrix of word vectors):

```python
from numpy import zeros, float32

num_categories = len(selected_categories)

# document_max_num_words and num_features are defined earlier in the notebook
X = zeros(shape=(number_of_documents, document_max_num_words, num_features)).astype(float32)
Y = zeros(shape=(number_of_documents, num_categories)).astype(float32)

empty_word = zeros(num_features).astype(float32)

for idx, document in enumerate(newsline_documents):
    for jdx, word in enumerate(document):
        if jdx == document_max_num_words:
            break
        else:
            if word in w2v_model:
                X[idx, jdx, :] = w2v_model[word]
            else:
                X[idx, jdx, :] = empty_word

for idx, key in enumerate(document_Y.keys()):
    Y[idx, :] = document_Y[key]
```
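The `X_train`/`X_test` split used in the training step is not shown in this excerpt; a 70/30 split with scikit-learn's `train_test_split` matches the sample counts reported in the training log (15104 train, 6474 validation), so the following sketch is a reasonable reconstruction rather than the original code:

```python
from sklearn.model_selection import train_test_split

# 70/30 split, consistent with the 15104/6474 samples shown in the log below
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
```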
- LSTM network training and evaluation:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, LSTM

model = Sequential()

model.add(LSTM(int(document_max_num_words*1.5), input_shape=(document_max_num_words, num_features)))
model.add(Dropout(0.3))
model.add(Dense(num_categories))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train model
model.fit(X_train, Y_train, batch_size=128, nb_epoch=5, validation_data=(X_test, Y_test))

# Evaluate model
score, acc = model.evaluate(X_test, Y_test, batch_size=128)

print('Score: %1.4f' % score)
print('Accuracy: %1.4f' % acc)
```
I've tested different neural models and got the best results with an LSTM layer of 150 cells (1.5 × the maximum number of words per document), a dropout of 0.3-0.4, and the Adam optimizer. A batch size below 64 noticeably slows down learning; in my experiments, 128 is an excellent value. After five epochs (which should be increased in a real-world scenario), I obtained these results:
```
Train on 15104 samples, validate on 6474 samples
Epoch 1/5
15104/15104 [==============================] - 38s - loss: 0.5610 - acc: 0.7168 - val_loss: 0.6814 - val_acc: 0.5828
Epoch 2/5
15104/15104 [==============================] - 40s - loss: 0.5994 - acc: 0.6597 - val_loss: 0.4435 - val_acc: 0.8230
Epoch 3/5
15104/15104 [==============================] - 37s - loss: 0.3949 - acc: 0.8477 - val_loss: 0.3765 - val_acc: 0.8557
Epoch 4/5
15104/15104 [==============================] - 38s - loss: 0.3567 - acc: 0.8707 - val_loss: 0.3415 - val_acc: 0.8735
Epoch 5/5
15104/15104 [==============================] - 37s - loss: 0.3408 - acc: 0.8761 - val_loss: 0.3269 - val_acc: 0.8837
6474/6474 [==============================] - 5s
Score: 0.3269
Accuracy: 0.8837
```
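Since the output layer is a sigmoid over `num_categories` independent units, the predicted labels are obtained by thresholding the probabilities; the 0.5 threshold below is a common default, not a value taken from the original experiment:

```python
# Predict category probabilities for the test set and threshold them
Y_prob = model.predict(X_test, batch_size=128)
Y_pred = (Y_prob >= 0.5).astype('float32')

# Names of the categories assigned to the first test document
predicted_labels = [selected_categories[i] for i in range(num_categories) if Y_pred[0, i] == 1.0]
print(predicted_labels)
```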
References
- Dataset: https://kdd.ics.uci.edu/databases/reuters21578/
- Dataset info: http://kdd.ics.uci.edu/databases/reuters21578/README.txt
- Gensim Word2Vec: https://radimrehurek.com/gensim/models/word2vec.html
- Word2Vec Tutorial: http://radimrehurek.com/2014/02/word2vec-tutorial/
- Word2Vec and Doc2Vec models: http://arxiv.org/pdf/1405.4053v2.pdf
- LSTM: http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf