BBC News dataset (available for download in Insight Project Resources website) is made up of 2225 newslines classified into 5 categories (Politics, Sport, Entertainment, Tech, Business) and, similarly to Reuters-21578, it can be adopted in order to test both the efficacy and the efficiency of different classification strategies.
In the repository: https://github.com/giuseppebonaccorso/bbc_news_classification_comparison, I’ve committed a Jupyter (IPython) notebook (based on Scikit-Learn, NLTK, Gensim, Keras (with Theano or Tensorflow)) where I’ve collected some experiments aimed at comparing four different algorithms:
- Multinomial Naive-Bayes with Count (TF) vectorizer
- Multinomial Naive-Bayes with TF-IDF vectorizer
- SVM (linear and kernelized) with Doc2Vec (Gensim-based) vectorization
- MLP (Keras-based) with Doc2Vec vectorization
(Every experiment has been performed on tokens without stop-words and processed with WordNet lemmatization).
As expected (thanks to several research projects – see references for further information), Naive-Bayes performs even better than the other strategies, in particular comparing its simplicity and naiveness with the complexity of SVM and neural networks. Intuitively, a Naive-Bayes classifier determines a decision threshold according to observed data likelihood and, in particular, adopting a multinomial distribution allows to compute the probability of a certain sequence of tokens to belong to a specific class (under the assumption of conditional independence), starting from single likelihoods which are directly connected to token frequencies. It simply means that a machine-learning algorithm can learn in a way that is not so different to the criterion adopted by human beings when they read a news for the first time and they’re asked to classify it.
For example, if we start reading: “Football club…”, the probability of class “Sports” is immediately pushed up by the likelihood of tokens like “football”, “club”, etc., however, if we continue reading: “…stock options, trading, market, stock exchange”, we’re driven to think that the main topic is likely to be related to the class “Business” even if there’s something belonging to “Sport” that reduces the overall probability.
On the other side, Word2Vec (and Doc2Vec) algorithms adopt neural networks in order to learn the inner relationships among words and compute a vector space with a metric isomorphic to some sort of “probabilistic semantic”. In other words, the assumption is that all texts are structured with an intrinsic correlation among close words. In this way, it’s possible to learn a model which minimizes a loss function that represents the “distance” among tokens and the same metric used to measure the distance between two vectors (points in a hyperspace), does the same with the semantic induced by the corpus.
The summary of performance metrics (obtained after a Grid Search of optimal hyperparameters, even if, for example, MLPs are very sensitive to their architecture, activation functions, optimizers, etc., so it’s always a good idea trying with many algorithms, batch sizes and epochs, so to find out which combination best matches the requirements) is:
Multinomial Naive-Bayes with Count (TF) vectorizer:
Precision score: 0.987433 (micro) / 0.986406 (macro) Recall score: 0.987433 (micro) / 0.987845 (macro) F1 score: 0.987433 (micro) / 0.986997 (macro)
Multinomial Naive-Bayes with TF-IDF vectorizer:
Precision score: 0.980251 (micro) / 0.978598 (macro) Recall score: 0.980251 (micro) / 0.980466 (macro) F1 score: 0.980251 (micro) / 0.979228 (macro)
SVM (linear and kernelized) with Doc2Vec (Gensim-based) vectorization:
Doc2Vec models are sensitive to many parameters and a long time is needed to perform a deep grid search. However, comparing the results, I think that SVMs and MLPs perform in the comparable ways. In  you can find a theoretical proof of Naive-Bayes optimality under certain conditions. However, on the basis of my own experience, thanks to their high non-linearity, kernelized SVMs or MLPs show a more accurate ability to generalize when the resulting conditional dependence among parameters is higher.
Precision score: 0.987433 (micro) / 0.987537 (macro) Recall score: 0.987433 (micro) / 0.987811 (macro) F1 score: 0.987433 (micro) / 0.987666 (macro)
MLP (Keras-based) with Doc2Vec (Gensim-based) vectorization:
Precision score: 0.985637 (micro) / 0.984715 (macro) Recall score: 0.985637 (micro) / 0.986193 (macro) F1 score: 0.985637 (micro) / 0.985397 (macro)
- D. Greene and P. Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”, Proc. ICML 2006
- Metsis, Vangelis, Ion Androutsopoulos, and Georgios Paliouras. “Spam Filtering with Naive Bayes-Which Naive Bayes?”, CEAS, 27–28, 2006
- Zhang, Harry. “The Optimality of Naive Bayes.”, AA 1, no. 2 (2004): 3
- Le, Quoc V., and Tomas Mikolov. “Distributed Representations of Sentences and Documents”, ICML, 14:1188–1196, 2014
- Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space”, arXiv Preprint arXiv:1301.3781, 2013
- Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. “Distributed Representations of Words and Phrases and Their Compositionality”, Advances in Neural Information Processing Systems, 3111–3119, 2013
- Bonaccorso G., Reuters-21578-Classification using Word2Vec and LSTM
Twitter Sentiment Analysis with Gensim Word2Vec and Keras Convolutional Networks – Giuseppe Bonaccorso
Word2Vec ( https://code.google.com/archive/p/word2vec/) offers a very interesting alternative to classical NLP based on term-frequency matrices. In particular, as each word is embedded into a high-dimensional vector, it’s possible to consider a sentence like a sequence of points that determine an implicit geometry.