Word2Vec (https://code.google.com/archive/p/word2vec/) offers a very interesting alternative to classical NLP based on term-frequency matrices. In particular, as each word is embedded into a high-dimensional vector, it's possible to consider a sentence as a sequence of points that determines an implicit geometry. For this reason, the idea of applying 1D convolutional classifiers (usually very effective with images) became a concrete possibility.
As you know, a convolutional network trains its kernels so as to capture, in the first layers, coarse-grained features (like edge orientations) and, in the deeper layers, more and more detailed elements (like eyes, wheels, hands, and so forth). In the same way, a 1D convolution works on one-dimensional vectors (in general, temporal sequences), extracting pseudo-geometric features.
The rationale is provided by the Word2Vec algorithm: as the vectors are "grouped" according to a semantic criterion, so that two similar words have very close representations, a sequence can be considered as a piecewise function whose "shape" has a strong relationship with its semantic components. In the previous image, two sentences are considered as vectorial sums:
- v1: “Good experience”
- v2: “Bad experience”
As it's possible to see, the resulting vectors have different directions, because the words "good" and "bad" have almost opposite representations. This condition allows "geometrical" language manipulations that are quite similar to what happens in an image convolutional network, and makes it possible to outperform standard bag-of-words methods (like TF-IDF).
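The effect can be sketched with a toy example. The three-dimensional embeddings below are hand-picked, hypothetical values (a real Word2Vec model would produce much higher-dimensional vectors); the point is only to show that summing the word vectors yields sentence vectors pointing in clearly different directions:

```python
import numpy as np

# Hypothetical, hand-picked 3D embeddings (purely illustrative;
# a real Word2Vec model would use, e.g., 512 components)
embeddings = {
    "good":       np.array([0.9, 0.1, 0.2]),
    "bad":        np.array([-0.9, -0.1, 0.2]),
    "experience": np.array([0.1, 0.8, 0.3]),
}

def sentence_vector(tokens):
    # The "vectorial sum" described above: the sentence is the sum of its word vectors
    return np.sum([embeddings[t] for t in tokens], axis=0)

def cosine(a, b):
    # Cosine similarity: 1 means same direction, 0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector(["good", "experience"])
v2 = sentence_vector(["bad", "experience"])

print(cosine(v1, v2))  # well below 1: the two sentences point in different directions
```

With opposite embeddings for "good" and "bad", the two sentence vectors end up nearly orthogonal even though they share a word.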
To test this approach, I've used the Twitter Sentiment Analysis Dataset (http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip), which is made up of about 1,400,000 labeled tweets. The dataset is quite noisy, and the overall validation accuracy of many standard algorithms plateaus at about 75%.
For the Word2Vec there are some alternative scenarios:
- Gensim (the best choice in the majority of cases) – https://radimrehurek.com/gensim/index.html
- Custom implementations based on NCE (Noise Contrastive Estimation) or hierarchical softmax. They are quite easy to implement with TensorFlow, but they require extra effort that is often unnecessary
- An initial embedding layer. This approach is the simplest; however, training performance is worse because the same network has to learn good word representations and, at the same time, optimize its weights to minimize the output cross-entropy.
I've preferred to train a Gensim Word2Vec model with a vector size of 512 and a window of 10 tokens. The training set is made up of 1,000,000 tweets and the test set of 100,000 tweets. Both sets are shuffled before each epoch. As the average length of a tweet is about 11 tokens (with a maximum of 53), I've decided to fix the maximum length at 15 tokens (of course, this value can be increased, but for the majority of tweets the convolutional network input would be padded with many blank vectors).
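The padding step can be sketched as follows. Here a plain dictionary of random vectors stands in for the trained Gensim model's vocabulary (`model.wv` in Gensim); the tokenization and out-of-vocabulary handling are assumptions, not the repository's exact code:

```python
import numpy as np

VECTOR_SIZE = 512   # as in the Gensim model described above
MAX_LENGTH = 15     # fixed maximum number of tokens per tweet

# Stand-in for the trained model's vocabulary (model.wv in Gensim):
# random vectors are used here purely for illustration
rng = np.random.default_rng(0)
wv = {w: rng.standard_normal(VECTOR_SIZE) for w in ("this", "movie", "was", "great")}

def tweet_to_matrix(tokens, wv, max_length=MAX_LENGTH, vector_size=VECTOR_SIZE):
    # Truncate to max_length, look up each known token,
    # and leave blank (zero) vectors for the remaining positions
    matrix = np.zeros((max_length, vector_size), dtype=np.float32)
    for i, token in enumerate(tokens[:max_length]):
        if token in wv:
            matrix[i] = wv[token]
    return matrix

x = tweet_to_matrix("this movie was great".split(), wv)
print(x.shape)  # (15, 512): the input expected by the 1D convolutional network
```

A 4-token tweet therefore produces a (15, 512) matrix whose last 11 rows are blank vectors, which is exactly the padding overhead mentioned above.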
In the following figure, there’s a schematic representation of the process starting from the word embedding and continuing with some 1D convolutions:
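The exact architecture is in the linked repository; the following is only a minimal Keras sketch of the kind of 1D convolutional stack the figure depicts. The number of layers, filter counts, and kernel sizes here are assumptions chosen for illustration:

```python
from tensorflow.keras.layers import Conv1D, Dense, Dropout, Flatten, Input
from tensorflow.keras.models import Model

# Input: a tweet as a (15, 512) matrix of word vectors
inputs = Input(shape=(15, 512))

# Stacked 1D convolutions extracting pseudo-geometric features
# from the sequence of word vectors
x = Conv1D(32, kernel_size=3, activation='relu')(inputs)
x = Conv1D(64, kernel_size=3, activation='relu')(x)
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)

# Two output classes: positive / negative sentiment
outputs = Dense(2, activation='softmax')(x)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```

Each `Conv1D` slides its kernels along the 15-token axis, so the filters respond to local "shapes" in the embedded sentence, analogously to 2D kernels responding to local patterns in an image.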
The whole code (copied into this GIST and also available in the repository: https://github.com/giuseppebonaccorso/twitter_sentiment_analysis_word2vec_convnet) is:
The training has been stopped by the early stopping callback after the twelfth iteration, with a validation accuracy of about 79.4% and a validation loss of 0.44.
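For reference, this is how such a callback can be configured in Keras; the monitored quantity and the patience value below are assumptions (the repository has the exact setup):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once the validation loss stops improving,
# keeping the weights of the best epoch seen so far
early_stopping = EarlyStopping(monitor='val_loss',
                               patience=2,
                               restore_best_weights=True)

# Passed to training as: model.fit(..., callbacks=[early_stopping])
```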
Possible improvements and/or experiments I’m going to try are:
- Different word vector sizes (I've already tried with 128 and 256, but I'd like to save more memory)
- Embedding layer
- Average and/or max pooling to reduce the dimensionality
- Different architectures
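Regarding the pooling experiment in the list above, the dimensionality reduction it provides can be illustrated in plain NumPy (the feature-map size here is hypothetical):

```python
import numpy as np

# A hypothetical convolutional feature map: 13 time steps x 32 filters
rng = np.random.default_rng(1)
features = rng.standard_normal((13, 32))

# Global max pooling keeps, for each filter, only its strongest
# activation along the sequence, reducing (13, 32) to (32,)
pooled = features.max(axis=0)
print(pooled.shape)  # (32,)
```

In Keras the same operation is provided by pooling layers placed after the convolutions, which shrink the tensor fed to the dense layers and thus the number of trainable weights.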
The previous model has been trained on a GTX 1080 in about 40 minutes.
Reuters-21578 is a collection of about 20K news-lines (see the reference for more information, downloads, and the copyright notice), structured using SGML and categorized with 672 labels divided into five main categories. However, most of them are unused and, looking at the distribution, it's possible to notice a complete lack of homogeneity.
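The lack of homogeneity is easy to verify by counting label frequencies. The labels and counts below are made-up toy data, not the real Reuters-21578 distribution; the point is only the counting technique:

```python
from collections import Counter

# Hypothetical per-document topic labels (in the real corpus they are
# attached to each news-line via SGML tags)
labels = (["earn"] * 40 + ["acq"] * 25 + ["crude"] * 5 +
          ["grain"] * 3 + ["ship"] * 2)

counts = Counter(labels)
total = sum(counts.values())
for topic, n in counts.most_common():
    print(f"{topic:>6}: {n:3d} ({100 * n / total:.0f}%)")
# A few topics dominate while most appear rarely: the lack of
# homogeneity mentioned above
```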