1. Jack


    I tried your code on sentiment140 data set with 500,000 tweets for training and the rest for testing. I get about the same result as you on the validation set but when I use my generated model weights for testing, I get about 55% accuracy at best. Do you know what could be causing it?

    • In a deep model, the training set should be very large (sometimes as much as 95% of the dataset). I think you’re excluding many samples. What is your training accuracy? If it’s much higher than the validation accuracy, you’re overfitting. Try a larger training set and a smaller test set.

      • Jack

        Hey thanks for your reply! No, my training accuracy is not too high as compared to validation accuracy. Here is my testing code https://pastebin.com/cs3VJgeh
        I just noticed that I am also creating a new word2vec model when testing. Do you think that could be the problem? Should I save my word2vec model during training and reuse it when testing?

        • Hi,
          I cannot run your code right now; however, you must use the same gensim model. The word embeddings could be completely different because of the random initialization. Moreover, you can lose the correspondence between the word embeddings and the initial dictionary.

          • Jack

            Yeah, I figured the same. I will give this a shot and get back to you. Thanks!
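
The point made in this thread (embeddings trained twice do not match, so the training-time vectors have to be stored and reloaded) can be sketched without any dependencies. This is only an illustration: `train_embeddings` is a hypothetical stand-in that draws random vectors instead of training word2vec, and the JSON file replaces whatever format the real model is stored in. With gensim itself, the usual route is the model's `save` method together with `Word2Vec.load`.

```python
import json
import os
import random
import tempfile


def train_embeddings(vocabulary, vector_size, seed):
    """Hypothetical stand-in for word2vec training: one random vector per token."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(vector_size)]
            for token in vocabulary}


vocabulary = ["good", "bad", "movie"]

# Two independent "trainings" give unrelated vectors for the same tokens.
train_time = train_embeddings(vocabulary, vector_size=4, seed=1)
retrained = train_embeddings(vocabulary, vector_size=4, seed=2)
vectors_differ = train_time["good"] != retrained["good"]

# Persist the training-time vectors once, then reload them for testing.
path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
with open(path, "w") as handle:
    json.dump(train_time, handle)
with open(path) as handle:
    test_time = json.load(handle)

# The reloaded vectors are exactly the ones the classifier was trained on.
vectors_match = test_time == train_time
```

A classifier trained on `train_time` would see unrelated inputs if it were fed `retrained` at test time, which is consistent with the near-chance accuracy reported above.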

  2. Henson

    Hi Giuseppe,

    Thanks for making this great post. I have a question –

    On line 76, you create a word2vec object by passing the entire tokenized corpus through the function. Then, from line 119, you perform the train-test split.

    From my understanding, word2vec creates word vectors by looking at every word in the corpus (which we haven’t split yet). So in effect, your model could be biased as it has already “seen” the test data, because words that ultimately ended up in the test set influenced the ones in the training set.

    Can you explain why this is so? Please correct me if I’m wrong, but I’m a little confused here.

    • Hi Henson,

      thanks for your comment.

      The word2vec phase, in this case, is a preprocessing stage (like Tf-Idf), which transforms tokens into feature vectors. Of course, its complexity is higher and the cosine similarity of synonyms should be very high. Moreover, the vectors lend themselves to analysis with 1D convolutions when concatenated into sentences.

      However, the model itself (not word2vec) uses these feature vectors to determine whether a sentence has a positive or negative sentiment, and this result is determined by many factors which work at sentence level. The initial transformation can also be done in the same model (using an Embedding layer), but the process is slower.

      On the other hand, word2vec also has to “know” the test words, just like any other NLP method, in order to build a complete dictionary. If you exclude them, you can’t make predictions for never-seen words. You’re correct when you say that they influence each other, but the skip-gram model considers the context, not the final classification.

      In other words: “Paris”, “London” and “city” can be close (in terms of cosine similarity), but it doesn’t mean that they can directly affect the sentiment analysis. Maybe there’s a sentence saying: “I love the city of Paris” (positive sentiment) and another saying “I hate London. It’s a messy city” (negative sentiment) (I love both :D).
      So, I don’t think there is a bias. Maybe the model could be improved in terms of capacity, but it doesn’t show either a high bias or high variance.

      I hope my viewpoint was clear. It’s a very interesting conversation!
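
To make the cosine-similarity remark in this reply concrete, here is a minimal sketch. The three-dimensional vectors are made up for illustration and do not come from any trained model; only the formula is standard.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (hypothetical values, not real word2vec output).
toy_vectors = {
    "paris": [0.9, 0.8, 0.1],
    "london": [0.8, 0.9, 0.1],
    "love": [0.1, 0.2, 0.9],
}

sim_cities = cosine_similarity(toy_vectors["paris"], toy_vectors["london"])
sim_city_love = cosine_similarity(toy_vectors["paris"], toy_vectors["love"])
```

In this toy setup the two city vectors are nearly parallel while "paris" and "love" are not. Nothing in either number says whether a sentence containing those words is positive or negative; that decision is left to the classifier.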

  3. alessandro

    Hi Giuseppe,
    I have a question: is no pre-trained GloVe model used as the basis for the word2vec of the whole training set?

  4. Ayra

    Hi, I have a few questions, and since I am new to this they might be basic, so sorry in advance.
    1- I am getting “Memory error” on line 114. Is it a hardware issue, or am I doing something wrong in the code?
    2- Line number 33: what does it refer to?
    3- If I train my model with this dataset and then want to predict on a dataset of tweets that are related to some specific brand, would that still make sense in your opinion?
    4- If I want to add an LSTM (the output from the CNN goes into the LSTM for final classification), do you think it can improve the results? If yes, can you guide me a bit on how to extend your code to add that part? Thanks a lot!

    • Hi,
      1. The dataset is huge and you probably don’t have enough free memory. Try to reduce the train size. All my tests have been done with 32 GB.
      2. A folder where you want to store the Gensim model, so as to avoid retraining it every time.
      3. You should consider the words which are included in the production dataset. If they are very specific, it’s better to include a set of examples in the training set, or to use a pretrained Word2Vec/GloVe/FastText model (there are many based on the whole Wikipedia corpus).
      4. I’m still working on some improvements; however, in this case, the idea is to use the convolutions on the whole utterance (which is not treated as an actual sequence, even if a Conv1D formally operates on a sequence), trying to discover the “geometrical” relationships that determine the semantics. You can easily try adding an LSTM layer before the dense layers (without flattening). In this case, the input will have a shape (batch_size, timesteps, last_num_filters). The LSTM output can be processed by one or more dense layers.

      • Ayra

        Thanks a lot! I am going through the code line by line to understand it. I built my understanding of these concepts on images, so thinking in terms of 1D text is a bit new for me. I have a few more questions since I am a bit confused:
        1- As far as I can understand, the word2vec model is trained up to about line 87; after that, the separation into training and test data is for the CNN. Is my understanding right?

        2- I wanted to run the code and see what exactly X_train looks like, but I couldn’t run it, so from a dry run I am assuming that it’s a matrix containing indexes, words and their corresponding vectors. If my understanding is right, it means the CNN takes 15 words as input each time (which might or might not be the whole tweet), so when I make predictions, how will it make sure that the prediction is for one whole tweet?

        3- I was thinking of also using another dataset, similar to the one I want to make predictions for (e.g. phones), for training word2vec, since it doesn’t need labelled data and it will probably just enlarge the dictionary. But I am concerned that the CNN or CNN+LSTM won’t be able to learn anything, since I couldn’t find any labelled dataset related to phones. If someone says their camera is 2MP and someone else says 30MP, it won’t be able to tell that the 2MP one is probably negative sentiment and the 30MP one is positive. Do you think I should try to make predictions only if I have a labelled dataset for that particular domain?

        4- In an LSTM, the timesteps value, as I understand it, is how many previous steps you want to consider before making the next prediction, which ideally is all the words of one tweet (to see the whole context of the tweet). So in this case would it be 1, since the CNN takes 15 words, which is almost one tweet? Last_num_filters, I think, is based on the feature maps or filters used in the CNN, so if, for example, your code used 8, would this be 8?

        Sorry for the really lengthy post; I hope I make at least some sense.
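
The shape (batch_size, timesteps, last_num_filters) mentioned in the reply above can be worked out by hand. The sketch below does that arithmetic in plain Python, using the standard output-length rules for 1D convolutions with "same" and "valid" padding. The layer sizes are assumptions chosen for illustration (15 tokens per tweet as in the question, two hypothetical convolution layers); they are not taken from the post's code.

```python
import math


def conv1d_output_length(timesteps, kernel_size, padding="valid", strides=1):
    """Length of the sequence axis after one 1D convolution."""
    if padding == "same":
        return math.ceil(timesteps / strides)
    if padding == "valid":
        return (timesteps - kernel_size) // strides + 1
    raise ValueError("padding must be 'same' or 'valid'")


def lstm_input_shape(batch_size, timesteps, conv_layers):
    """conv_layers is a list of (num_filters, kernel_size, padding) tuples."""
    last_num_filters = None
    for num_filters, kernel_size, padding in conv_layers:
        timesteps = conv1d_output_length(timesteps, kernel_size, padding)
        last_num_filters = num_filters
    return (batch_size, timesteps, last_num_filters)


# Assumed values: 15 tokens per tweet, two hypothetical convolution layers.
shape = lstm_input_shape(
    batch_size=32,
    timesteps=15,
    conv_layers=[(32, 3, "same"), (8, 3, "valid")],
)
```

Under these assumptions the LSTM would still receive one step per remaining token position rather than a single step, and the last dimension equals the number of filters in the final convolution.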

  5. kaz

    This post is really interesting!
    I am a beginner in the field of machine learning and I’ve been trying to understand this code. I would like to know how we can predict the sentiment of a fresh tweet/statement using this model.
    It’ll be really helpful if you could attach the code too!


    • Hi,
      thanks a lot for your comment! Of course, you can work with new tweets. What you should do is similar to this part:

      for i, index in enumerate(indexes):
          for t, token in enumerate(tokenized_corpus[index]):
              # Truncate tweets longer than max_tweet_length
              if t >= max_tweet_length:
                  break

              # Skip tokens that have no word vector
              if token not in X_vecs:
                  continue

              if i < train_size:
                  X_train[i, t, :] = X_vecs[token]
              else:
                  X_test[i - train_size, t, :] = X_vecs[token]


      In other words, you first need to tokenize the tweet, then look up the word vectors corresponding to each token.
      However, I’m planning to post a new article based on FastText and I’m going to add a specific section for querying the model.
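
The two steps named in the reply above (tokenize, then look up the vector of each token) can be sketched for a single new tweet as follows. This is a dependency-free illustration under several assumptions: `tweet_to_batch` is a hypothetical helper, the whitespace tokenizer is a placeholder for whichever tokenizer was used at training time, nested lists stand in for the NumPy array, and `toy_vectors` replaces the trained word vectors.

```python
def tweet_to_batch(tweet, vectors, max_tweet_length, vector_size):
    """Turn one raw tweet into a (1, max_tweet_length, vector_size) nested list."""
    # Placeholder tokenizer (assumption): reuse the training tokenizer in practice.
    tokens = tweet.lower().split()

    # Start from zeros so short tweets and unknown tokens stay zero-padded.
    sample = [[0.0] * vector_size for _ in range(max_tweet_length)]
    for t, token in enumerate(tokens[:max_tweet_length]):
        if token in vectors:
            sample[t] = list(vectors[token])
    return [sample]  # a batch containing a single tweet


# Toy stand-in for the trained word vectors (hypothetical values).
toy_vectors = {"love": [0.5, 0.1], "paris": [0.2, 0.9]}
batch = tweet_to_batch("I love Paris", toy_vectors, max_tweet_length=4, vector_size=2)
```

Converting `batch` to an array of the same dtype as the training data should give an input that the trained model can score, provided the word vectors are the ones used during training.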

  6. ahmed

    error in line 116
    MemoryError Traceback (most recent call last)
    in ()
    2 indexes = set(np.random.choice(len(tokenized_corpus), train_size + test_size, replace=False))
    —-> 4 X_train = np.zeros((train_size, max_tweet_length, vector_size), dtype=K.floatx())
    5 Y_train = np.zeros((train_size, 2), dtype=np.int32)
    6 X_test = np.zeros((test_size, max_tweet_length, vector_size), dtype=K.floatx())

    Please help.
    I tried to reduce the training and test data to 750000 or even 100000 and it did not work.

    • Hi,
      unfortunately, I can’t help you. You don’t have enough free memory. Try to reset the notebook (if using Jupyter) after reducing the number of samples. You can also reduce the max_tweet_length and the vector size. Consider that I worked with 32 GB but many people successfully trained the model with 16 GB.
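
A rough estimate shows why the `np.zeros` call in the traceback above can exhaust memory. The array holds train_size × max_tweet_length × vector_size values, and `K.floatx()` defaults to float32, i.e. 4 bytes per value. The sketch below assumes 15 tokens per tweet (the figure quoted earlier in the comments) and a hypothetical vector size of 512; substitute the values from your own run.

```python
def array_size_gib(n_samples, max_tweet_length, vector_size, bytes_per_value=4):
    """Size in GiB of a dense (n_samples, max_tweet_length, vector_size) array."""
    n_values = n_samples * max_tweet_length * vector_size
    return n_values * bytes_per_value / 1024 ** 3


# Assumed settings: 15 tokens per tweet and 512-dimensional vectors (hypothetical).
for n_samples in (1_000_000, 750_000, 100_000):
    size = array_size_gib(n_samples, max_tweet_length=15, vector_size=512)
    print(f"{n_samples:>9,} samples: {size:6.2f} GiB")
```

The estimate covers X_train only; the test array, the tokenized corpus and the word2vec model come on top of it, so reducing any of the three factors lowers the requirement proportionally.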

  7. Ahmed

    Hi, it worked with 100,000 samples, but it is very slow. I have another question: how can I feed in a new review to get its sentiment prediction?
