1. Jack


    I tried your code on the sentiment140 dataset, with 500,000 tweets for training and the rest for testing. I get about the same result as you on the validation set, but when I use my generated model weights for testing, I get about 55% accuracy at best. Do you know what could be causing it?

    • In a deep model, the training set should be very large (sometimes even 95% of the data). I think you’re excluding many elements. What is your training accuracy? If it’s much higher than the validation accuracy, you’re overfitting. Try a larger training set and a smaller test set.

      • Jack

        Hey, thanks for your reply! No, my training accuracy is not too high compared to the validation accuracy. Here is my testing code: https://pastebin.com/cs3VJgeh
        I just noticed that I am also creating a new word2vec model when testing. Do you think that could be a problem? Should I try to save my word2vec model while training and reuse it when testing?

        • Hi,
          I cannot test your code right now; however, you must use the same gensim model. The word embeddings could be completely different due to the random initialization. Moreover, you can lose the correspondence between the word embeddings and the initial dictionary.

          • Jack

            Yeah, I figured the same. I will give this a shot and get back to you. Thanks!

  2. Henson

    Hi Giuseppe,

    Thanks for making this great post. I have a question –

    On line 76, you create a word2vec object by putting in the entire tokenized corpus through the function. Then, from line 119 you perform the train-test split.

    From my understanding, word2vec creates word vectors by looking at every word in the corpus (which we haven’t split yet). So in effect, your model could be biased as it has already “seen” the test data, because words that ultimately ended up in the test set influenced the ones in the training set.

    Can you explain why this is so? Please correct me if I’m wrong, but I’m a little confused here.

    • Hi Henson,

      thanks for your comment.

      The word2vec phase, in this case, is a preprocessing stage (like tf-idf) which transforms tokens into feature vectors. Of course, its complexity is higher, and the cosine similarity of synonyms should be very high. Moreover, when concatenated into sentences, the vectors lend themselves to being analyzed with 1D convolutions.

      However, the model itself (not word2vec) uses these feature vectors to determine whether a sentence has a positive or negative sentiment, and this result is determined by many factors which work at the sentence level. The initial transformation can also be done in the same model (using an Embedding layer), but the process is slower.

      On the other hand, word2vec also has to “know” the test words, just like any other NLP method, in order to build a complete dictionary. If you exclude them, you can’t predict with never-seen words. You’re correct when you say that they influence each other, but the skip-gram model considers the context, not the final classification.

      In other words: “Paris”, “London” and “city” can be close (in terms of cosine similarity), but it doesn’t mean that they can directly affect the sentiment analysis. Maybe there’s a sentence saying: “I love the city of Paris” (positive sentiment) and another saying “I hate London. It’s a messy city” (negative sentiment) (I love both :D).
      So, I don’t think there’s a bias. Maybe the model could be improved in terms of capacity, but it doesn’t show either high bias or high variance.
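
      As a side note, the cosine similarity mentioned above can be computed directly from two word vectors (a generic numpy sketch; the three vectors are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical embeddings: 'Paris' and 'London' point in similar directions
paris = np.array([0.9, 0.8, 0.1])
london = np.array([0.85, 0.75, 0.2])
banana = np.array([0.1, 0.0, 0.95])

print(cosine_similarity(paris, london))  # close to 1
print(cosine_similarity(paris, banana))  # much smaller
```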

      I hope my viewpoint was clear. It’s a very interesting conversation!

  3. alessandro

    Hi Giuseppe,
    I have a question: isn’t a pre-trained GloVe model used as a starting point for creating the word2vec of the whole training set?

  4. Ayra

    Hi, I have a few questions, and since I am new to this they might be basic, so sorry in advance.
    1- I am getting a “Memory error” on line 114. Is it a hardware issue, or am I doing something wrong in the code?
    2- What does line number 33 refer to?
    3- If I train my model with this dataset and then want to predict for a dataset which still consists of tweets, but related to some specific brand, would it still make sense in your opinion?
    4- If I want to add an LSTM (the output from the CNN goes into the LSTM for final classification), do you think it can improve the results? If yes, can you guide me a bit on how to extend your code to add that part? Thanks a lot!

    • Hi,
      1. The dataset is huge and you probably don’t have enough free memory. Try to reduce the train size. All my tests have been done with 32 GB.
      2. A folder where you want to store the Gensim model, so as to avoid retraining it every time.
      3. You should consider the words which are included in the production dataset. If they are very specific, it’s better to include a set of examples in the training set, or to use a Word2Vec/GloVe/FastText pretrained model (there are many based on the whole Wikipedia corpus).
      4. I’m still working on some improvements; however, in this case, the idea is to use the convolutions on the whole utterance (which is not treated as an actual sequence, even if a Conv1D formally operates on a sequence), trying to discover the “geometrical” relationships that determine the semantics. You can easily try adding an LSTM layer before the dense layers (without flattening). In this case, the input will have a shape (batch_size, timesteps, last_num_filters). The LSTM output can be processed by one or more dense layers.
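
      A minimal sketch of point 4, assuming TensorFlow/Keras and the shapes used in the post (15 timesteps, 512-dimensional word vectors); the filter and unit counts are arbitrary choices:

```python
from tensorflow import keras
from tensorflow.keras import layers

max_tweet_length = 15
vector_size = 512

model = keras.Sequential([
    keras.Input(shape=(max_tweet_length, vector_size)),
    # 1D convolutions over the sequence of word vectors
    layers.Conv1D(32, kernel_size=3, activation='relu'),
    layers.Conv1D(32, kernel_size=3, activation='relu'),
    # No Flatten: the LSTM consumes (batch_size, timesteps, last_num_filters)
    layers.LSTM(64),
    # One or more dense layers process the LSTM output
    layers.Dense(32, activation='relu'),
    layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
print(model.output_shape)  # (None, 2)
```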

      • Ayra

        Thanks a lot! I am trying to go through the code line by line to understand it, and I had built my concepts in terms of images, so understanding 1D text is a bit new for me. So I have a few more questions, since I am a bit confused:
        1- As far as I can understand, the word2vec model is trained up to about line 87; after that, the separation of training and test data is for the CNN. Is my understanding right?

        2- I wanted to run the code and see what exactly X_train looks like, but I couldn’t run it, so I am assuming from a dry run that it’s a matrix containing indexes, words, and their corresponding vectors. If my understanding is right, then it means the CNN takes 15 words as input each time (which might or might not be the whole tweet), so when I make predictions, how will it make sure that the prediction is for one whole tweet?

        3- I was thinking of also using another dataset similar to the one I want to make predictions for (e.g. phones) for training word2vec, since it doesn’t need labelled data and will probably just increase the dictionary. But I am concerned that the CNN or CNN+LSTM won’t be able to learn anything, since I couldn’t find any labelled dataset related to phones; so if someone says their camera is 2MP vs someone who says 30MP, it won’t be able to differentiate that the 2MP one is probably negative sentiment and the 30MP one positive. Do you think that I should try to make predictions only if I have a labelled dataset for that particular domain?

        4- In an LSTM, the timesteps parameter, as I understand it, is how many previous steps you would want to consider before making the next prediction, which ideally is all the words of one tweet (to see the whole context of the tweet); so in this case would it be 1, since the CNN takes 15 words, which is almost one tweet? Last_num_filters, I think, is based on the feature maps or filters that you used in the CNN, so e.g. if in your code you used 8, would this be 8?

        Sorry for the really lengthy post, and I hope I make some sense at least.

  5. kaz

    This post is really interesting!
    I am a beginner in the field of machine learning, and I’ve been trying to understand this code. I would like to know how we can predict the sentiment of a fresh tweet/statement using this model.
    It’ll be really helpful if you could attach the code too!


    • Hi,
      thanks a lot for your comment! Of course, you can work with new tweets. What you should do is similar to this part:

      for i, index in enumerate(indexes):
          for t, token in enumerate(tokenized_corpus[index]):
              if t >= max_tweet_length:
                  break
              if token not in X_vecs:
                  continue
              if i < train_size:
                  X_train[i, t, :] = X_vecs[token]
              else:
                  X_test[i - train_size, t, :] = X_vecs[token]

      In other words, you first need to tokenize the tweet, then look up the word vectors corresponding to each token.
      However, I’m planning to post a new article based on FastText and I’m going to add a specific section for querying the model.

  6. ahmed

    error in line 116
    MemoryError Traceback (most recent call last)
    in ()
    2 indexes = set(np.random.choice(len(tokenized_corpus), train_size + test_size, replace=False))
    ----> 4 X_train = np.zeros((train_size, max_tweet_length, vector_size), dtype=K.floatx())
    5 Y_train = np.zeros((train_size, 2), dtype=np.int32)
    6 X_test = np.zeros((test_size, max_tweet_length, vector_size), dtype=K.floatx())

    please help
    I tried to reduce the training and test data to 750000 or even 100000 and it did not work

    • Hi,
      unfortunately, I can’t help you directly. You don’t have enough free memory. Try to reset the notebook (if using Jupyter) after reducing the number of samples. You can also reduce max_tweet_length and the vector size. Consider that I worked with 32 GB, but many people successfully trained the model with 16 GB.

  7. Ahmed

    Hi .. it worked with 100,000 samples but very slowly .. I have another question .. how can I feed a new review to get its sentiment prediction?

  8. zahra

    I want to train your model on a NON-ENGLISH language, so I have a couple of questions. I would appreciate it if you could help me.
    1- When I trained your model on my own NON-ENGLISH corpus, I got a unicode error, so I tried to fix it with utf8, but it still doesn’t work. Do you have any idea how to solve it?

    2- I want to know whether your word2vec model works properly on my own corpus or not. Is there any code to show me the word2vec output vectors?
    I am going to use word2vec.save('file.model'), but when I open the file, its content doesn’t seem meaningful and doesn’t contain any visible vectors.

    Thanks a lot

    • Hi,
      Word2Vec works with any language. If you are experiencing issues, they are probably due to the charset. Unfortunately, I can’t help you further, but encode('utf8') and decode('utf8') on the strings should solve the problem. Alternatively, try other charsets, like ISO-8859-1.

      The model is binary, so it doesn’t make sense to try to read it directly. If you want to test it (considering the variable names used in the example), you can try X_vecs.similarity('word1', 'word2') with a couple of test words (e.g. 'king' and 'queen'). Instead, the word vectors can be retrieved as in a standard dictionary: X_vecs['word']. Clearly, check whether the term exists before trying to read its vector.

  9. zahra

    Hi, i have some questions.

    - In your code, when I write print(tokens) to see the result of the tokenization process, I see some strange results. Take this sentence, for example:

    .. Omgaga. Im sooo im gunna CRy. I’ve been at this dentist since 11.. I was suposed 2 just get a crown put on (30mins)….

    as you know, this is a tweet from your corpus, and here is the result:

    ['omgag', 'im', 'sooo', 'im', 'gunn', 'cry', 'i', 've', 'been', 'at', 'thi', 'dent', 'sint', '11', 'i', 'was', 'supos', '2', 'just', 'get', 'a', 'crown', 'put', 'on', '30mins']

    and when I use nltk for tokenization, the result changes; here is the result with nltk:

    ['..', 'Omgaga', '.', 'Im', 'sooo', 'im', 'gunna', 'CRy', '.', 'I', "'ve", 'been', 'at', 'this', 'dentist', 'since', '11..', 'I', 'was', 'suposed', '2', 'just', 'get', 'a', 'crown', 'put', 'on', '(', '30mins', ')', '…', '.']

    and this is the related code:

    from nltk.tokenize import word_tokenize

    I can’t understand why they are different, and I’m really confused!

    2- Is it that important to have tokenization and stemming as preprocessors in sentiment analysis? I mean, can I train my model without these preprocessors; in other words, feed the corpus directly to the word2vec model so that the result can be passed on for training. Is that possible?

    3- Since I’m not that familiar with this field, I want to know: after training the model, is there any code to take my sentences as input and show me the polarity (negative or positive) as output? To test the model, I mean.

    Thanks a lot

    • When using Word2Vec, you can avoid stemming (increasing the dictionary size but reducing the generality of the words), but tokenizing is always necessary (if you don’t do it explicitly, it will be done by the model). The differences are due to different approaches (for example, one tokenizer can strip all punctuation while another can keep '…' because of its potential meaning). NLTK offers different solutions, and I invite you to check the documentation (this is not advertising, but if you are interested in an introduction to NLP, there are a couple of chapters in my book Machine Learning Algorithms).

      A quick solution to get the polarity is to use the VADER Sentiment Analyzer (http://www.nltk.org/howto/sentiment.html), which is a rule-based algorithm. Otherwise, you must:
      1. Tokenize the sentence (with the same method employed in the training phase)
      2. Pad or truncate it (see the code for an example)
      3. Create an array containing the vectors for each token
      4. Use the method predict(…) on the Keras model to get the output
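
      The four steps above might be sketched as follows (numpy only; the real tokenizer, word vectors, and trained Keras model come from the training phase, so X_vecs is mocked with random vectors here and the final predict call is left as a comment):

```python
import numpy as np

max_tweet_length = 15
vector_size = 512

# Stand-in for the trained gensim vectors (random, for illustration only)
X_vecs = {w: np.random.uniform(-0.5, 0.5, vector_size)
          for w in ['i', 'love', 'this', 'city']}

# 1. Tokenize with the SAME method used during training (here: a naive split)
tokens = 'i love this city'.lower().split()

# 2. + 3. Pad/truncate to max_tweet_length and fill the vector array
x = np.zeros((1, max_tweet_length, vector_size), dtype=np.float32)
for t, token in enumerate(tokens[:max_tweet_length]):
    if token in X_vecs:
        x[0, t, :] = X_vecs[token]

# 4. Query the trained model (requires the fitted Keras model):
# probabilities = model.predict(x)  # shape (1, 2): [P(negative), P(positive)]
print(x.shape)  # (1, 15, 512)
```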

  10. bahara

    hi, I ran your code on my corpus and everything was OK, but I want to know how I should predict the sentiment for a new tweet, say ‘im really hungry’ for example. Since I’m new to this field, would you please help me by adding the related code for prediction? thanks

    • Hi,
      I’ve answered this question in other comments. However, you need to tokenize your sentence, create an empty array with the maximum length employed during the training, and then set each word vector (X_vecs[word]) if the word is present, or keep it null if it is not in the dictionary. At this moment, I’m quite busy, but I’m going to create an explicit example soon.

      • bahara

        Really sorry, but I forgot to ask you whether it is right to use model.predict() or not; I mean, to use it after those steps that you recommended before.

          • bahara

            hi, with your instructions I wrote the code for prediction, but I faced a strange problem: since your max_tweet_length is 15, when I get a sentence of length 3, say “i’m too hungry” for example, I face this error:

            ValueError: Error when checking input: expected conv1d_1_input to have shape (15, 512) but got array with shape (3, 512)

            To my understanding from the net, it might be something related to the input shape (line 143), but I really don’t know how I can fix it. I would appreciate it if you could help me.
            thanks a lot

  11. johm

    Hi, I want to add a neutral sentiment to your code. I added neutral tweets with a specific label, 2, and changed the related code in this way:

    if i < train_size:
        if labels[index] == 0:
            Y_train[i, :] = [1.0, 0.0]
        elif labels[index] == 1:
            Y_train[i, :] = [0.0, 1.0]
        else:
            Y_train[i, :] = [1.0, 1.0]

    and I did the same for the testing. All I did was change what I said. Is it right?

    • Hi, you also need to modify the output layer of the network. Right now it’s a softmax, and [1, 1] cannot be accepted. Try using a sigmoid layer instead. Alternatively, you need to assign [0.5, 0.5] to the neutral sentiment. Softmax must represent a valid probability distribution (so the sum must always be equal to 1).

      • john

        hi, thank you for your clear explanation. I did what you said:
        I assigned it in this way: Y_test[i - train_size, :] = [0.5, 0.5], and although I understood that in this way I could use softmax, I used sigmoid. All I did was what I said; I didn’t add new neurons or anything, but the code can’t predict any neutral sentiment. Do you have any suggestions??

        • Hi, with (0.5, 0.5) you should use softmax. Indeed, any output which is close to (0.5, 0.5) is implicitly a neutral. However, do you have neutral tweets? You should have a dataset made up of 33% positive, 33% negative, and 33% neutral in order to avoid biases.

          • john

            yeah, my corpus consists of only about 10% neutral. I’m going to make my corpus balanced, but you know, when I put a print after this line:
            Y_train[i, :] = [0.5, 0.5]
            print(Y_train[i, :])
            I see [0 0] in the output!!! Do you know why??
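
            A likely cause: in the original code, Y_train is allocated as np.zeros((train_size, 2), dtype=np.int32), so assigning [0.5, 0.5] silently truncates the floats to integers. A minimal demonstration:

```python
import numpy as np

# Integer targets array, as allocated in the original code
Y_int = np.zeros((1, 2), dtype=np.int32)
Y_int[0, :] = [0.5, 0.5]
print(Y_int[0, :])  # [0 0] -- the floats are truncated to int32

# Allocating with a float dtype keeps the neutral target intact
Y_float = np.zeros((1, 2), dtype=np.float32)
Y_float[0, :] = [0.5, 0.5]
print(Y_float[0, :])  # [0.5 0.5]
```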

      • john

        and I also want to know whether you prefer to assign it in the way I mentioned or in this way:
        Y_test[i - train_size, :] = [0.0, 0.0, 0.1] (consider that I do the same for positive and negative too). Honestly, I did that, but I can’t get a proper result, so I want to know whether this might be some logical problem or something from my corpus….
        With this approach, I just changed Y_train = np.zeros((train_size, 2), dtype=np.int32) to 3, did the same for the test, and changed softmax to sigmoid.
        thank you for your patience
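
        For three mutually exclusive classes, a common alternative (a sketch, not the exact approach discussed above) is to keep a 3-unit softmax output, e.g. Dense(3, activation='softmax') in Keras, with one-hot targets whose rows always sum to 1:

```python
import numpy as np

# Toy labels: 0 = negative, 1 = positive, 2 = neutral
labels = np.array([0, 1, 2, 1, 0])

# One-hot encode: row i has a single 1.0 in column labels[i]
Y = np.zeros((len(labels), 3), dtype=np.float32)
Y[np.arange(len(labels)), labels] = 1.0

print(Y[:3])
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```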

  12. shim

    Great job!
    Hi – I’m new in this field, so I’m confused about a basic issue. Why doesn’t your model use a classifier training method, such as training and testing a Naive Bayes classifier?
    Is it OK to just choose the training and testing datasets randomly from the corpus?? Why?
    Sorry if these are basic questions
    thank you

    • Hi, this is a model based on word vectors, which can be managed more efficiently using NNs or kernel SVMs. However, you are free to test any other algorithm (e.g. a Gaussian Naive Bayes) and select the solution that best meets your needs.

      In any model, the dataset is supposed to represent a data-generating process, so randomly sampling from it is the optimal way to create two subsets whose distributions are close to (though not exactly overlapping) the original probability distribution. A non-random choice can bias the model by forcing it to learn only some associations while others are never presented (and, therefore, the corresponding predictions cannot be reliable).
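
      Mirroring the random sampling used in the post, a split might be sketched like this (the sizes are made up):

```python
import numpy as np

n_samples = 1000
train_size, test_size = 800, 200

# Sample distinct indexes uniformly at random, then partition them
rng = np.random.default_rng(1000)
indexes = rng.choice(n_samples, train_size + test_size, replace=False)
train_idx, test_idx = indexes[:train_size], indexes[train_size:]

# The two subsets are disjoint samples from the same distribution
assert len(set(train_idx) & set(test_idx)) == 0
print(len(train_idx), len(test_idx))  # 800 200
```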

  13. john

    Hi – I have a question: why do you use a 2D array for Y_train and Y_test? Is there any problem with defining a 1D vector and passing, for example, 0 for negative and 1 for positive, in this way:
    Y_test[i - train_size, :] = [1] for positive
    Y_test[i - train_size, :] = [0] for negative

    or, for example, in this way:
    Y_test[i - train_size, :] = [1, 1] for positive
    Y_test[i - train_size, :] = [0, 0] for negative

    Does the way of initializing Y_test have any effect on the learning, or what?
    thank you
