Recommendations and User-Profiling from Implicit Feedbacks
Recommendations and Feedbacks
The vast majority of B2C services are quickly discovering the strategic importance of solid recommendation engines to improve conversion rates and establish stronger customer loyalty. The most common strategies are based [3] on the segmentation of users according to their personal features (age range, gender, interests, social interactions, and so on) or to the ratings they gave to specific items. The latter approach normally relies on explicit feedback (e.g. a rating from 0 to 10) that summarizes the overall experience. Unfortunately, both approaches have drawbacks.
Personal data are becoming harder to retrieve, and the latest regulations (such as the GDPR) allow users to interact with a service without any data collection. Moreover, a reliable personal profile must be built using many attributes that are often hidden and can only be inferred using predictive models. Conversely, explicit feedback is easy to collect but extremely sparse. Strategies based on the factorization of the user-item matrix [3] are easy to implement (also considering the possibility of employing parallelized algorithms), but the discrepancy between the number of ratings and the number of products is normally too high to allow an accurate estimation.
Our experience showed that it's possible to create very accurate user profiles (even for an anonymous user identified only by a random ID) by exploiting the implicit feedback present in textual reviews. Our main assumption is that reviews which are not classified as spam reflect the wishes of a user with respect to a specific experience. In other words, a simple sentence like “The place was very noisy” implies that I prefer quiet places or, adding extra constraints, that, in a specific situation (characterized by n factors), a quiet place is preferable to a noisy one. The approach we are going to present is based on this hypothesis and can be adapted to cope with many different scenarios.
Assumptions
We assume that a user is represented as a real-valued feature vector of fixed length Nu:
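u = (β1, β2, …, βNu), where each coefficient βi (defined in the following sections) measures the sentiment-weighted affinity of the user for the topic ti.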
Each topic ti is represented by a label corresponding to a set of semantically coherent words, for example:
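t1 = “Room/Quality” → {furniture, bed, cleaning, bathroom, TV/satellite} (see the table presented later for more examples).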
Standard topic modeling approaches (like LDA [1]) can be employed to discover the topics present in a large set (corpus) of reviews; however, this strategy can lead to a large number of incoherent results. The main reasons are:
- LDA works with vectorizations based on frequencies or TF-IDF; therefore, the words are not considered in their semantic context
- One fundamental assumption is that each topic is characterized by a small set of words which are peculiar to it
While the latter can be easily managed, the former is extremely problematic. In fact, reviews are not homogeneous documents that can be trivially categorized (e.g. politics, sport, economics, and so forth). On the contrary, they are often made up of the same subset of words that can, however, acquire different meanings depending on the surrounding context.
This problem can be solved with different Deep-Learning solutions (like Word2Vec [5], GloVe [6], or FastText [7]) but we have focused on Word2Vec with a Skip-Gram model.
The main idea is to train a deep neural network in order to find the best vectorial representation of a word considering the context where it’s placed in all the documents. The generic structure of the network is shown in the following figure:
Each review is split into sub-reviews of 2n words around a pivot word that we feed as input. The goal of the training process is to force the network to rebuild the context given the pivot. As the vectors are rather long (100-300 32-bit float values), the process will converge to a solution where semantically similar words are placed into very dense balls whose radius is much smaller than the extent of the whole dictionary sub-space. Hence, for example, the word “room” will implicitly become a synonym of “chamber”, and the cosine distance between the two vectors will be extremely small. As many people use jargon terms in their reviews (together with smileys or other special characters), this approach guarantees the robustness needed to evaluate the content without undesired biases. For further technical details, I suggest reading [5]; in general, the dynamic is based on a random initialization of the vectors and a training process where the output layer is made up of softmax units representing the probability of each word. As this approach is unfeasible considering the number of words, techniques like Hierarchical Softmax or Negative Sampling (derived from Noise Contrastive Estimation) are employed as cost functions. In general, after a few iterations (10-50), the global representation is coherent with the underlying semantic structure.
Our model
To create the recommender, we selected 500,000 reviews written in English and employed Gensim [9] to create the vectorial representation after a basic NLP preprocessing step (in particular, stop-word and punctuation removal). The vector size has been set to 256, in order to optimize GPU support in the next stages.
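As a minimal sketch of this step (assuming Gensim 4.x naming and a tiny illustrative corpus; the real preprocessing pipeline and corpus were much larger), the embedding training could look like this:

```python
import re
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import STOPWORDS

def preprocess(text):
    # Lowercase, keep only alphabetic tokens, drop stop-words and punctuation
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

# Hypothetical raw reviews (the real corpus contained 500,000 English reviews)
reviews = [
    "The room was very clean and the bed was comfortable",
    "The receptionist gave us a map with directions to the station",
    "Cheap hotel located near the main square, close to the airport shuttle",
]
corpus = [preprocess(r) for r in reviews]

# Skip-Gram Word2Vec with 256-dimensional vectors and negative sampling
w2v = Word2Vec(sentences=corpus,
               vector_size=256,   # chosen to optimize GPU support later on
               window=5,          # 2n words around the pivot
               sg=1,              # Skip-Gram model
               negative=10,       # negative sampling instead of a full softmax
               min_count=1,       # a higher threshold is used on the real corpus
               epochs=20,         # 10-50 iterations are usually enough
               workers=4)

# Semantically similar words (e.g. "room" and "chamber") end up very close
print(w2v.wv.most_similar("room", topn=3))
```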
As explained before, standard LDA cannot be applied with word-vectors, but Das et al. proposed a solution [2] that allows modeling the topics as multivariate Gaussian distributions. As the authors observed, this choice offers an optimal solution for real-valued representations that can be compared using the Euclidean distance. An example is shown in the following figure:
What we observe at the end of the training is:
This is an obvious consequence of the triangle inequality, but it's extremely useful when working with implicit feedback: we don't need to worry about misunderstood synonyms unless they appear in incoherent contexts. However, assuming that the reviews are prefiltered to remove spam, this case is very rare.
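The full training procedure is described in [2]; the following is only a minimal sketch, assuming the topic Gaussians (means and covariances) have already been estimated, of how a single word vector can be softly assigned to the topics:

```python
import numpy as np
from scipy.stats import multivariate_normal

def topic_distribution(word_vector, topic_means, topic_covs):
    """Soft assignment of a word vector to the Gaussian topics.

    topic_means: array (n_topics, 256), topic_covs: array (n_topics, 256, 256)
    (hypothetical parameters estimated by Gaussian LDA)
    """
    log_liks = np.array([
        multivariate_normal.logpdf(word_vector, mean=m, cov=c, allow_singular=True)
        for m, c in zip(topic_means, topic_covs)
    ])
    # Softmax over the log-likelihoods to obtain a normalized topic distribution
    log_liks -= log_liks.max()
    probs = np.exp(log_liks)
    return probs / probs.sum()

# Synonyms (e.g. "room" and "chamber") have close vectors, so, by the triangle
# inequality, their topic distributions are also very similar:
# topic_distribution(w2v.wv["room"], topic_means, topic_covs)
```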
Gaussian LDA requires specifying the desired number of topics. In our project, we started with 100 topics and, after a few evaluations, reduced the number to 10. An example of 3 topics (in the context of Berlin hotel reviews) is shown in the following table:
| Room/Quality | Service | Location |
| --- | --- | --- |
| Furniture | Receptionist | Station |
| Bed | English | Airport |
| … | … | … |
| Cleaning | Information | Shuttle service |
| TV/Satellite | Map | Alexanderplatz |
| Bathroom | Directions | Downtown |
These are only a few selected words, but during our analysis, we filtered many redundant terms out. However, it’s easy to check that, for example:
In order to have a deeper understanding of the reviews and exploit them to create user profiles, we have also employed a sentiment analysis model, which is trained to perform what we have called “vestigial analysis”. In other words, we need to extract the sentiment of every agnostic word (like “Bed”) considering a small surrounding context. Let’s consider the following example:
Considering the previous example, we want to create a sparse feature vector as a function of the user and the item:
Every coefficient βi represents the average sentiment (normalized between -1.0 and 1.0) for the specific topic. As each review covers only a subset of topics, we assume most βi = 0. For example, in the previous case, we are considering only the keywords room, hotel, square, and located, which refer respectively to the topics t1 = ”Room/Quality” and t3 = ”Location”. In this case, the attribute “cheap” has a neutral sentiment because this word can also be used to describe poor-quality hotels. The resulting vector is therefore non-zero only in the components corresponding to t1 and t3.
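As a minimal sketch (the topic indices follow the t1/t3 numbering used above, while the helper structures and the sentiment values are purely illustrative), the per-review vector could be built as follows:

```python
import numpy as np

N_TOPICS = 10
TOPIC_INDEX = {"Room/Quality": 0, "Service": 1, "Location": 2}  # t1, t2, t3, ...

def review_vector(keyword_sentiments, keyword_topic):
    """Build the sparse per-review vector of average sentiments per topic.

    keyword_sentiments: dict keyword -> sentiment in [-1.0, 1.0]
                        (output of the vestigial sentiment model)
    keyword_topic:      dict keyword -> topic label (from Gaussian LDA)
    """
    sums = np.zeros(N_TOPICS)
    counts = np.zeros(N_TOPICS)
    for word, sentiment in keyword_sentiments.items():
        topic = keyword_topic.get(word)
        if topic is not None:
            idx = TOPIC_INDEX[topic]
            sums[idx] += sentiment
            counts[idx] += 1
    # Topics not covered by the review keep a 0 coefficient
    return np.divide(sums, counts, out=np.zeros(N_TOPICS), where=counts > 0)

# Illustrative values: "cheap" is neutral and already excluded; only
# t1 = "Room/Quality" and t3 = "Location" end up with non-zero coefficients
r = review_vector(
    {"room": 0.7, "hotel": 0.5, "square": 0.6, "located": 0.6},
    {"room": "Room/Quality", "hotel": "Room/Quality",
     "square": "Location", "located": "Location"})
print(r)  # [0.6, 0.0, 0.6, 0.0, ..., 0.0]
```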
Once all the reviews have been processed (at the beginning the process is purely batch, but it becomes incremental during the production phase), we can obtain the user-profile feature vectors as:
The resulting vector contains an average of all topics with a weight obtained through the sentiment analysis. In particular scenarios, we have also tested the possibility of reweighting the coefficients according to the global average or to secondary factors. This approach allows us to avoid undesired biases toward topics which should remain secondary.
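A minimal sketch of this aggregation step, assuming `review_vectors` contains the per-review vectors of a single user (the optional reweighting is only hinted at):

```python
import numpy as np

def user_profile(review_vectors, topic_weights=None):
    """Average the per-review vectors of a single user into a profile.

    review_vectors: iterable of arrays of shape (N_TOPICS,)
    topic_weights:  optional reweighting array (e.g. to damp topics that
                    should remain secondary)
    """
    profile = np.mean(np.vstack(list(review_vectors)), axis=0)
    if topic_weights is not None:
        profile = profile * topic_weights
    return profile

# Incremental variant used in production: running mean updated per new review
def update_profile(profile, n_reviews, new_review_vector):
    return (profile * n_reviews + new_review_vector) / (n_reviews + 1)
```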
The neural model to perform the sentiment analysis is shown in the following schema:
The dataset has been labeled by selecting the keywords and their corresponding sentiment. The role of the “Word vector selector” is to create an “attention mechanism” that allows focusing only on the relevant parts. In particular, it receives the whole sequence and the positions of the target keywords, and outputs a synthetic vector that is fed into the following sub-network.
In order to exploit the compositional ability of word vectors, we have implemented 1D convolutions that can evaluate the local relationships. Horizontal pooling layers didn't boost the performance as expected and have been excluded. All convolutions are based on ReLU activations.
The output of the network is made up of 15 hyperbolic tangent units that represent the sentiment for each topic. This choice allowed us to discard neutral sentiments, as their value is close to zero. In our practical implementation, we have thresholded the output so as to accept only absolute values greater than 0.1. This choice allowed us to denoise the predictions while keeping a very high accuracy. The training has been performed using 480,000 reviews and the validation has been done on 20,000 reviews. The final validation accuracy is about 88%, but deeper models with larger corpora are showing better performances. An alternative approach is based on an LSTM layer that captures the structural sequence of the review:
LSTMs are extremely helpful when the reviews are not very short. In this case, their peculiar ability to discover long-term dependencies allows processing contexts where the target keyword and the qualifying elements are not very close. For example, a sentence like “The hotel, which we booked using the Acme service suggested by a friend of ours, is really excellent” requires more complex processing to prevent the network from remaining stuck in short contexts. Seq2Seq with attention has been successfully tested, with the goal of outputting synthetic vectors where the most important elements are immediately processable. We have also noticed that convolutions learn much faster when they cope with standardized structures. This approach, together with other improvements, is the goal of future developments.
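As an illustration, a strongly simplified Keras sketch of the convolutional variant could look like the following (layer sizes, sequence length, and the regression loss are assumptions; the input is the synthetic sequence produced by the word vector selector):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, EMB_DIM, N_OUTPUTS = 64, 256, 15

# Input: the synthetic sequence produced by the "Word vector selector"
inputs = keras.Input(shape=(SEQ_LEN, EMB_DIM))

# 1D convolutions (ReLU) evaluate the local relationships between word vectors;
# pooling layers are omitted, as they didn't boost the performance
x = layers.Conv1D(128, kernel_size=3, activation="relu")(inputs)
x = layers.Conv1D(128, kernel_size=3, activation="relu")(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)

# 15 hyperbolic tangent units: one sentiment in [-1, 1] per topic
outputs = layers.Dense(N_OUTPUTS, activation="tanh")(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")

# At inference time, near-zero outputs are treated as neutral sentiments
def denoise(predictions, threshold=0.1):
    return np.where(np.abs(predictions) > threshold, predictions, 0.0)
```

In the LSTM variant, the convolutional block can be replaced by (or combined with) a recurrent layer such as `layers.LSTM(128)`, which is better suited to long reviews where the keyword and its qualifiers are far apart.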
From User-Profiles to Recommendations
Once all the reviews have been processed, the feature vectors can be directly employed for recommendations. In particular, we have tested and implemented the following solutions:
- User clustering: in this case, the users are grouped at a higher level using K-Means and Spectral Clustering (the latter is particularly useful as the clusters are often non-convex). Then, each cluster is mapped using a Ball-Tree, and a k-Nearest Neighbors algorithm selects the most similar users according to a variable radius (a minimal sketch is shown after this list). As the feature vectors represent the comments and latent wishes, very close users can share their experiences and, hence, the engine suggests new items in a discovery section. This approach is an example of collaborative filtering that can be customized with extra parameters (real-time suggestions, proximity to an event place, and so on).
- Local suggestion: whenever geospatially constrained recommendations are necessary, we can force the weight of a topic (e.g. proximity to an airport) or, when possible, use the mobile GPS to determine the position and create a local neighborhood.
- Item-tag-match: When the items are tagged, the profiles are exploited in two ways:
- To re-evaluate the tagging (e.g. “The hotel has satellite TV” is identified as false by the majority of reviews and its weight is decreased)
- To match users and items immediately, without the need for collaborative filtering or other clustering approaches
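As mentioned in the first point, here is a minimal scikit-learn sketch of the user-clustering option (the parameter values are illustrative and `profiles` is a hypothetical matrix of user-profile vectors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import BallTree

# Hypothetical matrix of user-profile vectors (n_users x n_topics)
profiles = np.random.uniform(-1.0, 1.0, size=(10000, 10))

# Higher-level grouping (SpectralClustering can be used instead when the
# clusters are non-convex)
kmeans = KMeans(n_clusters=50, n_init=10, random_state=1000)
cluster_ids = kmeans.fit_predict(profiles)

# One Ball-Tree per cluster for fast neighborhood queries
trees = {c: (BallTree(profiles[cluster_ids == c]), np.where(cluster_ids == c)[0])
         for c in np.unique(cluster_ids)}

def similar_users(user_idx, radius=0.5):
    # Return the users lying within a variable radius of the given user
    tree, members = trees[cluster_ids[user_idx]]
    local_idx = tree.query_radius(profiles[user_idx].reshape(1, -1), r=radius)[0]
    neighbors = members[local_idx]
    return neighbors[neighbors != user_idx]

print(similar_users(0, radius=0.8))
```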
Moreover, the user-profile network has been designed to consider a feedback system, as shown in the following figure:
The goal is to reweight the recommendations considering the common factors between each item and the user profile. Let's suppose that the item i has been recommended because it matches the positive preference for the topics t1, t3, t5. If the user sends a negative feedback message, the engine will reweight these topics using an exponential function (based on the number of received feedback messages). Conversely, a positive feedback will increase the weight. The approach is based on a simple reinforcement learning strategy, but it allows fine-tuning for every single user.
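The exact reweighting function is not detailed here; as a sketch, an exponential update driven by the number of feedback messages could look like this (the constant k is an assumption):

```python
import numpy as np

def reweight_topics(profile, topic_indices, n_feedbacks, positive, k=0.2):
    """Exponentially reweight the topics shared by the item and the profile.

    profile:       user-profile vector (components in [-1, 1])
    topic_indices: indices of the matched topics (e.g. t1, t3, t5)
    n_feedbacks:   number of feedback messages received so far
    positive:      True for positive feedback, False for negative
    k:             hypothetical growth/decay constant
    """
    factor = np.exp(k * n_feedbacks if positive else -k * n_feedbacks)
    updated = profile.copy()
    updated[topic_indices] = np.clip(updated[topic_indices] * factor, -1.0, 1.0)
    return updated

# Example: item i matched t1, t3, t5 (indices 0, 2, 4) and received negative feedback
# new_profile = reweight_topics(profile, [0, 2, 4], n_feedbacks=3, positive=False)
```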
Conclusions
Recommendations are becoming more and more important and, in many cases, the success of a service depends on them. Our experience taught us that reviews are more “accepted” than surveys and provide much more information than explicit feedback. Current Deep Learning techniques allow processing natural language, including jargon expressions, in real time, with an accuracy that is comparable to a human agent. Moreover, whenever the reviews are prefiltered to remove all spam and meaningless entries, their content reflects each personal experience often much better than a simple rating. In our future developments, we plan to implement the engine in different industries, collecting reviews together with public tweets with specific hashtags, and to increase the “vestigial sentiment analysis” ability with deeper models and different word-vector strategies (like FastText, which works with character n-grams and doesn't lose performance on rare words).
References
- [1] Blei D. M., Ng A. Y., Jordan M. I., Latent Dirichlet Allocation, Journal of Machine Learning Research 3/2003
- [2] Das R., Zaheer M., Dyer C., Gaussian LDA for Topic Models with Word Embeddings, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 07/2015
- [3] Bonaccorso G., Machine Learning Algorithms, Packt Publishing, 2017
- [4] Bonaccorso G., Mastering Machine Learning Algorithms, Packt Publishing, 2018
- [5] Mikolov T., Sutskever I., Chen K., Corrado G., Dean J., Distributed Representations of Words and Phrases and their Compositionality, arXiv:1310.4546 [cs.CL]
- [6] Pennington J., Socher R., Manning C. D., GloVe: Global Vectors for Word Representation, Stanford University
- [7] Joulin A., Grave E., Bojanowski P., Mikolov T., Bag of Tricks for Efficient Text Classification, arXiv:1607.01759 [cs.CL]
- [8] Bonaccorso G., Reuters-21578 text classification with Gensim and Keras, https://www.bonaccorso.eu/2016/08/02/reuters-21578-text-classification-with-gensim-and-keras/
- [9] Gensim Word2Vec documentation, https://radimrehurek.com/gensim/models/word2vec.html
2 thoughts on “Recommendations and User-Profiling from Implicit Feedbacks”
Hi Giuseppe,
Your work looks so interesting.
I’d love to get a bit more insight about how the “word vector selector” works exactly, though.
Thanks
Antoine
Thanks, Antoine.
The word vector selector has not been detailed because I’m planning to post a complete working example. However, the idea is based on a “pseudo-attention” mechanism implemented with a simple MLP with a softmax output (the input length is fixed and the sentences are padded or truncated). Each value represents the probability of a specific word to be representative of a context. The network is trained with labeled examples and, thanks to word-vectors, is also very robust to synonyms.
Each training couple is made up of (wv1, wv2, …, wvN) -> (p1, p2, …, pN), where the probabilities are non-zero only for the representative vectors (an alternative approach is based on sigmoids, but the training speed was slower and the final accuracy worse). E.g. “The restaurant is nice but the food is quite bad” -> Word vectors -> Targets: “restaurant” and “food” (so the softmax output would be 0.0, 0.0, 0.5, …, 0.5, 0.0, 0.0). As we want to perform a “local” sentiment analysis, each “peak” in the softmax is surrounded by a set of additional words. Hence, in this case, for example, we want a peak for “restaurant” and a smaller value for “nice” (e.g. 0.3, 0.2).
Once this submodel had been trained, we froze it and trained the convolutional network. I hope this very brief explanation is helpful.