The post Machine Learning Algorithms – Second Edition appeared first on Giuseppe Bonaccorso.

From the back cover:

Machine learning has gained tremendous popularity for its powerful and fast predictions through large datasets. However, the true forces behind its powerful output are the complex algorithms involving substantial statistical analysis that churn large datasets and generate sufficient insight.

This second edition of Machine Learning Algorithms walks you through prominent development outcomes that have taken place relating to machine learning algorithms, which constitute major contributions to the machine learning process and help you to strengthen and master statistical interpretation across supervised, semi-supervised, and reinforcement learning areas. Once the core concepts of an algorithm have been exposed, you’ll explore real-world examples based on the most diffused libraries, such as scikit-learn, NLTK, TensorFlow, and Keras. You will discover new topics such as principal component analysis (PCA), independent component analysis (ICA), Bayesian regression, discriminant analysis, advanced clustering, and Gaussian mixture.

By the end of this book, you will have studied machine learning algorithms and be able to put them into production to make your machine learning applications more innovative.

ISBN: 9781789347999

Link to the publisher page: https://www.packtpub.com/big-data-and-business-intelligence/machine-learning-algorithms-second-edition

Source code repository: https://github.com/PacktPublishing/Machine-Learning-Algorithms-Second-Edition

## Machine Learning Algorithms – Giuseppe Bonaccorso

My latest machine learning book has been published and will be available during the last week of July. From the back cover: In this book you will learn all the important Machine Learning algorithms that are commonly used in the field of data science.


The post Recommendations and User-Profiling from Implicit Feedbacks appeared first on Giuseppe Bonaccorso.

The vast majority of B2C services are quickly discovering the strategic importance of solid recommendation engines to improve conversion rates and establish a stronger loyalty with their customers. The most common strategies are based [3] on the segmentation of users according to their personal features (age range, gender, interests, social interactions, and so on) or to the ratings they gave to specific items. The latter approach normally relies on explicit feedback (e.g. a rating from 0 to 10) that summarizes the overall experience. Unfortunately, both cases have drawbacks.

Personal data are becoming harder to retrieve, and the latest regulations (i.e. GDPR) allow the user to interact with a service without consenting to the collection of personal data. Moreover, a reliable personal profile must be built from many attributes that are often hidden and can only be inferred using predictive models. Conversely, explicit feedback is easy to collect, but it is extremely sparse. Strategies based on the factorization of the user-item matrix [3] are easy to implement (also considering the possibility of employing parallelized algorithms), but the discrepancy between the number of ratings and the number of products is normally too large to allow an accurate estimation.

Our experience showed that it’s possible to create very accurate user profiles (even for an anonymous user identified only by a random *ID*) by exploiting the implicit feedback present in textual reviews. Our main assumption is that reviews which are not classified as spam reflect the wishes of a user with respect to a specific experience. In other words, a simple sentence like “The place was very noisy” implies that I prefer quiet places or, adding extra constraints, that, in a specific situation (characterized by *n* factors), a quiet place is preferable to a noisy one. The approach we are going to present is based on this hypothesis and can be adapted to cope with many different scenarios.

We assume that a user is represented as a real-valued feature vector of fixed length *Nu*:

Each topic *tᵢ* is represented by a label corresponding to a set of semantically coherent words, for example:

Standard topic modeling approaches (like LDA [1]) can be employed to discover the topics defined in a large set (corpus) of reviews; however, this strategy can lead to a large number of incoherent results. The main reasons are:

- LDA works with vectorizations based on frequencies or TF-IDF, therefore the words are not considered in their semantic context
- One fundamental assumption is that each topic is characterized by a small set of words which are peculiar to it

While the latter can be easily managed, the former is extremely problematic. In fact, reviews are not homogeneous documents that can be trivially categorized (e.g. politics, sport, economics, and so forth). On the contrary, they are often made up of the same subset of words, which can nonetheless acquire different meanings depending on the relative context.

This problem can be solved with different deep-learning solutions (like Word2Vec [5], GloVe [6], or FastText [7]), but we have focused on Word2Vec with a Skip-Gram model.

The main idea is to train a deep neural network in order to find the best vectorial representation of a word considering the context where it’s placed in all the documents. The generic structure of the network is shown in the following figure:

Each review is split into sub-reviews of 2n words around a *pivot* word that we feed as input. The goal of the training process is to force the network to rebuild the context given the pivot. As the vectors are rather long (100–300 32-bit float values), the process converges to a solution where semantically similar words are placed into very dense balls whose radius is much smaller than the whole dictionary sub-space. Hence, for example, the word “room” will implicitly become a synonym of “chamber”, and the cosine distance between the two vectors will be extremely small. As many people use jargon in their reviews (together with smileys and other special characters), this approach guarantees the robustness needed to evaluate the content without undesired biases. For further technical details, I suggest reading [5]; in general, the dynamic is based on a random initialization of the vectors and a training process where the output layer is made up of softmax units representing the probability of each word. As this approach is unfeasible considering the number of words, techniques like hierarchical softmax or noise-contrastive estimation (negative sampling) are employed as cost functions. In general, after a few iterations (10–50), the global representation is coherent with the underlying semantic structure.
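As a minimal sketch of the windowing scheme just described (illustrative code, not the actual training pipeline), the (pivot, context) pairs fed to a Skip-Gram model can be generated like this:

```python
def skipgram_pairs(tokens, n=2):
    """Generate (pivot, context) pairs using a window of up to n words
    on each side of the pivot, as in the Skip-Gram formulation."""
    pairs = []
    for i, pivot in enumerate(tokens):
        for j in range(max(0, i - n), min(len(tokens), i + n + 1)):
            if j != i:  # the pivot is the input, not part of its own context
                pairs.append((pivot, tokens[j]))
    return pairs

review = "the place was very noisy".split()
print(skipgram_pairs(review, n=1))
```

During training, the network receives the pivot and is asked to predict each context word; the hidden-layer weights become the word vectors.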

To create the recommender, we selected 500,000 reviews written in English and employed Gensim [9] to create the vectorial representation after a basic NLP preprocessing (in particular, stop-word and punctuation removal). The vector size was set to 256, so as to optimize the GPU support in the next stages.

As explained before, standard LDA cannot be applied to word vectors, but Das et al. proposed a solution [2] that allows modeling the topics as multivariate Gaussian distributions. As the authors observed, this choice offers an optimal solution for real-valued representations that can be compared using the Euclidean distance. An example is shown in the following figure:

What we observe at the end of the training is:

This is an obvious consequence of the triangle inequality, but it’s extremely useful when working with implicit feedback: we can avoid worrying about misunderstood synonyms unless they appear in incoherent contexts. However, assuming that the reviews are prefiltered to avoid spam, this possibility is very rare.

Gaussian LDA requires specifying a desired number of topics. In our project, we started with 100 topics and, after a few evaluations, reduced the number to 10. An example of 3 topics (in the context of Berlin hotel reviews) is shown in the following table:

| Room/Quality | Service | Location |
| --- | --- | --- |
| Furniture | Receptionist | Station |
| Bed | English | Airport |
| … | … | … |
| Cleaning | Information | Shuttle service |
| TV/Satellite | Map | Alexanderplatz |
| Bathroom | Directions | Downtown |
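Gaussian LDA itself is not part of the mainstream libraries, but its core intuition — topics as multivariate Gaussians in the embedding space — can be roughly approximated with a Gaussian mixture over word vectors. In this sketch the vectors are synthetic placeholders, not real embeddings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Placeholder for real word vectors (e.g. the 256-dimensional Word2Vec output);
# here: two artificial dense regions standing in for two semantic areas.
word_vectors = np.vstack([
    rng.normal(loc=-2.0, scale=0.3, size=(50, 16)),
    rng.normal(loc=2.0, scale=0.3, size=(50, 16)),
])

# Each mixture component plays the role of one topic
gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
topics = gmm.fit_predict(word_vectors)

# Words whose vectors lie in the same dense ball end up in the same "topic"
print(topics[:5], topics[-5:])
```

This is only an analogy: Gaussian LDA additionally keeps the per-document topic proportions of classical LDA, which a plain mixture ignores.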

These are only a few selected words; during our analysis, we filtered out many redundant terms. However, it’s easy to check that, for example:

In order to have a deeper understanding of the reviews and exploit them to create user profiles, we have also employed a sentiment analysis model, which is trained to perform what we have called “vestigial analysis”. In other words, we need to extract the sentiment of every agnostic word (like “Bed”) considering a small surrounding context. Let’s consider the following example:

Considering the previous example, we want to create a sparse feature vector as a function of the user and the item:

Every coefficient *βᵢ* represents the average sentiment (normalized between -1.0 and 1.0) for the specific topic. As each review covers only a subset of topics, we assume that most *βᵢ* = 0. For example, in the previous case, we are considering only the keywords room, hotel, square, and located, which refer respectively to the topics t₁ = ”Room/Quality” and t₃ = ”Location”. In this case, the attribute “cheap” has a neutral sentiment because this word can also be used to describe poor-quality hotels. The resulting vector becomes:
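As a toy illustration of this construction (the β values and topic indices below are invented, not taken from the actual model):

```python
import numpy as np

N_TOPICS = 10  # as in the project described above

def review_vector(topic_sentiments, n_topics=N_TOPICS):
    """Build a sparse feature vector: beta_i in [-1, 1] for the topics
    covered by the review, 0 everywhere else."""
    v = np.zeros(n_topics)
    for topic_index, beta in topic_sentiments.items():
        v[topic_index] = np.clip(beta, -1.0, 1.0)
    return v

# Hypothetical review: positive on t1 (Room/Quality), mildly positive on t3 (Location)
v = review_vector({0: 0.8, 2: 0.4})
print(v)
```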

Once all the reviews have been processed (at the beginning the process is purely batch, but it becomes incremental during the production phase), we can obtain the user-profile feature vectors as:

The resulting vector contains an average of all topics with a weight obtained through the sentiment analysis. In particular scenarios, we have also tested the possibility of reweighting the coefficients according to the global average or to secondary factors. This approach allows us to avoid undesired biases toward topics that should remain secondary.
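A minimal sketch of this averaging step (averaging each topic only over the reviews that actually mention it is an assumption here, one of the reweighting variants alluded to above):

```python
import numpy as np

def user_profile(review_vectors):
    """Average the per-review topic-sentiment vectors of a user.
    Each topic is averaged only over the reviews that mention it,
    so rarely-mentioned topics are not diluted toward zero."""
    R = np.asarray(review_vectors, dtype=float)
    counts = np.count_nonzero(R, axis=0)   # reviews mentioning each topic
    sums = R.sum(axis=0)
    return np.divide(sums, counts,
                     out=np.zeros_like(sums), where=counts > 0)

reviews = [
    [0.8, 0.0, 0.4],    # toy 3-topic review vectors
    [0.6, -0.5, 0.0],
]
print(user_profile(reviews))
```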

The neural model to perform the sentiment analysis is shown in the following schema:

The dataset has been labeled by selecting the keywords and their relative sentiment. The role of the “Word vector selector” is to create an “attention mechanism” that allows focusing only on the relevant parts. In particular, it receives the whole sequence and the positions of the target keywords, and outputs a synthetic vector that is fed into the following sub-network.

In order to exploit the compositional ability of word vectors, we have implemented 1D convolutions that can evaluate the local relationships. Horizontal pooling layers didn’t boost performance as expected and have been excluded. All convolutions are based on ReLU activations.

The output of the network is made up of 15 hyperbolic tangent units that represent the sentiment for each topic. This choice allowed us to discard neutral sentiments, as their value is close to zero. In our practical implementation, we have thresholded the output so as to accept only absolute values greater than 0.1, which denoises the predictions while keeping a very high accuracy. The training has been performed using 480,000 reviews and the validation on 20,000 reviews. The final validation accuracy is about 88%, but deeper models with larger corpora are showing better performance. An alternative approach is based on an LSTM layer that captures the structural sequence of the review:

LSTMs are extremely helpful when the reviews are not very short. In this case, their peculiar ability to discover long-term dependencies allows processing contexts where the target keyword and the qualifying elements are not very close. For example, a sentence like “The hotel, that we booked using the Acme service suggested by a friend of ours, is really excellent” requires more complex processing to prevent the network from remaining stuck in short contexts. Seq2Seq with attention has been successfully tested, with the goal of outputting synthetic vectors where the most important elements are immediately processable. We have also noticed that convolutions learn much faster when they cope with standardized structures. This approach, together with other improvements, is the goal of future developments.
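The thresholding step described earlier (only absolute values greater than 0.1 are accepted) reduces to a simple element-wise operation; this is an illustrative sketch, not the production code:

```python
import numpy as np

def denoise_sentiments(outputs, threshold=0.1):
    """Zero out near-neutral tanh outputs: only sentiments with
    |value| > threshold are kept, as described above."""
    outputs = np.asarray(outputs, dtype=float)
    return np.where(np.abs(outputs) > threshold, outputs, 0.0)

raw = [0.05, -0.72, 0.11, -0.09, 0.63]   # hypothetical tanh outputs
print(denoise_sentiments(raw))
```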

Once all the reviews have been processed, the feature vectors can be directly employed for recommendations. In particular, we have tested and implemented the following solutions:

- User clustering: in this case, the users are grouped at a higher level using K-Means and Spectral Clustering (the latter is particularly useful as the clusters are often non-convex). Then, each cluster is mapped using a Ball-Tree, and a k-Nearest Neighbors algorithm selects the most similar users according to a variable radius. As the feature vectors represent the comments and latent wishes, very close users can share their experiences and, hence, the engine suggests new items in a discovery section. This approach is an example of collaborative filtering that can be customized with extra parameters (real-time suggestions, proximity to an event place, and so on).
- Local suggestion: whenever geospatial-constrained recommendations are necessary, we can force the weight of a topic (e.g. close to an airport) or, when possible, we use the mobile GPS to determine the position and create a local neighborhood.
- Item-tag-match: When the items are tagged, the profiles are exploited in two ways:
  - To re-evaluate the tagging (e.g. “The hotel has satellite TV” is identified as false by the majority of reviews and its weight is decreased)
  - To match users and items immediately, without the need for collaborative filtering or other clustering approaches
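A rough sketch of the user-clustering pipeline described in the first point (synthetic profiles and illustrative parameters; Spectral Clustering could replace K-Means for non-convex clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import BallTree

rng = np.random.default_rng(1)
# Placeholder user-profile vectors (10 topics); the real ones come from the reviews
profiles = rng.uniform(-1.0, 1.0, size=(200, 10))

# High-level grouping of the users
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(profiles)

# Within one cluster, a Ball-Tree retrieves the neighbors inside a variable radius
cluster_0 = profiles[km.labels_ == 0]
tree = BallTree(cluster_0)
neighbors = tree.query_radius(cluster_0[:1], r=1.5)[0]
print(len(neighbors))
```

The items appreciated by the retrieved neighbors can then populate the discovery section for the query user.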

Moreover, the user-profile network has been designed to consider a feedback system, as shown in the following figure:

The goal is to reweight the recommendations considering the common factors between each item and the user profile. Let’s suppose that the item *i* has been recommended because it matches the positive preference for the topics t₁, t₃, and t₅. If the user sends a negative feedback message, the engine will reweight these topics using an exponential function (based on the number of received feedback messages). Conversely, a positive feedback will increase the weight. The approach is based on a simple reinforcement learning strategy, but it allows a fine-tuning based on every single user.
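The exponential reweighting can be sketched as follows; the coefficient `alpha` and the exact functional form are assumptions for illustration only:

```python
import numpy as np

def reweight(topic_weights, matched_topics, n_feedbacks, positive, alpha=0.3):
    """Exponentially reweight the topics that triggered a recommendation,
    after a positive or negative feedback message."""
    w = np.asarray(topic_weights, dtype=float).copy()
    # Positive feedback amplifies the matched topics, negative dampens them
    factor = np.exp(alpha * n_feedbacks) if positive else np.exp(-alpha * n_feedbacks)
    w[list(matched_topics)] *= factor
    return w

w = np.ones(5)
# Item recommended through topics t1, t3, t5 (indices 0, 2, 4);
# the user sends one negative feedback message:
print(reweight(w, [0, 2, 4], n_feedbacks=1, positive=False))
```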

Recommendations are becoming more and more important and, in many cases, the success of a service depends on them. Our experience taught us that reviews are more “accepted” than surveys and provide much more information than explicit feedback. Current deep learning techniques allow processing natural language, including jargon expressions, in real time, with an accuracy that is comparable to a human agent. Moreover, whenever the reviews are prefiltered so as to remove spam and meaningless content, they reflect each personal experience often much better than a simple rating. In our future developments, we plan to implement the engine in different industries, collecting reviews together with public Tweets with specific hashtags, and to increase the “vestigial sentiment analysis” ability with deeper models and different word-vector strategies (like FastText, which works with character n-grams and doesn’t lose performance on rare words).

- Blei D. M., Ng A. Y., Jordan M. I., Latent Dirichlet Allocation, Journal of Machine Learning Research, 3/2003
- Das R., Zaheer M., Dyer C., Gaussian LDA for Topic Models with Word Embeddings, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 07/2015
- Bonaccorso G., Machine Learning Algorithms, Packt Publishing, 2017
- Bonaccorso G., Mastering Machine Learning Algorithms, Packt Publishing, 2018
- Mikolov T., Sutskever I., Chen K., Corrado G., Dean J., Distributed Representations of Words and Phrases and their Compositionality, arXiv:1310.4546 [cs.CL]
- Pennington J., Socher R., Manning C. D., GloVe: Global Vectors for Word Representation, Stanford University
- Joulin A., Grave E., Bojanowski P., Mikolov T., Bag of Tricks for Efficient Text Classification, arXiv:1607.01759 [cs.CL]
- Bonaccorso G., Reuters-21578 text classification with Gensim and Keras, https://www.bonaccorso.eu/2016/08/02/reuters-21578-text-classification-with-gensim-and-keras/
- Gensim Word2Vec documentation: https://radimrehurek.com/gensim/models/word2vec.html

## Are recommendations really helpful? A brief non-technical discussion – Giuseppe Bonaccorso

Many times I had the opportunity to answer the question: “Are recommendations so important for my B2C service?”. Every time, my answer was the same: “It depends”. Clearly, I don’t want to be vague just to avoid the question, but the reality is much more complex than any mathematical model (and when a model gets closer to the reality, it becomes intractable).


The post Are recommendations really helpful? A brief non-technical discussion appeared first on Giuseppe Bonaccorso.

Nowadays, the smallest online store has many more products than the largest physical store. Moreover, the number of websites selling products is increasing, even if the largest companies keep trying to establish monopolies in almost every country. If we add all the B2C services that provide specific pieces of information (e.g. hotels, movies, bars, and so on), the number of possibilities becomes extremely high.

Suppose you have dozens of T-shirts, all clean and ready to use. You wake up in the morning and need to pick one of them. From a merely statistical viewpoint, the uncertainty is proportional to the number of possible choices, and real life isn’t so different. However, if you have, for example, 100 T-shirts and you write your choice every day in a notebook, after one year you can plot a histogram. Simply count the number of times you picked each T-shirt (not completely randomly, because we seldom make actual random decisions) and draw a bar proportional to this number. What kind of shape do you expect?

Probably many people have answered: “A flat line”, like the one shown in the following figure:

It’s possible but, unfortunately, in the majority of cases the shape will be different. Even if you bought all 100 T-shirts, your attraction to each of them is different. Moreover, when a decision is unconstrained (i.e. no T-shirt is dirty), you’re going to make your selection according to an almost unconscious priority-management system. Such a schema (considering its nature, we can call it the Linus’ blanket bias) isn’t an exclusive scenario: it happens everywhere and in many different situations. The result is shown in the next diagram, where the peak represents a favorite color, brand, or simply the most adaptable choice (i.e. the one that saves time):

At this point, we can try to give a definition of “recommendation”. Starting from the assumption that the majority of choices are biased, we can try to determine two important factors. The first one is the attraction point. In other words, we can simply recommend by analogy. This is a 98%-sure strategy in terms of acceptance, because everybody desires (consciously or unconsciously) to hear what he/she already thinks. However, the conversion rate can be low, because the set of possible choices around this point is extremely wide. In other words, the recommendation is not rejected, but it is seldom considered for an “investment”.

The other factor is the discovery desire. I assume that everybody, to a certain extent, desires to discover new things, but, unfortunately, we know neither the objects nor the search direction. For example, I like blue T-shirts, but I have seen a new pattern that attracts me too. In this case, the two elements are the specific item I saw and the pattern. The first one is like the centroid of a cluster, and I’m very likely to buy it. The second one (which I’ve previously called direction) is the set of elements belonging to this cluster. In general, a recommendation of any item belonging to the cluster (in particular, if it’s close to the centroid) will be positively accepted or, in the worst case, it will be used to start an exploration. In both cases, the conversion rate is generally higher.

Hence, recommendations are surely important, but only when they can drive a change in a mental schema. This is not an absolute statement but, during my experiences, I’ve often received positive confirmations. On the other hand, many new machine learning approaches are focused on these ideas and are moving from a classic “similarity-based” approach to a more complex schema where an unsupervised approach is employed together with a reinforcement learning one. It can seem weird, but the simplest strategy to improve a recommendation is asking the user to evaluate the suggestions. This is true not because the users are always perfectly aware of their desires, but because a larger set of possibilities can be efficiently pruned using the feedback.

Moreover, if a good recommendation should be a discovery, adding some “noise” is generally helpful. Let’s suppose this scenario: a user visits a website where he often buys products. He’s on the subway and has some time to spare, so he starts checking a few products shown on the home page, then he searches for something specific (don’t be surprised if he finds the same items seen in the morning, because repeating a few actions is a typical interaction pattern). Then, maybe, he sees a product that didn’t catch his attention before. Many tools (like heat-maps or clickstream analyzers) can be employed to gather all the necessary pieces of information; however, let’s suppose that we know that he read the description, scrolled up and down many times, and zoomed in on the pictures.

What kind of information can we obtain from this session? First of all, that the product and its features are relevant for that customer (even if he never bought a similar product). Moreover, we can determine some behavioral patterns that are often considered useless. For example, which pictures attracted his attention? How long did he look at a zoomed picture? Let’s suppose that the item is a backpack. Did he look at the lateral profile? Probably he’s interested in large or narrow backpacks. So, now we have a new centroid for a bunch of recommendations. How can we use the feedback? In this case, for example, we restrict the set by understanding whether he’s interested in large or narrow backpacks. We can pick a few representative elements and show them during the next visit.

What is the user going to do? There are many possible questions and, obviously, the answers can improve the recommendations or worsen them. If the user clicks only on the narrow backpack, a reinforcement learning approach will increase the expected reward of a sequence of actions (in this case, they can simply be products, supposing that the user is an agent that has to make a single decision), driving the model in the direction where the user is probably looking. It doesn’t matter if he continues to buy the same items: whenever such behavioral patterns are present and analyzed, recommendations will be closer to the concept of discovery, and the additional conversion rate (assuming we do not alter some established behaviors) will be proportionally higher. Clearly, this is not magic, but statistical analysis of social and psychological contexts and, yes, human beings are much more predictable than expected!

This is, for instance, the case of books. If you like thrillers, you’re probably going to buy them. But if you are attracted by a recipe book (in the way discussed before), a good series of suggestions can increase the probability that you’re going to add a recipe book to your common list of items.

Hence, we can summarize the main elements of a successful system:

- Recommendations must be discoveries, and discoveries cannot merely confirm what a user already knows.
- Interaction behavioral patterns can be used to understand the hidden (or latent) interests of a user.
- Feedback can help in refining the suggestions.
- Exploiting the refined patterns can dramatically increase the probability of conversion for uncommon items.
- Standard (or classical) recommendation strategies can also be employed as a first or secondary approach.
- The goal of an online store or a B2C service is not to limit access to the products/items, but to spread them widely; hence, a little noise can increase the discovery factor and help the customers make up their minds (strange but true: in one case the user can pick a new item, while in the opposite one he can decide to pursue his initial search strategy because the alternatives are inadequate).

Therefore, my answer is: “Yes! Recommendations are extremely important… as long as they don’t try to reinvent the wheel (in the mind of a customer)!”

(Also published on Medium)

## SVD Recommendations using Tensorflow – Giuseppe Bonaccorso

Recommendation systems based on user-item matrix factorization have become more and more important thanks to powerful and distributable algorithms like ALS, but sometimes the number of users and/or items is not so huge and the computation can be done directly with an SVD (Singular Value Decomposition) algorithm.
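Under the condition described above (a user-item matrix small enough to decompose directly), a truncated SVD can be computed in a few lines; this is a hedged sketch with toy data, not the TensorFlow implementation the post refers to:

```python
import numpy as np

# Toy user-item rating matrix (0 = unrated); real data would be larger,
# but still small enough for a direct decomposition
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Truncated SVD: keep only the k strongest latent factors
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The reconstructed values in originally unrated cells act as predicted affinities
print(np.round(R_hat, 2))
```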


The post A book that every data scientist should read appeared first on Giuseppe Bonaccorso.

Unfortunately, it’s easy to value a belief even more than a factual analysis, and we all know what kind of disasters that can bring about. It’s even worse when a data scientist analyzes a bunch of data starting from a biased viewpoint. Even if there are dozens of powerful models, some tasks require some moderation. Predicting the future of a time series is a hard job unless we are 100% sure that nothing can alter what we have discovered.

Such an approach leads to the idea of trends, which are simple to manage but as dangerous as atom bombs. How many measures can only grow or decrease? In the short term, a trend is an acceptable behavior, but are we authorized to extend this analysis also to the long term? Clearly, the answer is negative. That’s why there are periodicities, seasonalities, and saturations. One of the first analyses that caught my attention was the growth of the world population. Almost everybody is driven to think that such a value can only grow larger and larger, until a catastrophic event.

Luckily, many systems (including a population living in a limited environment) admit fixed points. This means that, believe it or not, they grow until a threshold and then they saturate. Others are unstable and keep oscillating (stock values, for example). There can be a long period characterized by trends but, at a certain point, the internal dynamics force the system to invert its trend and the process is reset.

A good data scientist should always keep these simple concepts in mind before starting any analysis. I’m perfectly aware that too many CEOs would like to know that their profit is going to grow indefinitely, just as many politicians would like to observe an ever-broader consensus. We know that even the best companies have to change their strategies and that there are no “immortal” political parties.

Moreover, data analysis is totally incompatible with beliefs (I’m not referring to statistical beliefs that are tested in many ways, but to irrational or unprovable statements). A data scientist must always work with facts and run away from any potential belief. It doesn’t matter if 98% of the people think that A is true. The question should always be: do the data confirm these hypotheses? If the answer is yes, that isn’t an actual belief, but common sense derived from observations; otherwise, it’s better to avoid going on. The idea of likelihood is the data scientist’s best friend. When the data have been denoised (sometimes this is a problematic step but, in general, it can be carried out efficiently), a model (with the underlying hypotheses) must be checked in terms of likelihood.

Observing the world population data, it’s possible to see that the growth speed is slowing down and that an exponential growth is very unlikely. A data scientist should notice that almost immediately, but a belief could lead him/her to a misunderstanding. In the following plot provided by Max Roser, you can find the confirmation of the previous statement (compare the growth rate with the progressive count):

Prey-predator models have fixed points due to a dynamic interaction between opposite factors. But how can this be true also for human beings? The average life duration is increasing (but, also in this case, I suppose that nobody expects a stable trend) and healthcare is becoming more and more efficient and effective. Hence, a natural conclusion is: the population must increase until there are no more resources (like in a prey-predator model) and, at that point, starvation and diseases will dramatically reduce the numbers.

Such a scenario is wrong (read the book to know the answer and many common objections!), but the population is saturating, and this is the most likely and rational hypothesis. Therefore, whether you are a young or an experienced data scientist, I’m sure you’ll find this book extremely interesting and maybe also very challenging. Maybe I’m wrong but, please, don’t consider mine a belief! At most, propose alternative viewpoints! An optimal solution can arise only from the interaction of different alternatives based on rock-solid facts!

The plot of world population growth has been provided by Max Roser under CC license.


The post Mastering Machine Learning Algorithms appeared first on Giuseppe Bonaccorso.

From the back cover:

Machine learning is a subset of AI that aims to make modern-day computer systems smarter and more intelligent. The real power of machine learning resides in its algorithms, which make even the most difficult things capable of being handled by machines. However, with the advancement in the technology and requirements of data, machines will have to be smarter than they are today to meet the overwhelming data needs; mastering these algorithms and using them optimally is the need of the hour.

Mastering Machine Learning Algorithms is your complete guide to quickly getting to grips with popular machine learning algorithms. You will be introduced to the most widely used algorithms in supervised, unsupervised, and semi-supervised machine learning, and will learn how to use them in the best possible manner. Ranging from Bayesian models to the MCMC algorithm to Hidden Markov models, this book will teach you how to extract features from your dataset and perform dimensionality reduction by making use of Python-based libraries such as scikit-learn. You will also learn how to use Keras and TensorFlow to train effective neural networks.

If you are looking for a single resource to study, implement, and solve end-to-end machine learning problems and use-cases, this is the book you need.

ISBN: 9781788621113

Publisher page: https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-algorithms

Code samples: https://github.com/PacktPublishing/Mastering-Machine-Learning-Algorithms

Safari books: https://www.safaribooksonline.com/library/view/mastering-machine-learning/9781788621113/

Google Books: https://books.google.de/books?id=2HteDwAAQBAJ&printsec=frontcover&dq=mastering+machine+learning+algorithms

Gitter chatroom: https://gitter.im/Machine-Learning-Algorithms/Lobby


The post Mastering Machine Learning Algorithms appeared first on Giuseppe Bonaccorso.

]]>The post Fundamentals of Machine Learning with Scikit-Learn appeared first on Giuseppe Bonaccorso.

]]>From the notes:

As the amount of data continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine Learning applications are everywhere, from self-driving cars, spam detection, document searches, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of big data and data science. The main challenge is how to transform data into actionable knowledge.

In this course you will learn all the important Machine Learning algorithms that are commonly used in the field of data science. These algorithms can be used for supervised as well as unsupervised learning, reinforcement learning, and semi-supervised learning. A few famous algorithms that are covered in this book are: Linear regression, Logistic Regression, SVM, Naive Bayes, K-Means, Random Forest, and Feature engineering. In this course, you will also learn how these algorithms work and their practical implementation to resolve your problems.

ISBN: 9781789134377

Link to the publisher page: https://www.packtpub.com/big-data-and-business-intelligence/fundamentals-machine-learning-scikit-learn-video


The post Fundamentals of Machine Learning with Scikit-Learn appeared first on Giuseppe Bonaccorso.

]]>The post Getting Started with NLP and Deep Learning with Python appeared first on Giuseppe Bonaccorso.

]]>As the amount of data continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine Learning applications are everywhere, from self-driving cars to spam detection, document search, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of Big Data and Data Science. The main challenge is how to transform data into actionable knowledge.

In this course, you’ll be introduced to Natural Language Processing and recommendation systems, which help you run multiple algorithms simultaneously. You’ll also learn about deep learning and TensorFlow. Finally, you’ll see how to create an ML architecture.

ISBN: 9781789138894

Link to the publisher page: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-nlp-and-deep-learning-python-video


The post Getting Started with NLP and Deep Learning with Python appeared first on Giuseppe Bonaccorso.

]]>The post Hetero-Associative Memories for Non Experts: How “Stories” are memorized with Image-associations appeared first on Giuseppe Bonaccorso.

]]>Computer science drove us to think that memories must always be lossless, efficient, and organized like structured repositories. They can be split into standard-size slots and every element can be stored in one or more slots. Once done, it’s enough to save two references: a pointer (a positional number, a couple of coordinates, or any other element) and the number of slots. For example, the book “*War and Peace*” (let’s suppose its length is 1000 units) can be stored in a memory at position 684, so the reference couple is (684, 1000). When necessary, it’s enough to retrieve 1000 units starting from position 684, and every single, exact word written by Tolstoy will appear in front of your eyes.

A RAM (**Random Access Memory**) works in this way (as do hard disks and other similar media). Is this an efficient storage strategy? Of course it is, and every programmer knows how to work with this kind of memory. A variable has a type (which also determines its size). An array has a type and a dimensional structure. In both cases, using names and indexes, it’s possible to access every element at a very high speed.
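The (pointer, length) retrieval scheme can be sketched in a few lines of Python; the buffer size, position, and stored contents below are, of course, arbitrary placeholders:

```python
# A minimal sketch of RAM-like storage: every item is identified only by a
# (position, length) reference pair, mirroring the (684, 1000) book example.
memory = bytearray(2000)          # a flat, slot-based memory

def store(pos: int, data: bytes) -> tuple:
    """Write data at a fixed position and return its (pointer, length) reference."""
    memory[pos:pos + len(data)] = data
    return (pos, len(data))

def retrieve(ref: tuple) -> bytes:
    """Lossless retrieval: every byte comes back exactly as it was stored."""
    pos, length = ref
    return bytes(memory[pos:pos + length])

ref = store(684, b"War and Peace")     # the reference couple, here (684, 13)
assert retrieve(ref) == b"War and Peace"
```

Note that the memory itself carries no meaning: without the reference couple, the stored bytes are unreachable.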

Now, let’s think about the small prey again. Is its memory structured in this way? Let’s suppose it is. The process can be described with the following steps: the sound is produced, the pressure waves propagate at about 330 m/s and arrive at the ears. A complex mechanism transforms the sound into an electric signal which is sent to the brain, where a series of transformations should drive the recognition. If the memory were structured like a bookshelf, a scanning process would have to be performed, comparing each stored pattern with the new one. The most similar element has to be found as soon as possible and the consequent action has to be chosen. The worst-case algorithm for a memory with n locations has this structure:

- For each memory location i:
  - If Memory[i] == Element: return Element
  - Else: continue
The computational cost is O(n), which is not so bad but requires up to n comparisons. A better solution is based on the concept of *hash* or signature (for further information see Hash functions). Each element is associated with a (unique) hash, which can be an integer number used as an index for an over-complete array (N >> n). The computational cost of a good hash function is constant, O(1), and so is the retrieval phase (because the hash is normally *almost unique*); however, another problem arises. A RAM-like memory needs exact locations or a complete scan, but during a scan it’s at least possible to replace direct comparisons with a similarity measure (which introduces some fuzziness and allows matching noisy patterns). With a hash function, the computational cost is dramatically reduced, but similarity matching becomes almost impossible, because the algorithms are designed to generate completely different hashes even for very small changes in the input.
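Both strategies are easy to demonstrate. The sketch below contrasts an O(n) scan with an O(1) dictionary lookup, and shows the avalanche effect that makes hash-based similarity matching impossible; the stored "patterns" are just placeholder strings:

```python
import hashlib

# Worst-case O(n) retrieval: scan every location and compare.
def linear_search(memory, element):
    for i, item in enumerate(memory):
        if item == element:
            return i
    return None

# O(1) retrieval through a hash table (Python's dict).
memory = ["roar", "birdsong", "footsteps"]
index = {item: i for i, item in enumerate(memory)}
assert linear_search(memory, "roar") == index["roar"] == 0

# Hash functions are designed so that even a tiny change in the input yields
# a completely different digest — which is exactly why they cannot support
# similarity-based (fuzzy) matching between noisy patterns.
h1 = hashlib.sha256(b"roar").hexdigest()
h2 = hashlib.sha256(b"roar!").hexdigest()
assert h1 != h2
```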

At the end of the day, this kind of memory has too many problems, and a natural question is: how can animals manage them all? Indeed, animal brains avoid these problems completely. Their memory isn’t a RAM, and all the pieces of information are stored in a completely different way. Without considering all the distinctions introduced by cognitive psychologists (short-term, long-term, working memory, and so on), we can say that an input pattern A, after some processing steps, is transformed into another pattern B:

Normally B has the same *abilities* as the original stimulus. This means that a correctly recognized roar elicits the same response as the sight of an actual lion, allowing a sort of prediction or action anticipation. Moreover, if A is partially corrupted with respect to the original version (here we’re assuming Gaussian noise), the function is able to denoise its output:

This approach is called **associative** and has been studied by several researchers (see [1] and [2]) in the fields of computer science and computational neuroscience. Many models (sometimes completely different in their mathematical formulation) have been designed and engineered (such as BAMs, SOMs, and Hopfield networks). However, their inner logic is always the same: a set of similar patterns (in terms of coarse-grained/fine-grained features) must elicit a similar response, and the inference time must be as short as possible. If you want to briefly understand how some of these models work, you can check these previous articles:

In order to summarize this idea, you can consider the following figure:

The blue line is the representation of a **memory-surface**. At time t=0, nothing has been stored and the line is straight. After some experiences, two basins appear. With some fantasy, if the image of a new cat is sent close to the first basin, it will fall down until reaching the minimum point, where the concept of “cat” is stored. The same happens for the category of “trucks”, and so forth for any other semantic element associated with a specific perception. Even if this approach is based on the concept of energy and needs a dynamic evolution, it can elegantly explain the difference between random access and associative access. At the same time, starting from a basin (a minimum in the memory-surface), it’s possible to retrieve a family of patterns and their common features. This is what Immanuel Kant defined as **figurative synthesis**, and it represents one of the most brilliant results enabled by the neocortex.
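A Hopfield network is the classical realization of such energy basins. The following NumPy sketch is purely illustrative (pattern count, size, and noise level are my own choices): a Hebbian weight matrix stores a few bipolar patterns, and iterating the update rule lets a corrupted input fall into the nearest basin:

```python
import numpy as np

# A minimal Hopfield-style associative memory with bipolar patterns in {-1, +1}.
rng = np.random.default_rng(0)
patterns = rng.choice([-1, 1], size=(3, 64))      # 3 stored patterns, 64 units

# Hebbian storage: W is the sum of the outer products of the stored patterns.
W = np.zeros((64, 64), dtype=float)
for p in patterns:
    W += np.outer(p, p)
np.fill_diagonal(W, 0.0)                          # no self-connections

def recall(x, steps=10):
    """Synchronous updates until the state settles into a basin."""
    for _ in range(steps):
        x = np.where(W @ x >= 0, 1.0, -1.0)
    return x

# Corrupt 8 of the 64 components of the first pattern and let it fall back.
noisy = patterns[0].astype(float)
flip = rng.choice(64, size=8, replace=False)
noisy[flip] *= -1

recalled = recall(noisy)
assert np.sum(recalled == patterns[0]) >= 60      # (almost) full recovery
```

The noisy input is not compared against any stored location: the dynamics themselves drive it into the basin, which is the essential difference from RAM-like retrieval.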

In fact, if somebody asks a person to think about a cat (assuming that this concept is not too familiar), no specific images will be retrieved. On the contrary, a generic, feature-based representation is evoked and adapted to any possible instance belonging to the same family. To express this concept in a more concise way, we can say that we can recover whole concepts through the collection of common features and, if necessary, we can match these features with a real instance.

For example, some dogs are not very dissimilar from some cats, and it’s natural to ask: is it a dog or a cat? In this case, the sets of features partially overlap and we need to collect further pieces of information to reach a final decision. At the same time, if a cat is hidden behind a curtain and someone invites a friend to imagine it, all the possible features belonging to the concept “cat” will be recovered, allowing the friend to guess color, breed, and so on. Try it yourself. Maybe the friend knows that fluffy animals are preferred, so he/she is driven to base the mental model on a Persian. However, after a few seconds, a set of attributes is ready to be communicated.

Surprisingly, when the curtain is opened, a white bunny appears. In this case, the process is a little more complex, because the person trusted his/her friend and implicitly assigned a very high priority to all the pieces of information (including those previously collected). In terms of probability, we say that the prior distribution was peaked around the concept of “cat”, preventing spurious features from corrupting the mental model. (In the previous example, there were probably two smaller peaks around the concepts of “cat” and “dog”, so the model could be partially noisy, allowing more freedom of choice.)

When the image appears, almost none of the predicted features match the bunny, driving the brain to *reset* its belief (not immediately, because the prior keeps a minimum doubt). Luckily, this person has seen many rabbits before, and even after all the wrong indications, his/her associative memories can rapidly recover the new concept, allowing the final decision that the animal isn’t a cat. A hard drive would have had to go back and forth many times, slowing down the process dramatically.

A hetero-encoder is structurally identical to an autoencoder (see Lossy Image Autoencoders). The only difference is the association: the latter trains a model in order to obtain:

While a hetero-encoder trains a model that is able to perform the association:

The source code is reported in this GIST and at the end of the article. It’s based on TensorFlow (Python 3.5) and is split into an encoding part, a small convolutional network followed by a dense layer, whose role is to transform the input (Batch size × 32 × 32 × 3) into a feature vector that can be fed into the decoder. The decoder processes the feature vector with a couple of dense layers and performs a deconvolution (transposed convolution) to build the output (Batch size × 32 × 32 × 3).

The model is trained using an L2 loss function computed on the difference between expected and predicted output. An extra L1 loss can be added to the feature vector to increase the sparsity. The training process takes a few minutes with GPU support and 500 epochs.
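Leaving the convolutional details aside, the training logic can be sketched with a deliberately tiny, fully-dense stand-in; the shapes, learning rate, and epoch count below are illustrative and not taken from the GIST:

```python
import numpy as np

# A minimal dense hetero-associator: encode(x) = tanh(x W1), decode(h) = h W2,
# trained with an L2 loss between predicted and expected (associated) outputs.
rng = np.random.default_rng(1)
X_source = rng.normal(size=(50, 32))        # 50 flattened source patterns
X_dest = X_source[rng.permutation(50)]      # shuffled destination patterns

W1 = rng.normal(scale=0.1, size=(32, 16))   # encoder: 32 -> 16 features
W2 = rng.normal(scale=0.1, size=(16, 32))   # decoder: 16 -> 32 outputs
eta = 0.01

losses = []
for epoch in range(500):
    H = np.tanh(X_source @ W1)              # feature vectors
    Y = H @ W2                              # predicted associations
    E = Y - X_dest
    losses.append(np.mean(E ** 2))          # L2 loss
    # Backpropagation through the two layers.
    gW2 = H.T @ E / len(X_source)
    gH = (E @ W2.T) * (1.0 - H ** 2)        # tanh derivative
    gW1 = X_source.T @ gH / len(X_source)
    W1 -= eta * gW1
    W2 -= eta * gW2

assert losses[-1] < losses[0]               # the association error decreases
```

An L1 penalty on H (added to the loss with a small coefficient) would increase the sparsity of the feature vectors, exactly as described above.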

The model itself is not complex, nor based on rocket science, but a few considerations are useful to understand why this approach is really important:

- The model is implicitly cumulative. In other words, the function g(•) works with all the input images, transforming them into the corresponding output.
- No If statements are present. In the common algorithmic logic, g(•) should check for the input image and select the right transformation. On the contrary, a neural model can make this choice implicitly.
- All pattern transformations are stored in the parameter set, whose plasticity allows continuous training.
- A noisy version of the input elicits a response whose L2 loss with the original one is minimized. Increasing the complexity of both encoder and decoder, it’s possible to further increase the noise robustness. This is a fundamental concept because it’s almost impossible to have two identical perceptions.

In our “experiment”, the destination dataset is a shuffled version of the original one, therefore different periodic sequences are possible. I’m going to show this with a bit of fantasy. Each sequence has a fixed length of 20 pictures (19 associations). The first picture is freely chosen, while all the others are generated by a chain process.

I’ve randomly selected (with a fixed seed) 50 CIFAR-10 pictures (through Keras’ dataset-loading function) as building blocks for the hetero-associative memory. Unfortunately, the categories are quite weird (frogs, ostriches, and deer, together with cars, planes, trucks, and, obviously, lovely cats), but they allow using some fantasy in recreating a possible outline. The picture *collage* is shown in the following figure:

The original sequence (X_source) is then shuffled and turned into a destination sequence (X_dest). In this way, each original image will always be associated with another one belonging to the same group, and different periodic sequences can be discovered.
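Since the shuffle is a permutation, every chain of associations must eventually return to its starting picture: each “story” is one of the permutation’s cycles. A short sketch (with a hypothetical permutation of 50 indices, not the one from the experiment) shows how such sequences can be extracted:

```python
import numpy as np

# The shuffled association maps picture i to picture perm[i]; following the
# chain repeatedly traces a cycle of the permutation — a "story".
rng = np.random.default_rng(0)
perm = rng.permutation(50)                 # hypothetical shuffle of 50 images

def story(start, max_len=20):
    """Follow the chain of associations until it closes or max_len is hit."""
    sequence, i = [start], int(perm[start])
    while i != start and len(sequence) < max_len:
        sequence.append(i)
        i = int(perm[i])
    return sequence

s = story(0)
assert len(set(s)) == len(s)               # no picture repeats inside a story
```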

The user can enlarge the dataset and experiment with different combinations. In this case, with 50 samples, I’ve discovered a few interesting sequences that I’ve called “stories”.

John is looking at a boat when a noisy truck draws his attention to a concrete wall being built behind him.

While walking in a pet shop, John saw an exotic frog and suddenly remembered that he needed to buy some food for his cat. His favorite brand is Acme Corp., and he had seen their truck several times. Meanwhile, the frog croaked and he turned his head towards another terrarium. The color of the sand and the artificial landscape drove him to think about a ranch where he had ridden a horse for the first time. While trotting next to a pond, a duck drew his attention and he was about to fall down. At that moment, the frog croaked again and he decided to hurry up.

John is looking at a small bird and remembers his grandpa, who was a passionate bird-watcher. Grandpa had an old light-blue car and, during a short trip when John was a child, they saw an ostrich. Their small dog began barking and John asked his grandpa to speed up, but he answered: “Hey, this is not a Ferrari!” (OK, my fantasy is going too fast…). During the same trip, they saw a deer and John took a photo with his brand-new camera.

Story A:

Story B:

These two cases are a little more complex and I prefer not to fantasize. The concept is always the same: an input drives the network to produce an association B (which can be made up of visual, auditory, olfactory, … elements). B can elicit a new response C, and so on.

Two elements are interesting and worth investigating:

- Using a recurrent neural network (like an LSTM) to process short sequences with different components (like pictures and sounds)
- Adding a “subconscious” layer that can influence the output according to a partially autonomous process

View the code on Gist.

- Kosko B., Bidirectional Associative Memory, IEEE Transactions on Systems, Man, and Cybernetics, v. 18/1, 1988
- Dayan P., Abbott L. F., Theoretical Neuroscience, The MIT Press
- Trappenberg T., Fundamentals of Computational Neuroscience, Oxford University Press
- Izhikevich E. M., Dynamical Systems in Neuroscience, The MIT Press
- Rieke F., Warland D., Ruyter Van Steveninck R., Bialek W., Spikes: Exploring the Neural Code, A Bradford Book

## Lossy image autoencoders with convolution and deconvolution networks in Tensorflow – Giuseppe Bonaccorso

Autoencoders are a very interesting deep learning application because they allow a consistent dimensionality reduction of an entire dataset with a controllable loss level. The Jupyter notebook for this small project is available in the Github repository: https://github.com/giuseppebonaccorso/lossy_image_autoencoder.

The post Hetero-Associative Memories for Non Experts: How “Stories” are memorized with Image-associations appeared first on Giuseppe Bonaccorso.

]]>The post A glimpse into the Self-Organizing Maps (SOM) appeared first on Giuseppe Bonaccorso.

]]>Let’s consider a dataset containing N p-dimensional samples. A suitable SOM is a matrix (other shapes, like toroids, are also possible) containing (K × L) receptors, each of which is made up of p synaptic weights. The resulting structure is a three-dimensional matrix W with shape (K × L × p).

During the learning process, a sample x is drawn from the data generating distribution X and a winning unit is computed as:

In the previous formula, I’ve adopted the vectorized notation. The process must compute the distance between x and all the p-dimensional vectors W[i, j], determining the tuple (i, j) corresponding to the unit that has shown the maximum activation (minimum distance). Once this winning unit has been found, a **distance function** must be computed. Consider the schema shown in the following figure:

At the beginning of the training process, we cannot be sure that the winning unit will remain the same, therefore we apply the update to a whole neighborhood centered in (i, j), with a radius that decreases proportionally to the training epoch. In the figure, for example, the update can initially involve also the units (2, 3) → (2, L), …, (K, 3) → (K, L), but, after some epochs, the radius shrinks to 1, so only (i, j) is considered, without any other neighbour.

This function is represented as a (K × L) matrix whose generic element is:

The numerator of the exponential is the Euclidean distance between the winning unit and the generic receptor (i, j). The spread is controlled by the parameter σ(t), which should decrease (possibly exponentially) with the number of epochs. Many authors (like Floreano and Mattiussi; see the references for further information) suggest introducing a time constant τ and defining σ(t) as:

The competitive update rule for the weights is:

Where η(t) is the learning rate, which can be fixed (for example, η = 0.05) or exponentially decaying like σ(t). The weights are updated by summing the Δw term:

As it’s possible to see, the weights belonging to the neighborhood of the winning unit (determined by δf) are simply moved closer to x, while the others remain fixed (δf = 0). The role of the distance function is to impose a maximum update (which must produce the strongest response) in proximity to the winning unit, and a decreasing one for all the other units. In the following figure there’s a bidimensional representation of this process:

All the units with i < i-2 and i > i+2 are kept fixed. Therefore, when the distance function becomes narrow enough to include only the winning unit, a maximum-selectivity principle is achieved through a competitive process. This strategy is necessary to allow the map to create the responsive areas in the most efficient way. Without a distance function, the competition could not happen at all or, in some cases, the same unit could be activated by the same patterns, driving the network into an infinite cycle.

Even if we don’t provide any proof, a Kohonen network trained using a decreasing distance function will converge to a stable state if the matrix (K × L) is large enough. In general, I suggest trying smaller matrices first (but with KL > N) and increasing the dimensions until the desired result is achieved. Unfortunately, however, the process is quite slow, and convergence is normally achieved only after thousands of iterations over the whole dataset. For this reason, it’s often preferable to reduce the dimensionality (via PCA, Kernel PCA, or NMF) and use a standard clustering algorithm.

An important element is the initialization of W. There are no best practices; however, if the weights are already close to the most significant components of the population, the convergence can be faster. One possibility is to perform a spectral decomposition (as in PCA) to determine the principal eigenvector(s). However, when the dataset presents non-linearities, this process is almost useless. I suggest initializing the weights by randomly sampling the values from a Uniform(min(X), max(X)) distribution (X is the input dataset). Another possibility is to try different initializations, computing the average error on the whole dataset and taking the values that minimize this error (it’s useful to sample from both Gaussian and Uniform distributions, with a number of trials greater than 10). Sampling from a Uniform distribution, in general, produces average errors with a standard deviation < 0.1, therefore it doesn’t make sense to try too many combinations.
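Putting the previous pieces together (winner search, exponentially decaying Gaussian neighborhood, uniform initialization), the whole training loop can be sketched as follows; the dataset, map size, σ(0), τ, and η are illustrative and much smaller than the values used in the test below:

```python
import numpy as np

# A compact NumPy sketch of the SOM training loop described above.
rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 3))                       # N=100 samples, p=3
K, L, epochs = 8, 8, 500
W = rng.uniform(X.min(), X.max(), size=(K, L, 3))    # uniform initialization

sigma0, tau, eta = 3.0, 200.0, 0.5
rows, cols = np.indices((K, L))

for t in range(epochs):
    x = X[rng.integers(len(X))]                      # draw a sample
    d = np.linalg.norm(W - x, axis=2)                # distances to all units
    i, j = np.unravel_index(np.argmin(d), (K, L))    # winning unit
    sigma = sigma0 * np.exp(-t / tau)                # shrinking radius
    dist2 = (rows - i) ** 2 + (cols - j) ** 2
    delta = np.exp(-dist2 / (2.0 * sigma ** 2))      # distance function
    W += eta * delta[:, :, None] * (x - W)           # move neighborhood to x

# After training, every sample should lie close to its best-matching unit.
err = np.mean([np.min(np.linalg.norm(W - x, axis=2)) for x in X])
assert err < 0.5
```

Switching the import to CuPy (as in the GIST) would offload the block computations to the GPU without changing the logic.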

To test the SOM, I’ve decided to use the first 100 faces of the Olivetti dataset (*AT&T Laboratories Cambridge*), provided by Scikit-Learn. Each sample is a vector of 4096 floats in [0, 1] (corresponding to 64 × 64 grayscale images). I’ve used a matrix with shape (20 × 20), 5000 iterations, a distance function with σ(0) = 10 and τ = 400 for the first 100 iterations, and the fixed values σ = 0.01 and η = 0.1 ÷ 0.5 for the remaining ones. In order to speed up the block computations, I’ve decided to use CuPy, which works like NumPy but exploits the GPU (it’s possible to switch to NumPy by simply changing the import and using the namespace np instead of cp). Unfortunately, there are many cycles and it’s not so easy to parallelize all the operations. The code is available in this GIST:

View the code on Gist.

The matrix with the weight vectors reshaped as square images (like the original samples), is shown in the following figure:

A sharp-eyed reader can see some slight differences between the faces (in the eyes, nose, mouth, and brow). The most defined faces correspond to winning units, while the others are neighbors. Consider that each of them is a synaptic weight vector and it must match a specific pattern. We can think of these vectors as pseudo-eigenfaces, as in PCA, even if the competitive learning tries to find the values that minimize the Euclidean distance, so the objective is not to find independent components. As the dataset is limited to 100 samples, not all the face details have been included in the training set.

References:

- Floreano D., Mattiussi C., Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies, The MIT Press

See also:

## ML Algorithms addendum: Passive Aggressive Algorithms – Giuseppe Bonaccorso

Passive-Aggressive algorithms are a family of online learning algorithms (for both classification and regression) proposed by Crammer et al. The idea is very simple and their performance has been proven to be superior to many other alternative methods, like the Online Perceptron and MIRA (see the original paper in the reference section).

The post A glimpse into the Self-Organizing Maps (SOM) appeared first on Giuseppe Bonaccorso.

]]>The post ML Algorithms addendum: Passive Aggressive Algorithms appeared first on Giuseppe Bonaccorso.

]]>Let’s suppose to have a dataset:

The index t has been chosen to mark the temporal dimension. In this case, in fact, the samples can keep arriving for an indefinite time. Of course, if they are drawn from the same data-generating distribution, the algorithm will keep learning (probably without large parameter modifications), but if they are drawn from a completely different distribution, the weights will slowly *forget* the previous one and learn the new distribution. For simplicity, we also assume we’re working with a binary classification based on bipolar labels.

Given a weight vector w, the prediction is simply obtained as:

All these algorithms are based on the Hinge loss function (the same used by SVM):

The value of L is bounded between 0 (a perfect match) and K > 0 (a completely wrong prediction), where K depends on f(x(t), θ). A Passive-Aggressive algorithm generically works with this update rule:

To understand this rule, let’s assume the slack variable ξ=0 (and L constrained to be 0). If a sample x(t) is presented, the classifier uses the current weight vector to determine the sign. If the sign is correct, the loss function is 0 and the argmin is w(t). This means that the algorithm is **passive** when a correct classification occurs. Let’s now assume that a misclassification occurred:

The angle θ > 90°, therefore the dot product is negative and the sample is classified as -1; however, its label is +1. In this case, the update rule becomes very **aggressive**, because it looks for a new w which must be as close as possible to the previous one (otherwise the existing knowledge is immediately lost), but which must satisfy L = 0 (in other words, the classification must be correct).

The introduction of the slack variable allows soft margins (as in SVM) and a degree of tolerance controlled by the parameter C. In particular, the loss function only has to satisfy L ≤ ξ, allowing a larger error. Higher C values yield stronger aggressiveness (with a consequently higher risk of destabilization in the presence of noise), while lower values allow a better adaptation. In fact, this kind of algorithm, when working online, must cope with the presence of noisy samples (with wrong labels). Good robustness is necessary, otherwise too-rapid changes produce consequently higher misclassification rates.

After solving both update conditions, we get the closed-form update rule:

This rule confirms our expectations: the weight vector is updated with a factor whose sign is determined by y(t) and whose magnitude is proportional to the error. Note that if there’s no misclassification, the numerator becomes 0, so w(t+1) = w(t), while, in case of misclassification, w rotates towards x(t) and stops with a loss L ≤ ξ. In the next figure, the effect has been exaggerated to show the rotation; however, it’s normally as small as possible:

After the rotation, θ < 90° and the dot product becomes positive, so the sample is correctly classified as +1. Scikit-Learn implements Passive-Aggressive algorithms, but I preferred to implement the code myself, just to show how simple they are. In the next snippet (also available in this GIST), I first create a dataset, then compute the score with a Logistic Regression, and finally apply the PA algorithm and measure the final score on a test set:

View the code on Gist.
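In case the embedded gist doesn’t render, here is an independent minimal sketch of the PA-I classification variant; the toy dataset, the C value, and the clipping of τ at C are my own choices, not taken from the gist:

```python
import numpy as np

# A minimal Passive-Aggressive (PA-I) binary classifier with bipolar labels.
rng = np.random.default_rng(0)

# Linearly separable toy data: two Gaussian clusters along the first axis.
X_pos = rng.normal(loc=(2.0, 0.0), size=(250, 2))
X_neg = rng.normal(loc=(-2.0, 0.0), size=(250, 2))
X = np.vstack([X_pos, X_neg])
y = np.hstack([np.ones(250), -np.ones(250)])
idx = rng.permutation(500)                        # samples arrive in order
X, y = X[idx], y[idx]

w = np.zeros(2)
C = 0.1                                           # aggressiveness parameter
for x_t, y_t in zip(X, y):
    loss = max(0.0, 1.0 - y_t * np.dot(w, x_t))   # hinge loss
    if loss > 0.0:                                # aggressive step
        tau = min(C, loss / np.dot(x_t, x_t))     # clipped at C (soft margin)
        w += tau * y_t * x_t
    # otherwise: passive, w is left unchanged

accuracy = np.mean(np.sign(X @ w) == y)
assert accuracy > 0.9
```

Clipping τ at C is what makes the update tolerant to noisy labels: a single wrong sample cannot rotate w arbitrarily far.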

For regression, the algorithm is very similar, but it’s now based on a slightly different Hinge loss function (called ε-insensitive):

The parameter ε determines a tolerance for prediction errors. The update conditions are the same as those adopted for classification problems, and the resulting update rule is:

Just as for classification, Scikit-Learn also implements a regression variant; however, in the next snippet (also available in this GIST), there’s a custom implementation:

View the code on Gist.
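Again, independently of the embedded gist, a minimal PA-I regression sketch with the ε-insensitive loss might look as follows (the noiseless linear target and the hyperparameters are illustrative):

```python
import numpy as np

# A minimal Passive-Aggressive regressor with the epsilon-insensitive loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([0.5, -1.0, 2.0])               # hypothetical target weights
y = X @ true_w

w = np.zeros(3)
C, eps = 1.0, 0.01
errors = []
for x_t, y_t in zip(X, y):
    err = y_t - np.dot(w, x_t)
    errors.append(abs(err))
    loss = max(0.0, abs(err) - eps)               # epsilon-insensitive loss
    if loss > 0.0:                                # aggressive step
        tau = min(C, loss / np.dot(x_t, x_t))
        w += tau * np.sign(err) * x_t

# The transient error is high, then the weights converge towards true_w.
assert np.mean(errors[-100:]) < np.mean(errors[:100])
```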

The error plot is shown in the following figure:

The quality of the regression (in particular, the length of the transient period, when the error is high) can be controlled by picking better C and ε values. In particular, I suggest checking different range centers for C (100, 10, 1, 0.1, 0.01) in order to determine whether a higher aggressiveness is preferable.

References:

- Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y., Online Passive-Aggressive Algorithms, Journal of Machine Learning Research 7 (2006) 551–585

See also:

## ML Algorithms Addendum: Instance Based Learning – Giuseppe Bonaccorso

Contrary to the majority of machine learning algorithms, Instance Based Learning is model-free, meaning that there are no strong assumptions about the structure of regressors, classifiers or clustering functions. They are “simply” determined by the data, according to an affinity induced by a distance metric (the most common name for this approach is Nearest Neighbors).

The post ML Algorithms addendum: Passive Aggressive Algorithms appeared first on Giuseppe Bonaccorso.

]]>