The post Mastering Machine Learning Algorithms appeared first on Giuseppe Bonaccorso.

From the back cover:

Machine learning is a subset of AI that aims to make modern-day computer systems smarter and more intelligent. The real power of machine learning resides in its algorithms, which make even the most difficult things capable of being handled by machines. However, with the advancement in the technology and requirements of data, machines will have to be smarter than they are today to meet the overwhelming data needs; mastering these algorithms and using them optimally is the need of the hour.

Mastering Machine Learning Algorithms is your complete guide to quickly getting to grips with popular machine learning algorithms. You will be introduced to the most widely used algorithms in supervised, unsupervised, and semi-supervised machine learning, and will learn how to use them in the best possible manner. Ranging from Bayesian models to the MCMC algorithm to Hidden Markov models, this book will teach you how to extract features from your dataset and perform dimensionality reduction by making use of Python-based libraries such as scikit-learn. You will also learn how to use Keras and TensorFlow to train effective neural networks.

If you are looking for a single resource to study, implement, and solve end-to-end machine learning problems and use-cases, this is the book you need.

ISBN: 9781788621113

Publisher page: https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-algorithms

Code samples: https://github.com/PacktPublishing/Mastering-Machine-Learning-Algorithms

Safari books: https://www.safaribooksonline.com/library/view/mastering-machine-learning/9781788621113/

Google Books: https://books.google.de/books?id=2HteDwAAQBAJ&printsec=frontcover&dq=mastering+machine+learning+algorithms

Gitter chatroom: https://gitter.im/Machine-Learning-Algorithms/Lobby

## Machine Learning Algorithms – Giuseppe Bonaccorso

My latest machine learning book has been published and will be available during the last week of July. From the back cover: In this book you will learn all the important Machine Learning algorithms that are commonly used in the field of data science.

The post Fundamentals of Machine Learning with Scikit-Learn appeared first on Giuseppe Bonaccorso.

From the notes:

As the amount of data continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine Learning applications are everywhere, from self-driving cars, spam detection, document searches, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of big data and data science. The main challenge is how to transform data into actionable knowledge.

In this course you will learn all the important Machine Learning algorithms that are commonly used in the field of data science. These algorithms can be used for supervised as well as unsupervised learning, reinforcement learning, and semi-supervised learning. A few famous algorithms that are covered in this book are: Linear regression, Logistic Regression, SVM, Naive Bayes, K-Means, Random Forest, and Feature engineering. In this course, you will also learn how these algorithms work and their practical implementation to resolve your problems.

ISBN: 9781789134377

Link to the publisher page: https://www.packtpub.com/big-data-and-business-intelligence/fundamentals-machine-learning-scikit-learn-video

The post Getting Started with NLP and Deep Learning with Python appeared first on Giuseppe Bonaccorso.

As the amount of data continues to grow at an almost incomprehensible rate, being able to understand and process data is becoming a key differentiator for competitive organizations. Machine Learning applications are everywhere, from self-driving cars to spam detection, document search, and trading strategies, to speech recognition. This makes machine learning well-suited to the present-day era of Big Data and Data Science. The main challenge is how to transform data into actionable knowledge.

In this course, you’ll be introduced to Natural Language Processing and Recommendation Systems, which help you run multiple algorithms simultaneously. You’ll also learn about Deep Learning and TensorFlow. Finally, you’ll see how to create an ML architecture.

ISBN: 9781789138894

Link to the publisher page: https://www.packtpub.com/big-data-and-business-intelligence/getting-started-nlp-and-deep-learning-python-video

The post Hetero-Associative Memories for Non Experts: How “Stories” are memorized with Image-associations appeared first on Giuseppe Bonaccorso.

Computer science drove us to think that memories must always be lossless, efficient, and organized like structured repositories. They can be split into standard-size slots, and every element can be stored in one or more slots. Once done, it’s enough to save two references: a pointer (a positional number, a pair of coordinates, or any other element) and the number of slots. For example, the book “*War and Peace*” (let’s suppose its length is 1000 units) can be stored in a memory at position 684, so the reference pair is (684, 1000). When necessary, it’s enough to retrieve 1000 units starting from position 684, and every single, exact word written by Tolstoy will appear in front of your eyes.

A RAM (**Random Access Memory**) works in this way (as do hard disks and other similar storage media). Is this an efficient storage strategy? Of course it is, and every programmer knows how to work with this kind of memory. A variable has a type (which also determines its size). An array has a type and a dimensional structure. In both cases, using names and indexes, it’s possible to access every element at a very high speed.

Now, let’s think about the small prey again. Is its memory structured in this way? Let’s suppose it is. The process can be described with the following steps: the sound is produced, the pressure waves propagate at about 330 m/s and arrive at the ears. A complex mechanism transforms the sound into an electric signal which is sent to the brain where a series of transformations should drive the recognition. If the memory were structured like a book-shelf, a scanning process should be performed, comparing each pattern with the new one. The most similar element has to be found as soon as possible and the consequent action has to be chosen. The worst-case algorithm for a memory with n locations has this structure:

- For each memory location i:
  - If Memory[i] == Element:
    - return Element
  - Else:
    - continue

The computational cost is O(n), which is not so bad, but it requires up to n comparisons. A better solution is based on the concept of *hash* or signature (for further information see Hash functions). Each element is associated with an (ideally unique) hash, which can be an integer used as an index into an over-complete array (N >> n). With a good hash function, both computing the hash and retrieving the element have a constant cost, O(1), because the hash is normally *almost unique*. However, another problem arises. A RAM-like memory needs exact locations or a complete scan, although during a scan it’s possible to replace direct comparisons with a similarity measure (which introduces some fuzziness and allows matching noisy patterns). With a hash function, the computational cost is dramatically reduced, but similarity matching becomes almost impossible, because hashing algorithms are designed to generate completely different hashes even for very small changes in the input.
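As a minimal sketch of the trade-off described above (the stored strings and the choice of SHA-256 are assumptions made only for illustration), the following snippet compares a linear scan with a hash-based lookup and shows that a one-character change yields an unrelated digest:

```python
import hashlib

def linear_scan(memory, element):
    """Return the index of element after at most len(memory) comparisons, or None."""
    for i, stored in enumerate(memory):
        if stored == element:
            return i
    return None

# Hypothetical memory content
memory = ["frog", "truck", "cat", "lion roar"]

# O(n) retrieval: every slot may have to be inspected
scan_index = linear_scan(memory, "cat")

# O(1) average-case retrieval: Python dicts are hash tables
hash_table = {element: i for i, element in enumerate(memory)}
dict_index = hash_table["cat"]

# A tiny change in the input produces a different digest, so the hash
# cannot be used to measure the similarity between noisy patterns
digest_a = hashlib.sha256(b"lion roar").hexdigest()
digest_b = hashlib.sha256(b"lion roar!").hexdigest()
```

The scan could be adapted to a similarity measure by replacing the equality test with a distance threshold, while the dictionary lookup only works for exact matches.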

At the end of the day, this kind of memory has too many problems, and a natural question is: how can animals manage them all? Indeed, animal brains avoid these problems completely. Their memory isn’t a RAM, and all the pieces of information are stored in a completely different way. Without considering all the distinctions introduced by cognitive psychologists (short-term, long-term, working memory, and so on), we can say that an input pattern A, after some processing steps, is transformed into another pattern B:

Normally B has the same *abilities* as an original stimulus. This means that a correctly recognized roar elicits the same response as the sight of an actual lion, allowing a sort of prediction or action anticipation. Moreover, if A is partially corrupted with respect to the original version (here we’re assuming Gaussian noise), the function is able to denoise its output:

This approach is called **associative** and it has been studied by several researchers (like [1] and [2]) in the fields of computer science and computational neuroscience. Many models (sometimes completely different in their mathematical formulation) have been designed and engineered (like BAMs, SOMs, and Hopfield Networks). However, their inner logic is always the same: a set of similar patterns (in terms of coarse-grained/fine-grained features) must elicit a similar response, and the inference time must be as short as possible. If you want to briefly understand how some of these models work, you can check these previous articles:

In order to summarize this idea, you can consider the following figure:

The blue line is the representation of a **memory-surface**. At time t=0, nothing has been stored and the line is straight. After some experiences, two basins appear. With some fantasy, if the image of a new cat is sent close to the basin, it will fall down until it reaches the minimum point, where the concept of “cat” is stored. The same happens for the category of “trucks” and so forth for any other semantic element associated with a specific perception. Even if this approach is based on the concept of energy and needs a dynamic evolution, it can be elegantly employed to explain the difference between random access and associative access. At the same time, starting from a basin (which is a minimum in the memory-surface), it’s possible to retrieve a family of patterns and their common features. This is what Immanuel Kant defined as **figurative synthesis** and represents one of the most brilliant results allowed by the neocortex.

In fact, if somebody asks a person to think about a cat (assuming that this concept is not too familiar), no specific images will be retrieved. On the contrary, a generic, feature-based representation is evoked and adapted to any possible instance belonging to the same family. To express this concept in a more concise way, we can say that we can recover whole concepts through the collection of common features and, if necessary, we can match these features with a real instance.

For example, there are some dogs that are not very dissimilar to some cats, and it’s natural to ask: is it a dog or a cat? In this case, the set of features has a partial overlap and we need to collect further pieces of information to reach a final decision. At the same time, if a cat is hidden behind a curtain and someone invites a friend to imagine it, all the possible features belonging to the concept “cat” will be recovered to allow guessing color, breed, and so on. Try it yourself. Maybe the friend knows that fluffy animals are preferred, so he/she is driven to build the model on a Persian. However, after a few seconds, a set of attributes is ready to be communicated.

Surprisingly, when the curtain is opened, a white bunny appears. In this case, the process is a little bit more complex because the person trusted his/her friend and implicitly assigned a very high priority to all pieces of information (also previously collected). In terms of probability, we say that the prior distribution was peaked around the concept of “cat”, avoiding spurious features to corrupt the mental model. (In the previous example, there were probably two smaller peaks around the concept of “cat” and “dog”, so the model could be partially noisy, allowing more freedom of choice).

When the image appears, almost none of the main predicted features match the bunny, driving the brain to *reset* its belief (not immediately, because the prior keeps a minimum doubt). Luckily, this person has seen many rabbits before that moment and, even after all the wrong indications, his/her associative memory can rapidly recover a new concept, allowing the final decision that the animal isn’t a cat. A hard drive would have had to go back and forth many times, slowing down the process in a dramatic way.

A hetero-encoder is structurally identical to an auto-encoder (see Lossy Image Autoencoders). The only difference is the association: the latter trains a model in order to obtain:

While a hetero-encoder trains a model that is able to perform the association:
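As a hedged sketch of the two associations, writing g(•) for the trained network (the symbol used later in this article) and taking x_i and y_i as assumed names for an input image and the destination image paired with it, the difference can be written as:

```latex
\text{auto-encoder:}\quad g(x_i) \approx x_i
\qquad\qquad
\text{hetero-encoder:}\quad g(x_i) \approx y_i
```

In the first case the target is the input itself, while in the second case it is a different pattern chosen by the association.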

The source code is reported in this GIST and at the end of the article. It’s based on TensorFlow (Python 3.5) and it’s split into two parts. The encoder is a small convolutional network followed by a dense layer; its role is to transform the input (Batch size × 32 × 32 × 3) into a feature vector that can be fed into the decoder. The decoder processes the feature vector with a couple of dense layers and performs a deconvolution (transpose convolution) to build the output (Batch size × 32 × 32 × 3).

The model is trained using an L2 loss function computed on the difference between expected and predicted output. An extra L1 loss can be added to the feature vector to increase the sparsity. The training process takes a few minutes with GPU support and 500 epochs.
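The loss described above can be sketched in plain NumPy. This is only an illustration, not the code of the GIST: the 1/2 factor on the L2 term, the weight `l1_weight`, and the 128-dimensional feature vector are assumptions.

```python
import numpy as np

def hetero_encoder_loss(y_true, y_pred, code, l1_weight=1e-3):
    """L2 loss on the output plus an optional L1 sparsity penalty on the code."""
    # Half the sum of squared differences (the 1/2 factor is a convention)
    l2 = 0.5 * np.sum((y_true - y_pred) ** 2)
    # L1 penalty: pushes many components of the feature vector towards zero
    l1 = l1_weight * np.sum(np.abs(code))
    return l2 + l1

rng = np.random.default_rng(0)
y_true = rng.uniform(size=(4, 32, 32, 3))   # expected (destination) images
y_pred = y_true + 0.1                       # a deliberately biased prediction
code = rng.normal(size=(4, 128))            # hypothetical feature vectors

loss_dense = hetero_encoder_loss(y_true, y_pred, code)
loss_sparse = hetero_encoder_loss(y_true, y_pred, np.zeros_like(code))
```

With the same prediction error, a code made only of zeros gives the lower total loss, which is the pressure towards sparsity mentioned in the text.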

The model itself is not complex, nor based on rocket science, but a few considerations are useful to understand why this approach is really important:

- The model is implicitly cumulative. In other words, the function g(•) works with all the input images, transforming them into the corresponding output.
- No If statements are present. In the common algorithmic logic, g(•) should check for the input image and select the right transformation. On the contrary, a neural model can make this choice implicitly.
- All pattern transformations are stored in the parameter set, whose plasticity allows continuous training.
- A noisy version of the input elicits a response whose L2 loss with the original one is minimized. Increasing the complexity of both encoder and decoder, it’s possible to further increase the noise robustness. This is a fundamental concept because it’s almost impossible to have two identical perceptions.

In our “experiment”, the destination dataset is a shuffled version of the original one, therefore different periodic sequences are possible. I’m going to show this with some fictional fantasy. Each sequence has the fixed length of 20 pictures (19 associations). The first picture is freely chosen, while all the others are generated with a chain process.

I’ve randomly selected (with a fixed seed) 50 CIFAR-10 pictures (through the Keras dataset loading function) as building blocks for the hetero-associative memory. Unfortunately, the categories are quite weird (frogs, ostriches, deer, together with cars, planes, trucks, and, obviously, lovely cats), but they allow using some fantasy in recreating the possible outline. The picture *collage* is shown in the following figure:

The original sequence (X_source) is then shuffled and turned into a destination sequence (X_dest). In this way, each original image will be always associated with another one belonging to the same group and different periodic sequences can be discovered.
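A minimal sketch of this idea, working only on picture indices: it assumes that the shuffle is a permutation (so that X_dest = X_source[dest]) and that the network recalls every association exactly. The seed value and the helper names `story` and `period` are hypothetical.

```python
import numpy as np

n_samples = 50
rng = np.random.RandomState(1000)   # fixed seed (the value is arbitrary)

# dest[i] is the index of the picture associated with picture i
dest = rng.permutation(n_samples)

def story(start, dest, length=20):
    """Follow the chain start -> dest[start] -> ... for `length` pictures."""
    sequence = [start]
    while len(sequence) < length:
        sequence.append(int(dest[sequence[-1]]))
    return sequence

def period(start, dest):
    """Number of associations needed to come back to the starting picture."""
    steps, current = 1, dest[start]
    while current != start:
        current = dest[current]
        steps += 1
    return steps

sequence = story(0, dest)           # 20 pictures, 19 associations
cycle_length = period(0, dest)
```

Because a permutation is made of disjoint cycles, every starting picture belongs to exactly one periodic sequence, and a 20-picture story simply repeats whenever the cycle is shorter than 20.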

The user can enlarge the dataset and experiment with different combinations. In this case, with 50 samples, I’ve discovered a few interesting sequences that I’ve called “stories”.

John is looking at a boat when a noisy truck behind him draws his attention to a concrete wall that is being built.

While walking in a pet-shop, John saw an exotic frog and suddenly remembered that he needed to buy some food for his cat. The favorite brand is Acme Corp. and he has seen their truck several times. Meanwhile, the frog croaks and he turns his head towards another terrarium. The color of the sand and the artificial landscape drive him to think about a ranch where he rode a horse for the first time. While trotting next to a pond, a duck drew his attention and he was about to fall down. In that moment the frog croaked again and he decided to hurry up.

John is looking at a small bird, while he remembers his grandpa, who was a passionate bird-watcher. He had a light-blue old car and when John was a child, during a short trip with his grandpa, he saw an ostrich. Their small dog began barking and he asked his grandpa to speed up, but he answered: “Hey, this is not a Ferrari!” (Ok, my fantasy is going too fast…). During the same trip, they saw a deer and John took a photo with his brand new camera.

Story A:

Story B:

These two cases are a little bit more complex and I prefer not to fantasize. The concept is always the same: an input A drives the network to produce an association B (which can be made up of visual, auditory, olfactory, … elements). B can elicit a new response C and so on.

Two elements are interesting and are worth investigating:

- Using a Recurrent Neural Network (like an LSTM) to process short sequences with different components (like pictures and sounds)
- Adding a “subconscious” layer, that can influence the output according to a partially autonomous process.

View the code on Gist.

- Kosko B., Bidirectional Associative Memory, IEEE Transactions on Systems, Man, and Cybernetics, v. 18/1, 1988
- Dayan P., Abbott L. F., Theoretical Neuroscience, The MIT Press
- Trappenberg T., Fundamentals of Computational Neuroscience, Oxford University Press
- Izhikevich E. M., Dynamical Systems in Neuroscience, The MIT Press
- Rieke F., Warland D., Ruyter Van Steveninck R., Bialek W., Spikes: Exploring the Neural Code, A Bradford Book

## Lossy image autoencoders with convolution and deconvolution networks in Tensorflow – Giuseppe Bonaccorso

Autoencoders are a very interesting deep learning application because they allow a consistent dimensionality reduction of an entire dataset with a controllable loss level. The Jupyter notebook for this small project is available on the Github repository: https://github.com/giuseppebonaccorso/lossy_image_autoencoder.

The post A glimpse into the Self-Organizing Maps (SOM) appeared first on Giuseppe Bonaccorso.

Let’s consider a dataset containing N p-dimensional samples. A suitable SOM is a matrix (other shapes, like toroids, are also possible) containing (K × L) receptors, each of which is made up of p synaptic weights. The resulting structure is a three-dimensional matrix W with shape (K × L × p).

During the learning process, a sample x is drawn from the data generating distribution X and a winning unit is computed as:

In the previous formula, I’ve adopted the vectorized notation. The process must compute the distance between x and all p-dimensional vectors W[i, j], determining the tuple (i, j) corresponding to the unit that has shown the maximum activation (minimum distance). Once this winning unit has been found, a **distance function** must be computed. Consider the schema shown in the following figure:

At the beginning of the training process, we are not sure if the winning unit will remain the same; therefore, we apply the update to a neighborhood centered on (i, j) with a radius that decreases as the training epochs proceed. In the figure, for example, the winning unit can start by also considering the units (2, 3) → (2, L), …, (K, 3) → (K, L), but, after some epochs, the radius becomes 1, considering only (i, j) without any other neighbor.

This function is represented as a (K × L) matrix whose generic element is:

The numerator of the exponential is the Euclidean distance between the winning unit and the generic receptor (i, j). The distance is controlled by the parameter σ(t) which should decrease (possibly exponentially) with the number of epochs. Many authors (like Floreano and Mattiussi, see the reference for further information) suggest introducing a time-constant τ and defining σ(t) as:

The competitive update rule for the weights is:

where η(t) is the learning rate, which can be fixed (for example η = 0.05) or exponentially decaying like σ(t). The weights are updated by summing the Δw term:
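The rules described in this section can be summarized in one place. The following is a reconstruction from the surrounding text rather than the original formulas: the star notation for the winning unit, the squared distance, and the factor 2 in the exponent (a Gaussian neighborhood) are assumptions.

```latex
\begin{aligned}
(i^*, j^*) &= \underset{i,j}{\operatorname{argmin}} \; \lVert W[i,j] - x \rVert_2 \\
\delta f(i,j) &= \exp\!\left(-\frac{\lVert (i^*, j^*) - (i,j) \rVert_2^2}{2\sigma(t)^2}\right) \\
\sigma(t) &= \sigma(0)\, e^{-t/\tau} \\
\Delta w_{ij} &= \eta(t)\, \delta f(i,j)\, \bigl(x - w_{ij}\bigr) \\
w_{ij} &\leftarrow w_{ij} + \Delta w_{ij}
\end{aligned}
```

With this form, δf equals 1 on the winning unit and decays with the grid distance, so the update is strongest there and vanishes far away.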

As it’s possible to see, the weights belonging to the neighborhood of the winning unit (determined by δf) are simply moved closer to x, while the others remain fixed (δf = 0). The role of the distance function is to impose a maximum update (which must produce the strongest response) in proximity to the winning unit and a decreasing one to all the other units. In the following figure there’s a bidimensional representation of this process:

All the units whose index is lower than i − 2 or greater than i + 2 (where i is the index of the winning unit) are kept fixed. Therefore, when the distance function becomes narrow enough to include only the winning unit, a maximum selectivity principle is achieved through a competitive process. This strategy is necessary to allow the map to create the responsive areas in the most efficient way. Without a distance function, the competition could not happen at all or, in some cases, the same unit could be activated by the same patterns, driving the network to an infinite cycle.
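The whole procedure can be sketched with NumPy as follows. This is not the code of the GIST: the function name `train_som`, the Gaussian form of the distance function, the fixed learning rate, and all default values are assumptions chosen for a small, fast example.

```python
import numpy as np

def train_som(X, K=8, L=8, n_iter=500, eta=0.1, sigma0=3.0, tau=100.0, seed=0):
    """Train a (K x L) self-organizing map on X with shape (N, p)."""
    rng = np.random.default_rng(seed)
    # Uniform(min(X), max(X)) initialization of the (K x L x p) weights
    W = rng.uniform(X.min(), X.max(), size=(K, L, X.shape[1]))
    # grid[i, j] = (i, j): coordinates of every receptor
    grid = np.stack(
        np.meshgrid(np.arange(K), np.arange(L), indexing="ij"), axis=-1)

    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # Winning unit: minimum Euclidean distance between x and W[i, j]
        dist = np.linalg.norm(W - x, axis=2)
        winner = np.unravel_index(np.argmin(dist), dist.shape)
        # Gaussian distance function with an exponentially shrinking radius
        sigma = sigma0 * np.exp(-t / tau)
        grid_d2 = np.sum((grid - np.array(winner)) ** 2, axis=2)
        delta_f = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        # Competitive update: move the neighborhood closer to x
        W += eta * delta_f[:, :, None] * (x - W)
    return W

rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))      # toy dataset: 200 samples, p = 3
W_trained = train_som(X)
```

Each update is a convex combination of the current weights and a sample, so the weights never leave the range covered by the initialization and the data.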

Even if we don’t provide any proof, a Kohonen network trained using a decreasing distance function will converge to a stable state if the matrix (K × L) is large enough. In general, I suggest starting with smaller matrices (but with KL > N) and increasing the dimensions until the desired result is achieved. Unfortunately, however, the process is quite slow, and convergence is normally achieved after thousands of iterations over the whole dataset. For this reason, it’s often preferable to reduce the dimensionality (via PCA, Kernel-PCA or NMF) and use a standard clustering algorithm.

An important element is the initialization of W. There are no best practices; however, if the weights are already close to the most significant components of the population, the convergence can be faster. A possibility is to perform a spectral decomposition (like in a PCA) to determine the principal eigenvector(s). However, when the dataset presents non-linearities, this process is almost useless. I suggest initializing the weights by randomly sampling the values from a Uniform(min(X), max(X)) distribution (X is the input dataset). Another possibility is to try different initializations, computing the average error on the whole dataset and taking the values that minimize this error (it’s useful to try sampling from Gaussian and Uniform distributions with a number of trials greater than 10). Sampling from a Uniform distribution, in general, produces average errors with a standard deviation < 0.1, therefore it doesn’t make sense to try too many combinations.

To test the SOM, I’ve decided to use the first 100 faces from the Olivetti dataset (*AT&T Laboratories Cambridge*), provided by Scikit-Learn. Each sample is a vector of 4096 floats in [0, 1] (corresponding to a 64 × 64 grayscale image). I’ve used a matrix with shape (20 × 20) and 5000 iterations, using a distance function with σ(0) = 10 and τ = 400 for the first 100 iterations and fixing the values σ = 0.01 and η = 0.1 ÷ 0.5 for the remaining ones. In order to speed up the block computations, I’ve decided to use CuPy, which works like NumPy but exploits the GPU (it’s possible to switch to NumPy by simply changing the import and using the namespace np instead of cp). Unfortunately, there are many loops and it’s not so easy to parallelize all the operations. The code is available in this GIST:

View the code on Gist.

The matrix with the weight vectors reshaped as square images (like the original samples), is shown in the following figure:

A sharp-eyed reader can see some slight differences between the faces (in the eyes, nose, mouth, and brow). The most defined faces correspond to winning units, while the others are neighbors. Consider that each of them is a synaptic weight and it must match a specific pattern. We can think of those vectors as pseudo-eigenfaces like in PCA, even if the competitive learning tries to find the values that minimize the Euclidean distance, so the objective is not finding independent components. As the dataset is limited to 100 samples, not all the face details have been included in the training set.

References:

- Floreano D., Mattiussi C., Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies, The MIT Press

See also:

## ML Algorithms addendum: Passive Aggressive Algorithms – Giuseppe Bonaccorso

Passive Aggressive Algorithms are a family of online learning algorithms (for both classification and regression) proposed by Crammer et al. The idea is very simple and their performance has been proven to be superior to many other alternative methods like Online Perceptron and MIRA (see the original paper in the reference section).

The post ML Algorithms addendum: Passive Aggressive Algorithms appeared first on Giuseppe Bonaccorso.

Let’s suppose we have a dataset:

The index t has been chosen to mark the temporal dimension. In this case, in fact, the samples can continue arriving for an indefinite time. Of course, if they are drawn from the same data generating distribution, the algorithm will keep learning (probably without large parameter modifications), but if they are drawn from a completely different distribution, the weights will slowly *forget* the previous one and learn the new distribution. For simplicity, we also assume we’re working with a binary classification based on bipolar labels.

Given a weight vector w, the prediction is simply obtained as:

All these algorithms are based on the Hinge loss function (the same used by SVM):

The value of L is bounded between 0 (meaning perfect match) and K depending on f(x(t),θ) with K>0 (completely wrong prediction). A Passive-Aggressive algorithm works generically with this update rule:
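For reference, the quantities introduced so far can be written compactly. This is a reconstruction based on the text and on the formulation of Crammer et al.; the tilde on the prediction is an assumed notation, and the linear slack term Cξ corresponds to the PA-I variant of the paper (the PA-II variant uses Cξ² instead).

```latex
\begin{aligned}
&X = \{x_0, x_1, \ldots, x_t, \ldots\}, \; x_i \in \mathbb{R}^n,
\qquad Y = \{y_0, y_1, \ldots, y_t, \ldots\}, \; y_i \in \{-1, +1\} \\
&\tilde{y}_t = \operatorname{sign}\bigl(w^T x_t\bigr) \\
&L(\theta) = \max\bigl(0,\; 1 - y \cdot f(x_t; \theta)\bigr) \\
&w_{t+1} = \underset{w}{\operatorname{argmin}} \; \tfrac{1}{2}\lVert w - w_t \rVert^2 + C\,\xi
\quad \text{subject to} \quad L(w; x_t, y_t) \le \xi, \;\; \xi \ge 0
\end{aligned}
```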

To understand this rule, let’s assume the slack variable ξ=0 (and L constrained to be 0). If a sample x(t) is presented, the classifier uses the current weight vector to determine the sign. If the sign is correct, the loss function is 0 and the argmin is w(t). This means that the algorithm is **passive** when a correct classification occurs. Let’s now assume that a misclassification occurred:

The angle θ > 90°; therefore, the dot product is negative and the sample is classified as -1, while its label is +1. In this case, the update rule becomes very **aggressive**, because it looks for a new w which must be as close as possible to the previous one (otherwise the existing knowledge is immediately lost), but it must satisfy L=0 (in other words, the classification must be correct).

The introduction of the slack variable allows soft margins (like in SVM) and a degree of tolerance controlled by the parameter C. In particular, the loss function only has to satisfy L <= ξ, allowing a larger error. Higher C values yield stronger aggressiveness (with a consequently higher risk of destabilization in the presence of noise), while lower values allow a better adaptation. In fact, this kind of algorithm, when working online, must cope with the presence of noisy samples (with wrong labels). Good robustness is necessary; otherwise, too rapid changes produce higher misclassification rates.

After solving both update conditions, we get the closed-form update rule:

This rule confirms our expectations: the weight vector is updated with a factor whose sign is determined by y(t) and whose magnitude is proportional to the error. Note that if there’s no misclassification the numerator becomes 0, so w(t+1) = w(t), while, in case of misclassification, w will rotate towards x(t) and stop with a loss L <= ξ. In the next figure, the effect has been marked to show the rotation; however, it’s normally as small as possible:

After the rotation, θ < 90° and the dot product becomes positive, so the sample is correctly classified as +1. Scikit-Learn implements Passive Aggressive algorithms, but I preferred to implement the code myself, just to show how simple they are. In the next snippet (also available in this GIST), I first create a dataset, then compute the score with a Logistic Regression, and finally apply the PA and measure the final score on a test set:

View the code on Gist.
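As a stand-in for the embedded snippet, here is a minimal sketch of an online Passive-Aggressive classifier. It uses the PA-I step size min(C, L / ‖x‖²) from Crammer et al., which may differ from the exact variant used in the GIST; the synthetic dataset, the function name, and the value of C are assumptions.

```python
import numpy as np

def pa_classify_fit(X, y, C=0.01):
    """Online PA-I classifier for bipolar labels y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for x_t, y_t in zip(X, y):
        loss = max(0.0, 1.0 - y_t * np.dot(w, x_t))   # hinge loss
        if loss > 0.0:
            # Aggressive step: rotate w towards y_t * x_t
            tau = min(C, loss / np.dot(x_t, x_t))
            w += tau * y_t * x_t
        # loss == 0: passive, w is left unchanged
    return w

# Two Gaussian blobs, symmetric with respect to the origin
rng = np.random.default_rng(0)
n = 500
X = np.vstack([rng.normal(+2.0, 1.0, size=(n, 2)),
               rng.normal(-2.0, 1.0, size=(n, 2))])
y = np.concatenate([np.ones(n), -np.ones(n)])
order = rng.permutation(2 * n)
X, y = X[order], y[order]

w = pa_classify_fit(X[:800], y[:800])
accuracy = np.mean(np.sign(X[800:] @ w) == y[800:])
```

A larger C allows bigger single-sample corrections, which speeds up adaptation but makes the weights more sensitive to mislabeled samples.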

For regression, the algorithm is very similar, but it’s now based on a slightly different Hinge loss function (called ε-insensitive):

The parameter ε determines a tolerance for prediction errors. The update conditions are the same adopted for classification problems and the resulting update rule is:

Just like for classification, Scikit-Learn also implements a Passive Aggressive regressor; however, in the next snippet (also available in this GIST), there’s a custom implementation:

View the code on Gist.
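As a stand-in for the embedded snippet, the regression case can be sketched in the same way. The step size again follows the PA-I variant of Crammer et al., which is an assumption about the GIST; the synthetic data and the values of C and ε are chosen only for illustration.

```python
import numpy as np

def pa_regress_fit(X, y, C=0.1, eps=0.1):
    """Online PA-I regressor with the epsilon-insensitive loss."""
    w = np.zeros(X.shape[1])
    errors = []
    for x_t, y_t in zip(X, y):
        residual = y_t - np.dot(w, x_t)
        errors.append(abs(residual))
        # No loss as long as the absolute error is within the tolerance eps
        loss = max(0.0, abs(residual) - eps)
        if loss > 0.0:
            tau = min(C, loss / np.dot(x_t, x_t))
            w += tau * np.sign(residual) * x_t
    return w, np.array(errors)

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])          # hypothetical ground truth
X = rng.normal(size=(2000, 3))
y = X @ true_w + rng.normal(scale=0.05, size=2000)

w, errors = pa_regress_fit(X, y)
```

The `errors` array plays the role of the error plot: it is large during the initial transient and drops once the weights get close to the generating ones.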

The error plot is shown in the following figure:

The quality of the regression (in particular, the length of the transient period when the error is high) can be controlled by picking better C and ε values. In particular, I suggest checking different range centers for C (100, 10, 1, 0.1, 0.01), in order to determine whether a higher aggressiveness is preferable.

References:

- Crammer K., Dekel O., Keshet J., Shalev-Shwartz S., Singer Y., Online Passive-Aggressive Algorithms, Journal of Machine Learning Research 7 (2006) 551–585

See also:

## ML Algorithms Addendum: Instance Based Learning – Giuseppe Bonaccorso

Contrary to the majority of machine learning algorithms, Instance Based Learning is model-free, meaning that there are no strong assumptions about the structure of regressors, classifiers or clustering functions. They are “simply” determined by the data, according to an affinity induced by a distance metric (the most common name for this approach is Nearest Neighbors).

The post A Brief (and Comprehensive) Guide to Stochastic Gradient Descent Algorithms appeared first on Giuseppe Bonaccorso.

First of all, it’s necessary to standardize the naming. In some books, the expression “**Stochastic Gradient Descent**” refers to an algorithm which operates on a batch size equal to 1, while “**Mini-batch Gradient Descent**” is adopted when the batch size is greater than 1. In this context, we assume that Stochastic Gradient Descent operates on batch sizes equal to or greater than 1. In particular, if we define the loss function for a single sample as:

where x is the input sample, y the label(s) and θ is the parameter vector, we can also define the *partial* cost function considering a batch size equal to N:

The *vanilla* Stochastic Gradient Descent algorithm is based on a θ update rule that must move the weights in the opposite direction of the gradient of L (the gradient always points in the direction of steepest ascent):
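The three definitions above can be summarized as follows. The symbol J for the single-sample loss, the 1/N normalization, and the superscript for the iteration are assumptions; L, θ, N, and the learning rate α are the names used in the text.

```latex
\begin{aligned}
&J(x, y; \theta) && \text{(loss for a single sample)} \\
&L(\theta) = \frac{1}{N} \sum_{i=1}^{N} J(x_i, y_i; \theta) && \text{(partial cost on a batch of size } N\text{)} \\
&\theta^{(t+1)} = \theta^{(t)} - \alpha \, \nabla_{\theta} L\bigl(\theta^{(t)}\bigr) && \text{(vanilla update)}
\end{aligned}
```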

This process is represented in the following figure:

α is the learning rate, while θstart is the initial point and θopt is the global minimum we’re looking for. In a standard optimization problem, without particular requirements, the algorithm converges in a limited number of iterations. Unfortunately, the reality is a little bit different, in particular in deep models, where the number of parameters is in the order of ten or one hundred million. When the system is relatively shallow, it’s easier to find **local minima** where the training process can stop, while in deeper models, the probability of a local minimum becomes smaller and, instead, saddle points become more and more likely.

To understand this concept, assuming a generic scalar function L(θ) of the parameter vector, the conditions for a point to be a minimum are:

If the Hessian matrix is (m × m), where m is the number of parameters, the second condition is equivalent to saying that all m eigenvalues must be non-negative. For the strict case (positive definite Hessian), Sylvester’s criterion gives an equivalent test: all leading principal minors, that is, the determinants of the submatrices made of the first n rows and n columns for n = 1, …, m, must be positive.

From a probabilistic viewpoint, P(H positive semi-def.) → 0 when m → ∞, therefore local minima are rare in deep models (this doesn’t mean that local minima are impossible, but their relative *weight* is lower and lower in deeper models). If the Hessian matrix has both positive and negative eigenvalues (and the gradient is null), the Hessian is said to be indefinite and the point is called a **saddle point**. In this case, the point is a maximum along one direction and a minimum along another. In the following figure, there’s an example created with a 3-dimensional cost function:

It’s obvious that both local minima and saddle points are problematic and the optimization algorithm should be able to avoid them. Moreover, there are sometimes conditions called **plateaus**, where L(θ) is almost flat in a very wide region. This drives the gradient close to zero, with an increased risk of stopping at a sub-optimal point. This example is shown in the next figure:
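
The eigenvalue test can be sketched in a few lines. The example below is limited to two parameters, so that the eigenvalues of the symmetric Hessian have a closed form and no external library is needed; the function names and the two toy cost functions are assumptions made for illustration.

```python
import math

def symmetric_2x2_eigenvalues(h):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]] in closed form."""
    a, b, d = h[0][0], h[0][1], h[1][1]
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

def classify_critical_point(hessian, tol=1e-12):
    """Classify a point with null gradient from the sign of the Hessian eigenvalues."""
    low, high = symmetric_2x2_eigenvalues(hessian)
    if low > tol:
        return "minimum"
    if high < -tol:
        return "maximum"
    if low < -tol and high > tol:
        return "saddle point"
    return "degenerate (second-order test inconclusive)"

# Hessians evaluated at the origin, where both gradients are null
bowl = classify_critical_point([[2.0, 0.0], [0.0, 2.0]])     # L = x^2 + y^2
saddle = classify_critical_point([[2.0, 0.0], [0.0, -2.0]])  # L = x^2 - y^2
print(bowl, "|", saddle)
```

For a real network the Hessian is far too large to be built explicitly, so this is only a way to visualize the definition, not a practical diagnostic.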

For these reasons, now we’re going to discuss some common methods to improve the performance of a *vanilla* Stochastic Gradient Descent algorithm.

A very simple approach to the problem of plateaus is adding a small noisy term (Gaussian noise) to the gradient:

The variance should be carefully chosen (for example, it could decay exponentially during the epochs). However, this method can be a very simple and effective solution to allow a *movement* even in regions where the gradient is close to zero.
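
A minimal sketch of this perturbation, assuming zero-mean Gaussian noise whose standard deviation decays exponentially with the epoch. The initial value `sigma0`, the `decay` constant and the helper names are hypothetical choices, not values suggested by the original post.

```python
import math
import random

def noise_std(epoch, sigma0=0.1, decay=0.05):
    """Standard deviation of the perturbation, decaying exponentially with the epoch."""
    return sigma0 * math.exp(-decay * epoch)

def perturbed_gradient(grad, epoch, rng, sigma0=0.1, decay=0.05):
    """Add zero-mean Gaussian noise to every component of the gradient."""
    std = noise_std(epoch, sigma0, decay)
    return [g + rng.gauss(0.0, std) for g in grad]

rng = random.Random(0)
flat_gradient = [0.0, 0.0]   # on a plateau the true gradient is (almost) null
early = perturbed_gradient(flat_gradient, epoch=0, rng=rng)
late = perturbed_gradient(flat_gradient, epoch=200, rng=rng)
print(early, late)
```

Early in the training the noisy term still allows a movement on the plateau, while after many epochs the perturbation is negligible and the update is driven by the gradient alone.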

The previous approach is quite simple, but in many cases it can be difficult to implement effectively. A more robust solution is provided by introducing an exponentially weighted moving average for the gradients. The idea is very intuitive: instead of considering only the current gradient, we can *attach* part of its history to the correction factor, so as to avoid an abrupt change when the surface becomes flat. The **Momentum** algorithm is, therefore:

The first equation computes the correction factor considering the weight μ. If μ is small, the previous gradients are soon discarded. If, instead, μ → 1, their effect continues for a longer time. A common value in many applications is between 0.75 and 0.99; however, it’s important to consider μ as a hyperparameter to adjust in every application. The second equation performs the parameter update. In the following figure, there’s a vectorial representation of a Momentum step:
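
The two steps can be sketched as follows. The code assumes the common formulation v(t+1) = μv(t) − α∇L(θ) followed by θ(t+1) = θ(t) + v(t+1), which may differ in notation from the formulas of the post; the elongated quadratic cost and the hyperparameter values are hypothetical.

```python
def gradient(theta):
    """Gradient of the toy cost L(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

def momentum_descent(theta, alpha=0.02, mu=0.9, steps=500):
    v = [0.0 for _ in theta]
    for _ in range(steps):
        g = gradient(theta)
        # correction factor: decayed history minus the scaled current gradient
        v = [mu * vi - alpha * gi for vi, gi in zip(v, g)]
        # parameter update
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

theta_final = momentum_descent([1.0, 1.0])
print(theta_final)
```

Setting `mu=0.0` in the same function gives back the vanilla update, which makes it easy to compare the two behaviors on the same cost.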

A slightly different variation is provided by the **Nesterov Momentum**. The difference with the base algorithm is that we first apply the correction with the current factor v(t) to determine the gradient and then compute v(t+1) and correct the parameters:

This algorithm can improve the convergence speed (the result has been theoretically proven by Nesterov in the scope of mathematical optimization); however, in deep learning contexts, it doesn’t seem to produce excellent results. It can be implemented alone or in conjunction with the other algorithms that we’re going to discuss.

**RMSProp**, proposed by G. Hinton, is based on the idea of adapting the correction factor for each parameter, so as to increase the effect on slowly-changing parameters and reduce it when their change magnitude is very large. This approach can dramatically improve the performance of a deep network, but it’s a little bit more expensive than Momentum because we need to compute a *speed* term for each parameter:
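
A minimal sketch of the Nesterov variant, assuming the usual look-ahead formulation: the gradient is evaluated at θ + μv(t), then v(t+1) and the parameters are updated. The toy cost and the hyperparameters are the same hypothetical ones that could be used for plain Momentum.

```python
def gradient(theta):
    """Gradient of the toy cost L(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

def nesterov_descent(theta, alpha=0.02, mu=0.9, steps=500):
    v = [0.0 for _ in theta]
    for _ in range(steps):
        # look-ahead point reached by applying the current correction factor
        lookahead = [ti + mu * vi for ti, vi in zip(theta, v)]
        g = gradient(lookahead)
        # new correction factor computed with the look-ahead gradient
        v = [mu * vi - alpha * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

theta_final = nesterov_descent([1.0, 1.0])
print(theta_final)
```

The only difference with respect to plain Momentum is the point where the gradient is computed.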

This term computes the exponentially weighted moving average of the gradient squared (element-wise). Just like for Momentum, μ determines how fast the previous speeds are forgotten. The parameter update rule is:

α is the learning rate and δ is a small constant (roughly 1e-6 to 1e-5) introduced to avoid a division by zero when the speed is null. As you can see, each parameter is updated with a rule that is very similar to the *vanilla* Stochastic Gradient Descent, but the actual learning rate is adjusted per single parameter using the reciprocal of the square root of the corresponding speed. Large gradients determine large speeds and, adaptively, the corresponding update is smaller, and vice versa. **RMSProp** is a very powerful and flexible algorithm and it is widely used in Deep Reinforcement Learning, CNN, and RNN-based projects.

**Adam** is an adaptive algorithm that could be considered as an extension of **RMSProp**. Instead of considering only the exponentially weighted moving average of the squared gradient, it also computes the same quantity for the gradient itself:

μ1 and μ2 are forgetting factors like in the other algorithms. The authors suggest values of 0.9 or greater (the defaults in the original paper are 0.9 and 0.999). As both terms are moving estimates of the first and the second moment, they can be biased (see this article for further information). Adam provides a bias correction for both terms:
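
The RMSProp rule can be sketched as below, assuming the form v ← μv + (1 − μ)g² and θ ← θ − αg / (√v + δ). Some references place δ inside the square root, so the exact position of the constant is an assumption, like the toy cost and the hyperparameter values.

```python
import math

def gradient(theta):
    """Gradient of the toy cost L(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

def rmsprop(theta, alpha=0.01, mu=0.9, delta=1e-6, steps=2000):
    v = [0.0 for _ in theta]
    for _ in range(steps):
        g = gradient(theta)
        # exponentially weighted moving average of the squared gradient
        v = [mu * vi + (1.0 - mu) * gi * gi for vi, gi in zip(v, g)]
        # per-parameter learning rate: alpha / (sqrt(speed) + delta)
        theta = [ti - alpha * gi / (math.sqrt(vi) + delta)
                 for ti, gi, vi in zip(theta, g, v)]
    return theta

theta_final = rmsprop([1.0, 1.0])
print(theta_final)
```

Even though the two directions have very different curvatures, both parameters move with steps of comparable size, which is the adaptive effect described above. With a constant α the parameters are not guaranteed to settle exactly on the minimum, only in a small neighborhood of it.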

The parameter update rule becomes:

This rule can be considered as a standard RMSProp one with a momentum term. In fact, the correction term is made up of: (numerator) the moving average of the gradient and (denominator) the adaptive factor that modulates the magnitude of the change, so as to avoid different changing *amplitudes* for different parameters. The constant δ, like for RMSProp, should be set to a small value such as 1e-6; it’s needed to improve the numerical stability.

**AdaGrad** is another adaptive algorithm, based on the idea of considering the historical sum of the squared gradient and setting the correction factor of a parameter so as to scale its value with the reciprocal of the square root of this historical sum. The concept is not very different from RMSProp and Adam, but, in this case, we don’t use an exponentially weighted moving average, but the whole history. The accumulation step is:

The update step is exactly the same as in RMSProp:
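
A minimal sketch of Adam, assuming the standard formulation with bias-corrected first and second moments and the update θ ← θ − αm̂ / (√v̂ + δ). The toy cost, the number of steps and the value of α are hypothetical.

```python
import math

def gradient(theta):
    """Gradient of the toy cost L(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

def adam(theta, alpha=0.01, mu1=0.9, mu2=0.999, delta=1e-6, steps=50):
    m = [0.0 for _ in theta]
    v = [0.0 for _ in theta]
    for t in range(1, steps + 1):
        g = gradient(theta)
        # moving averages of the gradient and of the squared gradient
        m = [mu1 * mi + (1.0 - mu1) * gi for mi, gi in zip(m, g)]
        v = [mu2 * vi + (1.0 - mu2) * gi * gi for vi, gi in zip(v, g)]
        # bias correction: both averages start from zero
        m_hat = [mi / (1.0 - mu1 ** t) for mi in m]
        v_hat = [vi / (1.0 - mu2 ** t) for vi in v]
        theta = [ti - alpha * mh / (math.sqrt(vh) + delta)
                 for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

after_one_step = adam([1.0, 1.0], steps=1)
after_fifty_steps = adam([1.0, 1.0], steps=50)
print(after_one_step, after_fifty_steps)
```

Thanks to the bias correction, the very first step has a magnitude close to α for every parameter, regardless of the scale of its gradient.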

**AdaGrad** shows good performance in many tasks, increasing the convergence speed, but it has a drawback deriving from accumulating the whole squared gradient history. As each term is non-negative, the accumulated sum g → ∞ and the correction factor (which is the adaptive learning rate) → 0. Therefore, during the first iterations, AdaGrad can produce significant changes, but at the end of a long training process, the change rate is almost null.

**AdaDelta** is an algorithm proposed by M. D. Zeiler to solve the problem of AdaGrad. The idea is to consider a limited window instead of accumulating for the whole history. In particular, this result is achieved using an exponentially weighted moving average (like for RMSProp):

A very interesting expedient introduced by AdaDelta is to normalize the parameter updates so that they have the same unit as the parameter. In fact, if we consider RMSProp, we have:

This means that an update is unitless. To avoid this problem, AdaDelta computes the exponentially weighted average of the squared updates and applies this update rule:

This algorithm shows very good performance in many deep learning tasks and it’s less expensive than AdaGrad. The best value for the constant δ should be assessed through cross-validation; however, 1e-6 is a good default for many tasks, while μ can be set greater than 0.9 in order to avoid a predominance of old gradients and updates.
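
The shrinking learning rate can be observed directly. The sketch below assumes the usual AdaGrad form (accumulate g², then θ ← θ − αg / (√accumulator + δ)) and records the effective learning rate of the first parameter at every step; the toy cost and the value of α are hypothetical.

```python
import math

def gradient(theta):
    """Gradient of the toy cost L(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

def adagrad(theta, alpha=0.5, delta=1e-6, steps=100):
    acc = [0.0 for _ in theta]
    effective_rates = []
    for _ in range(steps):
        g = gradient(theta)
        # the whole history of squared gradients is accumulated
        acc = [ai + gi * gi for ai, gi in zip(acc, g)]
        rates = [alpha / (math.sqrt(ai) + delta) for ai in acc]
        theta = [ti - ri * gi for ti, ri, gi in zip(theta, rates, g)]
        effective_rates.append(rates[0])
    return theta, effective_rates

theta_final, effective_rates = adagrad([1.0, 1.0])
print(theta_final, effective_rates[0], effective_rates[-1])
```

As the accumulator can only grow, the recorded rate never increases during the training.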
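
A minimal sketch of AdaDelta, assuming the formulation of Zeiler's paper: the update is the gradient multiplied by the ratio between the RMS of the previous updates and the RMS of the gradients, with δ added inside both square roots. The toy cost and the number of steps are hypothetical.

```python
import math

def gradient(theta):
    """Gradient of the toy cost L(theta) = 0.5 * (theta_0^2 + 25 * theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

def adadelta(theta, mu=0.95, delta=1e-6, steps=50):
    avg_sq_grad = [0.0 for _ in theta]
    avg_sq_update = [0.0 for _ in theta]
    for _ in range(steps):
        g = gradient(theta)
        # moving average of the squared gradient (limited window)
        avg_sq_grad = [mu * a + (1.0 - mu) * gi * gi
                       for a, gi in zip(avg_sq_grad, g)]
        # update scaled by RMS(previous updates) / RMS(gradients)
        updates = [-math.sqrt(u + delta) / math.sqrt(a + delta) * gi
                   for u, a, gi in zip(avg_sq_update, avg_sq_grad, g)]
        # moving average of the squared updates
        avg_sq_update = [mu * u + (1.0 - mu) * d * d
                         for u, d in zip(avg_sq_update, updates)]
        theta = [ti + d for ti, d in zip(theta, updates)]
    return theta

after_one_step = adadelta([1.0, 1.0], steps=1)
after_fifty_steps = adadelta([1.0, 1.0], steps=50)
print(after_one_step, after_fifty_steps)
```

There is no explicit learning rate: the size of the first steps is determined by δ, which is one of the reasons why its value deserves some tuning.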

Stochastic Gradient Descent is intrinsically a powerful method; however, in non-convex scenarios, its performance can degrade. We have explored different algorithms (most of them are currently the first choice in deep learning tasks), showing their strengths and weaknesses. Now the question is: which is the best? The answer, unfortunately, doesn’t exist. All of them can perform well in some contexts and badly in others. In general, all adaptive methods tend to show similar behaviors, but every problem is a separate universe and the only silver bullet we have is trial and error. I hope the exploration has been clear and any constructive comment or question is welcome!

References:

- Goodfellow I., Bengio Y., Courville A., Deep Learning, The MIT Press
- Duchi J., Hazan E., Singer Y., Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Journal of Machine Learning Research 12 (2011) 2121-2159
- Zeiler M. D., AdaDelta: An Adaptive Learning Rate Method, arXiv:1212.5701

See also:

## Quickprop: an almost forgotten neural training algorithm – Giuseppe Bonaccorso

Standard Back-propagation is probably the best neural training algorithm for shallow and deep networks, however, it is based on the chain rule of derivatives and an update in the first layers requires a knowledge back-propagated from the last layer.


]]>The post A virtual Jacques Lacan discusses about Artificial Intelligence appeared first on Giuseppe Bonaccorso.

]]>this is a given. He is even caught in it before his birth.”

A virtual discussion with Jacques Lacan is a very hard task, above all when the main topic is Artificial Intelligence, a discipline that he may have heard about but that was still too far from the world he lived in. However, I believe that many concepts belonging to his theory are fundamental for any discipline that has to study the huge variety of human behaviors. Of course, this is a personal (and limited) reinterpretation that can make many psychoanalysts and philosophers smile, but I do believe in freedom of expression and all constructive comments are welcome. But let’s begin our virtual discussion!

PS: Someone hasn’t understood that this is a dialog where I wrote all utterances (believe it or not) and asked me if I had trained a RNN. Well, no. *This isn’t Artificial Intelligence. It’s about Artificial Intelligence*.

**G.B**. – I’d like to start this discussion with a very hot topic. Are you afraid of Artificial Intelligence?

**J.L.** – I’m afraid of human desires when they’re interpreted as normal daily messages.

**G.B**. – Can you explain this concept with further details?

**J.L. **– A desire is a strange thing. Very strange, to be honest. It can be simple, but at the same time, it can hide something we cannot decode. I’m really afraid of all powerful people because they can start thinking that they deserve a special *jouissance *(in English, think about terms like *pleasure, enjoyment, possession*).

**G.B**. – What’s so strange about it?

**J.L. **– The strange thing is something that Marx understood perfectly and that I’ve called plus-*jouissance.* The political power is a particular territory where a person can think to re-find his or her mother’s breast. In other words, I’m afraid when a person thinks that it’s possible to kill, impose absurd laws and get richer and richer only because the majority of other people stay on the other side.

**G.B **– Is it enough to be afraid of Artificial Intelligence?

**J.L.** – It’s enough to be afraid of many things, but we also need to trust human beings. Artificial Intelligence is becoming another object where the plus-*jouissance* can focus its attention. Sincerely, I don’t like Science-Fiction and I don’t think about AI as another nuclear power. Maybe it’s only a bluff, hype to feed some marketing manager, but it seems that Artificial Intelligence has a weird behavior. It’s much less pragmatic than a hydrogen bomb, but it’s much more flexible.

**G.B **– Maybe I’m beginning to understand. When you say: *flexible*, do you mean that it has more degrees of freedom?

**J.L. **– Degrees of freedom as attraction points.

**G.B **– A crazy politician could start a war even without knowing what AI really is, am I correct?

**J.L. **– He knows perfectly what AI is. AI is a toy that is asking to become a womb for new plus-jouissance*.*

**G.B **– Well, but let’s talk about the results. I’d like to be straight again: are you a reductionist?

**J.L. **– Freud was a reductionist, but he didn’t succeed in finding all the evidence. I do believe that human beings are animals, but quite different from dogs and monkeys. Look at this diagram:

We can suppose that until a certain instant, all animals were similar and they developed their intelligent abilities according to the structure of their brain. However, there was an instant when human beings started developing the most important concept of the whole history: language. It created a singularity, altering the smoothness of the evolution line.

**G.B **– Why is language so important?

**J.L. **– It’s dramatically important because it broke the original relationship with the *jouissance. *Human beings began describing and representing the world using their language. This allowed solving many problems, asking more and more questions and finding more and more answers, but… They had to pay a price. If they were somehow complete before the singularity (this isn’t a scientific assumption), after that moment, they found themselves with a missing part.

**G.B** – Which part?

**J.L. **– Impossible to say. The missing part is exactly what constantly escapes the *signification* process.

**G.B **– Well, but how can this explain the power of Artificial Intelligence?

**J.L.** – Probably it doesn’t, but, in my opinion, the best arena for AI is not a computational activity. It’s the language. Do you know the difference between *moi* and *je*?

**G.B **– *Moi* is the subject who talks, while* je* is the subject who desires. Correct?

**J.L. **– More or less. Until a certain point, it’s easy to reduce some brain process to the functions of a *moi*. A well-developed, advanced chatbot could be considered as a limited human being who always lived inside a small universe.

**G.B **– It’s not very hard to imagine a chat with such an artificial agent.

**J.L.** – Yes, it’s possible. But my question is: did it (or he, she?) enter the language world like any human being?

**G.B **– Are you wondering if it has lost the *jouissance*?

**J.L. **– Exactly. Now I can seem mad, but can a chatbot desire? Does a chatbot *je* exist?

**G.B** – This is probably irrelevant for AI developments. Or are you asking me if there will be psychoanalysts for chatbots?

**J.L. **– It’s irrelevant because your definition of *intelligence* is based on a rigid schema, mainly derived from the utility theory. Your agents must be rational and they are intelligent if they can adapt their behavior to different environmental changes so to maximize the current and future advantage.

**G.B **– Reinforcement learning.

**J.L. **– Do you really think that all human processes are governed by a reinforcement learning algorithm? Imagine two agents closed into a rectangular universe. There are enough resources to make them live for a very long time. What’s the most rational behavior?

**G.B** – If we don’t consider Game Theory and we assume that both agents know exactly the same things, well, the most rational strategy is to divide the resources. To consume a unit only after the other agent has done the same.

**J.L. **– Do human beings act this way?

**G.B **– To be honest… No, they often don’t.

**J.L. **– And what about love? Hate? Are they actually explainable using the reinforcement learning? If the purpose is the reproduction, human beings should copulate like dogs or rabbits. Don’t you think?

**G.B **– Indeed this is a very delicate problem. Many researchers prefer to avoid it. Someone else says that emotions are the effect of chemical reactions. On the other side, if I drink a glass of whiskey I feel different, uninhibited. It’s impossible not to say that also the emotions are driven by chemical processes.

**J.L.** – Probably they are, but I don’t care. No one cares. When you fall in love with a woman, are you asking yourself which combination of neurotransmitters determined that event? I think I’m more software-oriented than you! If a message appears on a monitor, it’s obvious that some electronic circuits made it happen, but this answer doesn’t satisfy you. Am I correct?

**G.B **– I’d like to know which software determined the initial cause.

**J.L. **– Exactly. Now, returning to our chatbots, I want to be more challenging (and maybe annoying): do they have a conscience?

**G.B **– I don’t know.

**J.L. **– Do they have a subconscious?

**G.B **– Why should they have one?

**J.L. **– Oh my God! Because you’re talking about artificial intelligence comparing your agents to human beings! Maybe it’d be easier if you took a rat as a benchmark!

**G.B **– Rats are an excellent benchmark to study the reinforcement learning. Have you ever heard about rats in a maze?

**J.L. **– Well, this is something I can accept. But you’re lowering your aim! I read all the news about Facebook and its chatbots that invented a language.

**G.B** – Hype. They were simply unable to communicate in English. But too many people thought they were facing a Terminator scenario.

**J.L. **– That’s why someone can be afraid of AI! You are confirming what I said at the beginning of this dialogue. A nuclear bomb explodes and kills a million people. Stop. Everything is clear. If you’re the leader of a big nation, you could be interested in nuclear weapons because they are “useful” in some situations. But, at the same time, it’s easier to oppose such a decision. With AI, the situation is completely different, because sometimes there’s a lack of scientificity.

**G.B **– You complained about this also for the psychoanalysis. For this reason, you introduced the matheme notation.

**J.L. **– The beauty of mathematics is its objectivity and I hope that also the most important philosophical concepts regarding AI will be one day formalized.

**G.B **– I think it’s easier to formalize an algorithm.

**J.L.** – Of course it is. But an algorithm for the subconscious must be interpreted in a completely different way and sometimes it can also be impossible! That’s why I keep on saying that the best arena for artificial intelligence is a linguistic world. I want to finish with a quote from one of my seminars: “*Love is giving something you don’t have to someone who doesn’t want it*“. I’m still waiting for a rational artificial agent which struggles to give something it doesn’t have to another agent which refuses it! Can reinforcement learning explain it?

**G.B **– *Just silence*

See also:

## Artificial Intelligence is a matter of Language – Giuseppe Bonaccorso

“The limits of my language means the limits of my world.” (L. Wittgenstein) When Jacques Lacan proposed his psychoanalytical theory based on the influence of language on human beings, many auditors remained initially astonished. Is language an actual limitation? In the popular culture, it isn’t. It cannot be!


]]>The post Linearly Separable? No? For me it is! A Brief introduction to Kernel Methods appeared first on Giuseppe Bonaccorso.

]]>Of course, the answer is yes, it is. Why? A dataset defined in a subspace Ω ⊆ ℜ^{n} is linearly separable if there exists an (n-1)-dimensional hyperplane that is able to separate all points belonging to a class from the others. Let’s consider the problem from another viewpoint, assuming, for simplicity, that we work in 2D.

We have defined a hypothetical separating line and we have also set an arbitrary point O as an origin. Let’s now draw the vector w, orthogonal to the line and pointing toward one of the two half-planes. Let’s now consider the inner product between w and a random point x_{0}: how can we decide if it’s on the side pointed to by w? Simple: the inner product is proportional to the cosine of the angle between the two vectors. If such an angle is between -π/2 and π/2, the cosine is non-negative; otherwise, it is non-positive:

It’s easy to prove that all angles α between w and green dots are always bounded between -π/2 and π/2, therefore, all inner products are positive (there are no points on the line, where the cosine is 0). At the same time, all β angles are bounded between π/2 and 3π/2 where the cosine is negative.

Now it should be obvious that w is the weight vector of a “pure” linear classifier (like a Perceptron) and the corresponding decision rule is simply:

The way to determine w changes according to the algorithm and so does the decision rule, which, however, is always a function strictly related to the angle between w and the sample.
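
The geometric argument can be checked with a few lines of code. This sketch assumes a separating line passing through the origin O (so no bias term is used) and a hand-picked w; the sample points and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def cosine(w, x):
    """Cosine of the angle between w and x (both assumed non-null)."""
    return dot(w, x) / (math.sqrt(dot(w, w)) * math.sqrt(dot(x, x)))

def linear_decision(w, x):
    """Return +1 if x lies on the side pointed to by w, -1 otherwise."""
    return 1 if dot(w, x) > 0 else -1

w = [1.0, 1.0]                  # orthogonal to the line x1 + x2 = 0
green_point = [2.0, 0.5]        # on the side pointed to by w
red_point = [-1.0, -3.0]        # on the opposite side

print(linear_decision(w, green_point), cosine(w, green_point))
print(linear_decision(w, red_point), cosine(w, red_point))
```

The sign of the inner product and the sign of the cosine always agree, because the norms in the denominator are positive.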

Let’s now consider another famous example:

Is this dataset linearly separable? You’re free to try, but the answer is no. For every line you draw, there will always be red and green points on the same side, so that the accuracy can never exceed a fixed (and quite low) threshold. Now, let’s suppose we project the original dataset into an m-dimensional feature space (m can also be equal to n):

After the transformation, the original half-moon dataset becomes linearly separable! However, there’s still a problem: we need a classifier that works on the original space, while it must compute inner products of transformed points. Let’s analyze the problem using SVMs (Support Vector Machines), where kernels are widely used. An SVM is an algorithm that tries to maximize the margin separating the classes. I’m not covering all details (which can be found in every Machine Learning book), but I need to define the expression for the weights:

It’s not important if you don’t know how they are derived. It’s simpler to understand the final result: if you reconsider the image shown before where we had drawn a separating line, all green points have y=1, therefore w is the weighted sum of all samples. The values of α_{i} are obtained through the optimization process, however, at the end, we should observe a w similar to what we have arbitrarily drawn.

If we need to move to the feature space, every x must be mapped through φ(x), therefore, we get:

And the corresponding decision function becomes:

As the sum is extended to all samples, we need to compute N inner products of transformed samples. This isn’t a particularly fast solution! However, there are particular functions called kernels, which have a very nice property:

In other words, the kernel, evaluated at x_{i}, x_{j}, is equal to the inner product of the two points projected by a function ρ(x). Pay attention: ρ is another function because, if we start from the projection, we might not be able to find a kernel, while if we start from the kernel, there is a famous mathematical result (Mercer’s theorem) that guarantees, given a continuous, symmetric and (semi-)positive definite kernel, the existence of an implicit transformation ρ(x) under some (easily met) conditions.

It’s not difficult to understand that, using a kernel, the decision function becomes:

which is much easier to evaluate.
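
To see a kernelized decision function at work without a full SVM solver, the sketch below swaps in a simpler technique, the kernel perceptron, which also produces a decision function of the form sign(Σ α_i y_i K(x_i, x)). The XOR toy set replaces the half-moon data, no bias term is used, and every name in the code is hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def train_kernel_perceptron(X, y, kernel, max_epochs=100):
    """Learn one coefficient per sample; only kernel evaluations are needed."""
    alphas = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for j, (xj, yj) in enumerate(zip(X, y)):
            score = sum(a * yi * kernel(xi, xj)
                        for a, xi, yi in zip(alphas, X, y))
            if yj * score <= 0:
                alphas[j] += 1      # misclassified: increase the sample weight
                mistakes += 1
        if mistakes == 0:
            break
    return alphas

def predict(x, X, y, alphas, kernel):
    score = sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alphas, X, y))
    return 1 if score > 0 else -1

# XOR: not linearly separable in the original 2D space
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [-1, 1, 1, -1]
alphas = train_kernel_perceptron(X, y, rbf_kernel)
predictions = [predict(x, X, y, alphas, rbf_kernel) for x in X]
print(alphas, predictions)
```

The transformation ρ(x) never appears in the code: both training and prediction only call the kernel on pairs of original points.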

From a theoretical viewpoint we could try to find a kernel for a specific transformation, but in real life, only a few kernels have proved really useful in many different contexts. For example, the Radial Basis Function:

which is excellent whenever we need to remap the original space into a radial one. Or the polynomial kernel:

which determines all the polynomial term-features (like x^{2} or 2 x_{i}x_{j}). There are also other kernels, like the sigmoid (tanh) one, but it’s always a good idea to start with a linear classifier. If the accuracy is too low, it’s possible to try an RBF or polynomial kernel, then other variants, and, if nothing works, it’s probably necessary to change the model!

Kernels are not limited to classifiers: they can be employed in PCA (Kernel PCA allows extracting principal components from non-linear datasets) and in instance-based learning algorithms. The principle is always the same (and it’s the kernel silver bullet): a kernel represents the inner product of non-linear projections that can remap the original dataset in a higher dimensional space where it’s easier to find a linear solution.
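
The kernel property can be verified numerically in a simple case. The sketch below assumes a homogeneous polynomial kernel of degree 2 in two dimensions, K(x, z) = (x · z)², whose explicit feature map is (x1², √2 x1x2, x2²), together with an RBF kernel of the form exp(−γ‖x − z‖²); the parameter names are hypothetical and the kernels in the post may include additional constants.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def polynomial_kernel(x, z, degree=2):
    """Homogeneous polynomial kernel (no constant term)."""
    return dot(x, z) ** degree

def explicit_degree2_map(x):
    """Feature map whose inner product reproduces the degree-2 kernel in 2D."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def rbf_kernel(x, z, gamma=1.0):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x, z = [1.0, 2.0], [3.0, 4.0]
k_value = polynomial_kernel(x, z)
phi_value = dot(explicit_degree2_map(x), explicit_degree2_map(z))
print(k_value, phi_value, rbf_kernel(x, z))
```

The kernel needs a single inner product in the original space, while the explicit map has to build all the polynomial features first; the gap grows quickly with the degree and the dimensionality.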

See also:

## Assessing clustering optimality with instability index – Giuseppe Bonaccorso

Many clustering algorithms need to define the number of desired clusters before fitting the model. This requirement can appear as a contradiction in an unsupervised scenario; however, in many real-world scenarios, the data scientist often already has an idea about a reasonable range of clusters.


]]>The post PCA with Rubner-Tavan Networks appeared first on Giuseppe Bonaccorso.

]]>

The eigenvectors are sorted in descending order considering the corresponding eigenvalue, therefore Cpca is a diagonal matrix where the non-null elements are λ1 >= λ2 >= λ3 >= … >= λn. By selecting the top p eigenvalues, it’s possible to perform a dimensionality reduction by projecting the samples onto the new sub-space determined by the p top eigenvectors (it’s possible to use Gram-Schmidt orthonormalization if they don’t have a unitary length). The standard PCA procedure works with a bottom-up approach, obtaining the decorrelation of C as a final effect; however, it’s possible to employ neural networks, imposing this condition as an optimization step. One of the most effective models was proposed by Rubner and Tavan (and it’s named after them). Its generic structure is:

Where we suppose that N (input dimensionality) >> M (output dimensionality). The output of the network can be computed as:

where V (n × n) is a lower-triangular matrix with all diagonal elements set to 0 and W has a shape (n × m). Moreover, it’s necessary to store y(t) in order to compute y(t+1). This procedure must be repeated until the output vector has stabilized. In general, after k < 10 iterations, the modifications are under a threshold of 0.0001; however, it’s important to check this value in every real application.

The training process is managed with two update rules:

The first one is Hebbian, based on Oja’s rule, while the second is anti-Hebbian because its purpose is to reduce the correlation between output units. In fact, without the normalization factor, the update is in the form dV = -αy(i)y(k), so as to reduce the synaptic weight when two output units are correlated (same sign).

If the matrix V is kept lower-triangular, it’s possible to vectorize the process. There are also other variants, like the Földiák network, which adopts a symmetric V matrix, so as to add the contribution of all other units to y(i). However, the Rubner-Tavan model seems more similar to the process adopted in a sequential PCA, where the second component is computed as orthogonal to the first and so forth until the last one.
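
A single training step can be sketched as follows. The code assumes the output recursion y(t+1) = Wx + Vy(t), an Oja-style Hebbian update for W and the un-normalized anti-Hebbian update for the strictly lower-triangular part of V; the learning rates, the tiny matrices and the function names are hypothetical, and the normalization factor mentioned above is deliberately omitted.

```python
def network_output(W, V, x, max_iter=10, tol=1e-4):
    """Iterate y(t+1) = W x + V y(t) until the output is stable."""
    n = len(W)
    feedforward = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    y = [0.0] * n
    for _ in range(max_iter):
        # only the units k < i contribute (V is strictly lower-triangular)
        y_next = [feedforward[i] + sum(V[i][k] * y[k] for k in range(i))
                  for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(y_next, y)) < tol
        y = y_next
        if converged:
            break
    return y

def update_weights(W, V, x, y, eta=0.01, alpha=0.1):
    """Oja-style Hebbian step for W, anti-Hebbian step for V (in place)."""
    for i in range(len(W)):
        for j in range(len(x)):
            W[i][j] += eta * y[i] * (x[j] - y[i] * W[i][j])
        for k in range(i):
            V[i][k] -= alpha * y[i] * y[k]

W = [[1.0, 0.0], [0.0, 1.0]]
V = [[0.0, 0.0], [0.5, 0.0]]
x = [2.0, 3.0]
y = network_output(W, V, x)
update_weights(W, V, x, y)
print(y, W, V)
```

In this example the two outputs have the same sign, so the lateral weight V[1][0] is decreased, which is the decorrelating effect of the anti-Hebbian rule.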

The example code is based on the MNIST dataset provided by Scikit-Learn and adopts a fixed number of cycles to stabilize the output. The code is also available in this GIST:

View the code on Gist.

See also:

## ML Algorithms Addendum: Hebbian Learning – Giuseppe Bonaccorso

Hebbian Learning is one of the most famous learning theories, proposed by the Canadian psychologist Donald Hebb in 1949, many years before his results were confirmed through neuroscientific experiments.


]]>