ML Algorithms Addendum: Hebbian Learning
Hebbian learning is one of the most famous learning theories. It was proposed by the Canadian psychologist Donald Hebb in 1949, many years before his results were confirmed by neuroscientific experiments. Artificial intelligence researchers immediately understood the importance of his theory when applied to artificial neural networks and, although more efficient algorithms have since been adopted to solve complex problems, neuroscience keeps finding more evidence of natural neurons whose learning process is almost perfectly modeled by Hebb’s equations.
Hebb’s rule is very simple and can be discussed starting from a high-level structure of a neuron with a single output:
We are considering a linear neuron, therefore the output y is a linear combination of its input values x:
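Written out, and assuming n inputs indexed by i (the symbol n and the index are notational assumptions), this linear combination can be sketched as:

```latex
y = \sum_{i=1}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}
```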
According to Hebbian theory, if the pre- and post-synaptic units behave in the same way (both firing or both remaining in the steady state), the corresponding synaptic weight is reinforced. Conversely, if their behavior is discordant, w is weakened. In other words, using a famous aphorism, “Neurons that fire together, wire together”. From a mathematical viewpoint, this rule can be expressed (in a discretized version) as:
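A common way to write the discretized rule, assuming the same per-weight index i and a positive learning rate α (the index convention is an assumption), is:

```latex
\Delta w_i = \alpha \, x_i \, y
\qquad\Longleftrightarrow\qquad
w_i(t+1) = w_i(t) + \alpha \, x_i \, y
```

The product x_i y is positive when input and output agree in sign, so the weight grows, and negative when they disagree, so it shrinks.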
Alpha is the learning rate. To better understand the implications of this rule, it’s useful to express it using vectors:
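In vector form, and assuming the update is averaged over the input samples (the angle brackets denoting that average are a notational assumption), the rule can be sketched as:

```latex
\Delta \mathbf{w} = \alpha \, \mathbf{x} \, y
                  = \alpha \, \mathbf{x} \mathbf{x}^T \mathbf{w}
\quad\Longrightarrow\quad
\langle \Delta \mathbf{w} \rangle = \alpha \, C \, \mathbf{w},
\qquad C = \langle \mathbf{x} \mathbf{x}^T \rangle
```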
C is the input correlation matrix (if the samples are zero-centered, C is also the covariance matrix), therefore the weight vector is updated in a way that amplifies the components along the direction of maximum input variance. In particular, considering the time-continuous version, if C has a dominant eigenvalue, the solution w(t) tends to a vector with the same direction as the corresponding eigenvector of C. In other words, Hebbian learning performs a PCA, extracting the first principal component.
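The time-continuous argument can be sketched as follows, assuming the learning rate is absorbed into the time scale and that C is symmetric with orthonormal eigenvectors v_i and eigenvalues λ_i (these symbols are assumptions introduced here):

```latex
\frac{d\mathbf{w}}{dt} = C \, \mathbf{w}
\quad\Longrightarrow\quad
\mathbf{w}(t) = \sum_{i} a_i \, e^{\lambda_i t} \, \mathbf{v}_i,
\qquad a_i = \mathbf{v}_i^T \mathbf{w}(0)
```

When one eigenvalue is strictly larger than the others, its exponential dominates the sum for large t, so the direction of w(t) approaches that eigenvector (provided the corresponding a_i is not zero).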
Even though this approach is feasible, there’s a problem with this rule: it is unstable. C is a positive semidefinite matrix, therefore its eigenvalues are always non-negative. That means w(t) is a linear combination of eigenvectors whose coefficients never decay and grow exponentially with t for every positive eigenvalue. Considering the discrete version, each update changes the squared norm of w by 2αy² + α²y²‖x‖², which is never negative, so the learning process increases the magnitude of the weights indefinitely and eventually produces an overflow.
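The instability is easy to observe numerically. The following is a minimal sketch, assuming a synthetic zero-centered Gaussian dataset and an arbitrary learning rate (both are illustrative choices, not taken from the original experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dataset (assumption): 200 zero-centered 2-D Gaussian samples
X = rng.normal(size=(200, 2))
X -= X.mean(axis=0)

alpha = 0.1                          # illustrative learning rate
w = rng.normal(scale=0.1, size=2)    # small random initial weights

norms = [np.linalg.norm(w)]
for x in X:
    y = w @ x                        # linear neuron output
    w = w + alpha * y * x            # plain Hebbian update
    norms.append(np.linalg.norm(w))

print("initial norm:", norms[0])
print("final norm:  ", norms[-1])
```

The recorded norm never decreases from one step to the next, which is the discrete counterpart of the exponential growth described above.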
This problem can be solved by normalizing the weights after each update (so that they saturate at finite values), but this solution is biologically unlikely because each synapse would need to know all the other weights. The best-known alternative approach was proposed by Oja, and the rule is named after him:
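Using the same symbols as before, and consistent with the auto-normalizing term -wy² discussed in the text, Oja’s rule can be written as:

```latex
\Delta \mathbf{w} = \alpha \left( \mathbf{x} \, y - \mathbf{w} \, y^2 \right)
                  = \alpha \, y \left( \mathbf{x} - y \, \mathbf{w} \right)
```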
The rule is still Hebbian, but it now includes an auto-normalizing term (-wy²). It’s easy to show that the corresponding time-continuous differential equation now has negative eigenvalues, so the solution w(t) converges. Just as with the pure Hebb’s rule, the vector w converges to the dominant eigenvector of C, but in this case its norm stays finite (it approaches 1).
Implementing Oja’s rule in Python (or with TensorFlow) is a very easy task. Let’s start with a random, centered, two-dimensional dataset (obtained using Scikit-Learn make_blobs() and StandardScaler()):
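A dataset of this kind could be generated as in the sketch below. The number of samples, the number of centers, the cluster spread, the random seed, and the choice of with_std=False (centering without rescaling) are all assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Illustrative parameters (assumptions): 500 points, 2 blobs, wide spread
X, _ = make_blobs(n_samples=500, n_features=2, centers=2,
                  cluster_std=5.0, random_state=1000)

# Center the data; with_std=False keeps the original variances
scaler = StandardScaler(with_std=False)
Xs = scaler.fit_transform(X)

print("shape:", Xs.shape)
print("column means after centering:", np.round(Xs.mean(axis=0), 6))
```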
In the following GIST, the covariance (correlation) matrix is computed together with its eigenvectors, and then Oja’s rule is applied to the dataset:
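A minimal NumPy sketch of these steps is shown here. To keep it self-contained it uses a synthetic correlated Gaussian dataset as a stand-in for the one described above; the function name oja, the learning rate, and the number of epochs are hypothetical choices:

```python
import numpy as np

rng = np.random.default_rng(1000)

# Stand-in dataset (assumption): 2-D Gaussian with variances 4 and 1,
# rotated by 30 degrees and then zero-centered
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = rng.normal(size=(1000, 2)) * np.array([2.0, 1.0]) @ R.T
X -= X.mean(axis=0)

# Covariance matrix and eigen-decomposition (eigh sorts eigenvalues ascending,
# so the dominant eigenvector is the last column)
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
dominant = eigvecs[:, -1]

def oja(X, alpha=0.001, n_epochs=50, seed=0):
    """Online Oja's rule: w <- w + alpha * y * (x - y * w)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            y = w @ x
            w += alpha * y * (x - y * w)
    return w

w = oja(X)
print("eigenvalues:", eigvals)
print("dominant eigenvector:", dominant)
print("Oja weights:", w, "norm:", np.linalg.norm(w))
```

The learned vector may point in the opposite direction to the eigenvector returned by eigh, since both signs describe the same principal axis.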
As can be seen, the algorithm has converged to the second eigenvector, whose corresponding eigenvalue is the largest.
An extension of Oja’s rule to multi-output networks is provided by Sanger’s rule (also known as the Generalized Hebbian Algorithm):
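In component form, assuming w_ij denotes the weight from input j to output i (an indexing convention chosen here), Sanger’s rule can be sketched as:

```latex
\Delta w_{ij} = \alpha \left( y_i \, x_j - y_i \sum_{k=1}^{i} w_{kj} \, y_k \right)
```

For a single output (i = 1) the sum reduces to w_1j y_1 and the expression coincides with Oja’s rule.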
In this case, the normalizing (and decorrelating) factor is applied considering only the synaptic weights up to and including the current one. Using vector notation, the update rule becomes:
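Assuming W stores one output unit per row, so that y = Wx (a layout assumption), the matrix form reads:

```latex
\Delta W = \alpha \left( \mathbf{y} \, \mathbf{x}^T
          - \mathrm{Tril}\!\left( \mathbf{y} \, \mathbf{y}^T \right) W \right),
\qquad \mathbf{y} = W \mathbf{x}
```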
Tril() is a function that returns the lower triangle of a square matrix. Sanger’s rule is able to extract all principal components, starting from the first and continuing through all output units. Just as for Oja’s rule, in the following GIST the rule is applied to the same dataset:
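A minimal NumPy sketch of Sanger’s rule follows. It uses a synthetic stand-in dataset, and the function name sanger, the learning rate, and the number of epochs are hypothetical. In this sketch each row of W holds one component, so W must be transposed to obtain the components as columns:

```python
import numpy as np

rng = np.random.default_rng(1000)

# Stand-in dataset (assumption): 2-D Gaussian with variances 4 and 1,
# rotated by 30 degrees and then zero-centered
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = rng.normal(size=(1000, 2)) * np.array([2.0, 1.0]) @ R.T
X -= X.mean(axis=0)

# Reference eigenvectors of the covariance matrix (ascending eigenvalues)
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)

def sanger(X, n_components=2, alpha=0.001, n_epochs=30, seed=0):
    """Online Sanger's rule: W <- W + alpha * (y x^T - tril(y y^T) W)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_components, X.shape[1]))
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            y = W @ x
            # np.tril keeps the lower triangle, diagonal included
            W += alpha * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
    return W

W = sanger(X)
print("eigenvectors (columns, ascending eigenvalue):")
print(eigvecs)
print("Sanger weights (rows = components, descending variance):")
print(W)
```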
As can be seen, the weight matrix contains (as columns) the two principal components (roughly parallel to the eigenvectors of C).
Hebbian learning is a very powerful unsupervised approach, thanks to its simplicity and the biological evidence supporting it. It’s easy to apply this methodology to different bio-inspired problems, such as orientation sensitivity. I recommend Dayan and Abbott (listed below) for further details about these techniques and other neuroscientific models.
- Dayan P., Abbott L. F., Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems, The MIT Press