A glimpse into Self-Organizing Maps (SOM)

Self-Organizing Maps (SOMs) are neural structures first proposed by the computer scientist T. Kohonen in the early 1980s (that’s why they are also known as Kohonen Networks). Their peculiarities are the ability to auto-cluster data according to the topological features of the samples and their approach to the learning process. In contrast to methods like Gaussian Mixtures or K-Means, a SOM learns through a competitive process: the model tries to specialize its neurons so that each of them responds only to a particular pattern family (which can also be a single input sample representing a family, like a handwritten letter).

Let’s consider a dataset containing N p-dimensional samples; a suitable SOM is a matrix (other shapes, like toroids, are also possible) containing K × L receptors, each made up of p synaptic weights. The resulting structure is a three-dimensional matrix W with shape (K × L × p).
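As a purely illustrative sketch (the shape values and variable names are mine, not from the original code), the structure can be allocated with NumPy as follows:

```python
import numpy as np

K, L, p = 20, 20, 4096   # map height, map width, sample dimensionality (illustrative values)

# K x L receptors, each made up of p synaptic weights
W = np.random.uniform(0.0, 1.0, size=(K, L, p))
```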

During the learning process, a sample x is drawn from the data-generating distribution X and a winning unit is computed as:

(i_w, j_w) = argmin_(i, j) ‖x − W[i, j]‖

In the previous formula, I’ve adopted the vectorized notation. The process must compute the distance between x and all the p-dimensional vectors W[i, j], determining the tuple (i_w, j_w) corresponding to the unit that shows the maximum activation (minimum distance). Once this winning unit has been found, a distance function must be computed.
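As a quick aside, this winning-unit search can be vectorized in a couple of lines (a minimal NumPy sketch; the function name is mine, not taken from the original GIST):

```python
import numpy as np

def winning_unit(W, x):
    """Return the tuple (i_w, j_w) of the receptor closest to x."""
    # Broadcasting: W has shape (K, L, p), x has shape (p,)
    distances = np.linalg.norm(W - x, axis=2)   # (K, L) matrix of Euclidean distances
    return np.unravel_index(np.argmin(distances), distances.shape)
```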

Consider the schema shown in the following figure: at the beginning of the training process, we are not sure whether the winning unit will remain the same, therefore the update is applied to a whole neighborhood centered in (i_w, j_w), with a radius that decreases with the training epoch. In the figure, for example, the update can initially involve also the units (2, 3) → (2, L), …, (K, 3) → (K, L), but, after some epochs, the radius shrinks to 1, so that only (i_w, j_w) is considered, without any other neighbour.

This function is represented as a (K × L) matrix whose generic element is:

δf(i, j) = exp(−‖(i, j) − (i_w, j_w)‖² / (2σ²(t)))

The numerator of the exponent is the squared Euclidean distance between the winning unit and the generic receptor (i, j), while its spread is controlled by the parameter σ(t), which should decrease (possibly exponentially) with the number of epochs. Many authors (like Floreano and Mattiussi, see the reference for further information) suggest introducing a time constant τ and defining σ(t) as:

σ(t) = σ(0) · exp(−t / τ)
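Both σ(t) and the distance function can be sketched as follows (σ(0) and τ default to the values used in the experiment described later; everything else, including the function names, is my own choice):

```python
import numpy as np

def sigma(t, sigma_0=10.0, tau=400.0):
    # sigma(t) = sigma(0) * exp(-t / tau)
    return sigma_0 * np.exp(-t / tau)

def distance_function(K, L, i_w, j_w, sigma_t):
    """(K x L) matrix whose element (i, j) is delta_f(i, j)."""
    ii, jj = np.meshgrid(np.arange(K), np.arange(L), indexing='ij')
    squared_distance = (ii - i_w) ** 2 + (jj - j_w) ** 2
    return np.exp(-squared_distance / (2.0 * sigma_t ** 2))
```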

The competitive update rule for the weights is:

Δw[i, j] = η(t) · δf(i, j) · (x − W[i, j])

Here η(t) is the learning rate, which can be fixed (for example, η = 0.05) or exponentially decaying like σ(t). The weights are updated by summing the Δw term:

W[i, j] ← W[i, j] + Δw[i, j]
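A hedged sketch of a single update step, reusing the helpers above (again, not the original implementation):

```python
import numpy as np

def update_weights(W, x, delta_f, eta_t):
    # Delta w[i, j] = eta(t) * delta_f(i, j) * (x - W[i, j]), broadcast over the p weights
    delta_w = eta_t * delta_f[:, :, np.newaxis] * (x - W)
    return W + delta_w
```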

As it’s possible to see, the weights belonging to the neighborhood of the winning unit (determined by δf) are simply moved closer to x, while the others remain almost fixed (δf ≈ 0). The role of the distance function is to impose the maximum update (which must produce the strongest response) in proximity to the winning unit and a decreasing one on all the other units. In the following figure there’s a bidimensional representation of this process:

All the units whose index lies outside the neighborhood (below i − 2 or above i + 2 in the example) are kept fixed. Therefore, when the distance function becomes narrow enough to include only the winning unit, a maximum-selectivity principle is achieved through a competitive process. This strategy is necessary to allow the map to create the responsive areas in the most efficient way. Without a distance function, the competition could not happen at all or, in some cases, the same patterns could keep activating different units, driving the network into an infinite cycle.

Even if we don’t provide any proof, a Kohonen network trained with a decreasing distance function will converge to a stable state if the matrix (K × L) is large enough. In general, I suggest trying smaller matrices first (but with K·L > N) and increasing the dimensions until the desired result is achieved. Unfortunately, the process is quite slow, and convergence is normally achieved only after thousands of iterations over the whole dataset. For this reason, it’s often preferable to reduce the dimensionality (via PCA, Kernel PCA, or NMF) and use a standard clustering algorithm, as sketched below.
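For reference, that alternative pipeline takes only a few lines with Scikit-Learn (the number of components and clusters below is purely illustrative, and X is assumed to be the dataset):

```python
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# X is assumed to be an (N, p) NumPy array of samples
X_reduced = PCA(n_components=50).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10).fit_predict(X_reduced)
```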

An important element is the initialization of W. There are no best practices; however, if the weights are already close to the most significant components of the population, the convergence can be faster. One possibility is to perform a spectral decomposition (as in PCA) to determine the principal eigenvector(s), but when the dataset presents non-linearities this process is almost useless. I suggest initializing the weights by randomly sampling the values from a Uniform(min(X), max(X)) distribution (where X is the input dataset). Another possibility is to try different initializations, computing the average error on the whole dataset and keeping the values that minimize it (it’s useful to try both Gaussian and Uniform distributions with a number of trials greater than 10). In general, sampling from a Uniform distribution produces average errors with a standard deviation < 0.1, therefore it doesn’t make sense to try too many combinations.
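A hedged sketch of this initialization strategy (function names are mine, and X is again assumed to be the dataset):

```python
import numpy as np

def initialize_weights(X, K, L, seed):
    # Sample W from a Uniform(min(X), max(X)) distribution
    rs = np.random.RandomState(seed)
    return rs.uniform(X.min(), X.max(), size=(K, L, X.shape[1]))

def average_error(W, X):
    # Average distance between each sample and its best-matching unit
    return np.mean([np.min(np.linalg.norm(W - x, axis=2)) for x in X])

# X is assumed to be an (N, p) NumPy array of samples.
# Keep the initialization (out of 10 trials) that minimizes the average error
W = min((initialize_weights(X, 20, 20, seed=s) for s in range(10)),
        key=lambda W_candidate: average_error(W_candidate, X))
```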

To test the SOM, I’ve decided to use the first 100 faces from the Olivetti dataset (AT&T Laboratories Cambridge), provided by Scikit-Learn. Each sample is a vector of 4096 floats in [0, 1] (corresponding to a 64 × 64 grayscale image). I’ve used a matrix with shape (20 × 20) and 5000 iterations, using a distance function with σ(0) = 10 and τ = 400 for the first 100 iterations and then fixing the values σ = 0.01 and η between 0.1 and 0.5 for the remaining ones. In order to speed up the block computations, I’ve decided to use CuPy, which works like NumPy but exploits the GPU (it’s possible to switch to NumPy simply by changing the import and using the namespace np instead of cp). Unfortunately, there are many loops and it’s not so easy to parallelize all the operations. The code is available in this GIST:
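Since the GIST itself is not reproduced here, the following is only a minimal NumPy sketch of the training loop described above (the original code uses CuPy and a more elaborate schedule for σ and η; the clipping values below are simplifications of mine):

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces

K, L = 20, 20
n_iterations = 5000
sigma_0, tau = 10.0, 400.0

# First 100 Olivetti faces: 4096-dimensional vectors with values in [0, 1]
X = fetch_olivetti_faces().data[:100]

W = np.random.uniform(X.min(), X.max(), size=(K, L, X.shape[1]))
ii, jj = np.meshgrid(np.arange(K), np.arange(L), indexing='ij')

for t in range(n_iterations):
    sigma_t = max(sigma_0 * np.exp(-t / tau), 0.01)   # decaying radius, clipped at 0.01
    eta_t = max(np.exp(-t / tau), 0.1)                # decaying learning rate, clipped at 0.1
    x = X[np.random.randint(0, X.shape[0])]
    # Winning unit (minimum Euclidean distance)
    distances = np.linalg.norm(W - x, axis=2)
    i_w, j_w = np.unravel_index(np.argmin(distances), distances.shape)
    # Gaussian distance function centered on the winner
    delta_f = np.exp(-((ii - i_w) ** 2 + (jj - j_w) ** 2) / (2.0 * sigma_t ** 2))
    # Competitive update of the whole neighborhood
    W += eta_t * delta_f[:, :, np.newaxis] * (x - W)
```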

The matrix with the weight vectors reshaped as square images (like the original samples) is shown in the following figure:

A sharp-eyed reader can see some slight differences between the faces (in the eyes, nose, mouth, and brow). The most defined faces correspond to the winning units, while the others are neighbors. Consider that each of them is a synaptic weight vector and it must match a specific pattern. We can think of these vectors as pseudo-eigenfaces, like in PCA, even if the competitive learning tries to find the values that minimize the Euclidean distance, so the objective is not to find independent components. As the dataset is limited to 100 samples, not all the face details have been included in the training set.
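The mosaic itself can be reproduced with a few lines of Matplotlib, reusing the weight matrix W (and the shape constants K and L) from the sketch above:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(K, L, figsize=(12, 12))
for i in range(K):
    for j in range(L):
        # Each synaptic weight vector is reshaped back into a 64 x 64 grayscale image
        axes[i, j].imshow(W[i, j].reshape(64, 64), cmap='gray')
        axes[i, j].axis('off')
plt.show()
```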

References:

Floreano D., Mattiussi C., Bio-Inspired Artificial Intelligence, The MIT Press

See also:

ML Algorithms addendum: Passive Aggressive Algorithms – Giuseppe Bonaccorso
