ML Algorithms Addendum: Instance Based Learning

Contrary to the majority of machine learning algorithms, Instance-Based Learning is model-free, meaning that there are no strong assumptions about the structure of regressors, classifiers or clustering functions. They are “simply” determined by the data, according to an affinity induced by a distance metric (the most common name for this approach is Nearest Neighbors).

If we have a dataset X of n samples, each one represented by m features:

$X = \{x_1, x_2, \ldots, x_n\}$ with $x_i \in \mathbb{R}^m$

it’s possible to introduce a metric in the m-dimensional hyperspace. The most common choice is the Euclidean metric:

$d(x_i, x_j) = \sqrt{\sum_{k=1}^{m} \left(x_i^{(k)} - x_j^{(k)}\right)^2}$
Computing the distance between every pair of samples, we get a symmetric matrix M which is already a global measure of the affinity among points:

$M_{ij} = d(x_i, x_j) \quad \forall i, j \in \{1, \ldots, n\}$
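For instance, the matrix M can be computed with Scikit-Learn’s pairwise_distances. The following is only a minimal sketch, where the dataset is generated randomly for illustration purposes:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy dataset: n samples living in an m-dimensional feature space
n, m = 100, 5
X = np.random.RandomState(1000).uniform(-1.0, 1.0, size=(n, m))

# Symmetric matrix of Euclidean distances between every pair of samples
M = pairwise_distances(X, metric='euclidean')

print(M.shape)               # (100, 100)
print(np.allclose(M, M.T))   # True: M is symmetric
```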

Indeed, even if it’s not a model, this is the most similar thing to a model based on the natural structure of the training data. As the hyperspace is normally a feature space, two points that are close share (or have similar) features. This is the only real assumption in this kind of approach and, in fact, the geometry of the dataset (together with a distance function) is the algorithm itself. Therefore, a prediction is made by searching for the top k nearest neighbors and:

  • In a clustering approach, considering them as a set of very similar objects whose features can be inherited by the new sample. A common application is in user-based recommendation systems, where a new user has to “borrow” the features provided by his/her neighbors.
  • In a classification approach, taking a vote among the neighbors’ classes (it’s also possible to weight the votes using the normalized distances), so that the class of the new sample is easily determined (see the sketch after this list).
  • In a regression approach, considering the (distance-weighted) values of all the neighbors as training points for a standard regression.
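As a reference, this is a minimal sketch of the classification and regression cases with Scikit-Learn’s KNeighborsClassifier and KNeighborsRegressor (the latter simply averages the distance-weighted neighbor values). The dataset is synthetic and all parameter values are purely indicative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Synthetic dataset (purely illustrative)
X, Y = make_classification(n_samples=500, n_features=10, random_state=1000)

# Classification: distance-weighted vote among the 10 nearest neighbors
knc = KNeighborsClassifier(n_neighbors=10, weights='distance')
knc.fit(X, Y)
print(knc.predict(X[0:1]))

# Regression: distance-weighted average of the neighbors' target values
# (the labels are reused as continuous targets only for illustration)
knr = KNeighborsRegressor(n_neighbors=10, weights='distance')
knr.fit(X, Y.astype(np.float64))
print(knr.predict(X[0:1]))
```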

Unfortunately, computing M is very expensive: with n samples belonging to an m-dimensional space, we need to compute m multiplications for every pair of samples. This means that this algorithm, often called “brute-force”, has a “training” complexity O(mn²) and a prediction complexity O(mn) per query, so it suffers dramatically from the curse of dimensionality.

To overcome this problem, two alternatives have been proposed (and both are implemented by Scikit-Learn):

  • KD-Trees
  • Ball-Trees

A KD-Tree is a natural extension of a binary tree to k-dimensional data. The idea is quite simple and can be explained by considering the following figure:

If we set a threshold R, the (reverse) triangle inequality guarantees that:

$d(x_1, x_3) \geq d(x_1, x_2) - d(x_2, x_3) > R$

In other words, if the distance between x1 and x2 is well above the fixed threshold (disqualifying x2 from being a neighbor) and the distance between x2 and x3 is much shorter than R, then x3 is also a non-neighbor and it doesn’t make sense to explicitly compute the distance between x1 and x3. With a KD-Tree, the space is partitioned using binary k-dimensional nodes and in each of them only a single dimension is chosen to split the original sub-space into two half-spaces. The dynamic is very intuitive in 2D. Thanks to Wikipedia contributors KiwiSunset and MYguel, I can show a simple example, without the need to redraw from scratch.

There are 6 points in a 2D space: (2, 3), (5, 4), (9, 6), (4, 7), (8, 1) and (7, 2):

In the figure, the vertical lines represent cuts on the X-axis and the horizontal lines represent cuts on the Y-axis. Starting from the point (7, 2) and cutting on X, we have 3 points where x<7 and two points where x>7:

At the second level, we need to use Y:

  • (5, 4) -> There is one node (2, 3) where y<4 and another node (4, 7) where y>4
  • (9, 6) -> There is a single final (leaf) node (8, 1) where y<6 (a minimal code sketch follows this list)
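The same tree can be built and queried with Scikit-Learn’s KDTree class; this is only a minimal sketch based on the six points above (the test point and the leaf_size value are arbitrary):

```python
import numpy as np
from sklearn.neighbors import KDTree

# The six 2D points of the example
points = np.array([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)], dtype=np.float64)

kd_tree = KDTree(points, leaf_size=2)

# Query the 3 nearest neighbors of an arbitrary test point
distances, indices = kd_tree.query(np.array([[6.0, 5.0]]), k=3)
print(points[indices[0]])
print(distances[0])
```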

The procedure is simple and can be extended to k-dimensional samples. As it is a binary tree, the training complexity becomes O(mn log(n)). Even if this is a better result than the brute-force algorithm, a KD-Tree isn’t able to exploit the radial structure of the Euclidean metric and it’s very sensitive to the dimensionality.

An alternative algorithm that can often outperform a KD-Tree is the Ball-Tree. A Ball-Tree is also a binary tree, but, in this case, each node defines a hypersphere and its two sub-nodes define two sub-hyperspheres whose union must contain all the points assigned to the parent node, and such that each point is assigned to exactly one sub-node (even if the hyperspheres themselves can overlap). The dynamic is not very different from KD-Trees, but in general, Ball-Trees are more efficient when the dimensionality is very high. In fact, considering a scenario similar to the one previously presented, we can consider a point x1 (possibly a prediction point) and another point x2 with a subtree where we find x3 and x4:

During a query, if x2 is the closest point found so far (with respect to x1), then all the points belonging to a ball whose centroid is farther from x1 than d(x1, x2) plus the ball’s radius can be completely discarded, because all of its points are necessarily more distant than x2. If the balls are dense, this allows a dramatic reduction of the number of comparisons, and the overall performance becomes superior to that of a KD-Tree.
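The usage is analogous to the KD-Tree; the following is a minimal sketch with Scikit-Learn’s BallTree on a random high-dimensional dataset (all values are purely illustrative):

```python
import numpy as np
from sklearn.neighbors import BallTree

# Random high-dimensional dataset (purely illustrative)
X = np.random.RandomState(1000).uniform(-1.0, 1.0, size=(1000, 50))

ball_tree = BallTree(X, leaf_size=20)

# 5 nearest neighbors of the first sample (the sample itself is included)
distances, indices = ball_tree.query(X[0:1], k=5)
print(indices[0])
print(distances[0])
```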

Scikit-Learn implements clustering, classification, and regression based on k-Nearest-Neighbors algorithms (with brute force, KD-Trees and Ball-Trees, and different metrics). Usage is very intuitive and there’s no need to provide further examples (on the website there are many interesting applications); however, I wanted to benchmark the performance of all the algorithms. In the following snippet, 10 datasets (with an increasing number of samples and dimensions) are created and a NearestNeighbors instance is initialized for each pair (dataset, algorithm):
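A minimal sketch of such a benchmark could look like the following (the dataset sizes, dimensions and number of query points are indicative assumptions, not the original values):

```python
import time

import numpy as np
from sklearn.neighbors import NearestNeighbors

# 10 datasets with an increasing number of samples and dimensions
# (the actual sizes are indicative assumptions)
sizes = [(1000 * (i + 1), 50 * (i + 1)) for i in range(10)]
algorithms = ('brute', 'kd_tree', 'ball_tree')

rs = np.random.RandomState(1000)

for n, m in sizes:
    X = rs.uniform(-1.0, 1.0, size=(n, m))

    for algorithm in algorithms:
        nn = NearestNeighbors(n_neighbors=10, algorithm=algorithm)

        start = time.time()
        nn.fit(X)                 # "Training": stores the data or builds the tree
        training_time = time.time() - start

        start = time.time()
        nn.kneighbors(X[0:100])   # Prediction: query 100 test points
        prediction_time = time.time() - start

        print('{} (n={}, m={}): fit = {:.3f}s, query = {:.3f}s'.format(
            algorithm, n, m, training_time, prediction_time))
```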

Training time plots:

Prediction time plots:

As expected, the training time is shorter for brute force, but the corresponding prediction time is considerably longer. Both KD-Tree and Ball-Tree have longer training times due to the need to build the trees; however, queries are very fast. For example: with 500,000 2000-dimensional samples, brute force needs about 11 seconds, while a KD-Tree needs about 0.95 and a Ball-Tree 0.55. Brute-force performance is acceptable only when both n and m are very small and the cost of building a tree is too high. In any case, Scikit-Learn provides a fourth alternative called ‘auto’, which automatically selects the best strategy according to the data and its dimensionality, and uses the ‘leaf_size’ parameter to switch from a tree-based algorithm to brute force.
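For completeness, this is a minimal sketch of the ‘auto’ strategy (the dataset and the leaf_size value are arbitrary; 30 is Scikit-Learn’s default):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.RandomState(1000).uniform(-1.0, 1.0, size=(5000, 20))

# 'auto' lets Scikit-Learn pick among brute force, KD-Tree and Ball-Tree,
# while leaf_size is the number of points at which a tree-based query
# switches to brute force inside a leaf
nn = NearestNeighbors(n_neighbors=10, algorithm='auto', leaf_size=30)
nn.fit(X)

distances, indices = nn.kneighbors(X[0:1])
print(indices[0])
```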
