Many clustering algorithms need to define the number of desired clusters before fitting the model. This requirement can appear as a contradiction in an unsupervised scenario, however, in many real-word scenarios, the data scientist has often already an idea about a reasonable range of clusters. Unfortunately this doesn’t imply being able to define the optimal number of clusters without a further analysis.
Cluster instability analysis is a very powerful tool to assess the optimality of a specific algorithm or configuration. The idea, developed in , is quite simple and can be easily understood considering a mechanical analogy. If we have some ball on a table, we want to group them so that small perturbations (movements) cannot alter the initial cluster assignment. For example, if two groups are very close, a perturbation can push a ball in another cluster, as shown in the following figure:
In this case, we can say that this choice is not very stable and it’s better to try with a different clustering approach or number of clusters.
More formally, if we consider a clustering function C(X) and an original dataset X, we can create n perturbed versions of X (for example, adding noise or subsampling):
Considering a distance metric function d(X, Y) between two clusterings (Euclidean, Hamming, and so on), it’s possible to define the instability as:
The following example is based on Scikit-Learn and K-Means algorithm. The original dataset (generated using the make_blobs function) is shown in the following figure:
To test the stability, 20 noisy versions are generated and the process is repeated for a number of clusters between 2 and 15 using the normalized Hamming distance:
The corresponding plot is:
If we exclude 2 and 3 clusters, a good solution seems to be offered by n=6, where the instability has a local minimum. Greater values are not acceptable because the instability increases over an acceptable threshold. After all, considering the initial distribution, it’s easy to understand that a very large number of clusters (for example 15) drives to very close groups, where a perturbation can alter dramatically the configuration, pushing too many points into a different cluster.
Picking n=6, a default Scikit-Learn K-Means produces the following result:
The clustering appears reasonable and the blocks show a good internal cohesion even if the inter-cluster separation is not very high (due to the original distribution, which is quite dense).
- Von Luxburg U., Cluster stability: an overview, arXiv 1007:1075v1
Recommendation system based on the user-item matrix factorization have become more and more important thanks to powerful and distributable algorithms like ALS, but sometimes the number of users and/or items is not so huge and the computation can be done using directly a SVD (Singular Value Decomposition) algorithm.