
Giuseppe Bonaccorso

Artificial Intelligence – Machine Learning – Data Science


Assessing clustering optimality with instability index

08/03/2017 – Machine Learning, Python, Scikit-Learn

Many clustering algorithms need the number of desired clusters to be defined before fitting the model. This requirement can seem like a contradiction in an unsupervised scenario; however, in many real-world problems, the data scientist often already has an idea of a reasonable range of clusters. Unfortunately, this doesn't imply being able to determine the optimal number of clusters without further analysis.

Cluster instability analysis is a very powerful tool for assessing the optimality of a specific algorithm or configuration. The idea, developed in [1], is quite simple and can be easily understood through a mechanical analogy. If we have some balls on a table, we want to group them so that small perturbations (movements) do not alter the initial cluster assignment. For example, if two groups are very close, a perturbation can push a ball into another cluster, as shown in the following figure:

In this case, we can say that this choice is not very stable, and it's better to try a different clustering approach or a different number of clusters.

More formally, if we consider a clustering function C(X) and an original dataset X, we can create n perturbed versions of X (for example, by adding noise or subsampling):
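$$X_i = X + N_i, \qquad i = 1, 2, \ldots, n$$

where each N_i is a random noise term (this is the additive-noise case; with subsampling, each X_i is instead a random subset of X).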

Considering a distance function d(·, ·) between two clusterings (Euclidean, Hamming, and so on), it's possible to define the instability as:
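$$I(C) = \frac{1}{n^2} \sum_{i,j=1}^{n} d\big(C(X_i),\, C(X_j)\big)$$

This is the empirical estimator discussed in [1]: the lower the value of I(C), the more stable the clustering.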

The following example is based on Scikit-Learn and the K-Means algorithm. The original dataset (generated using the make_blobs function) is shown in the following figure:
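A minimal sketch of the generation step (the exact parameters, such as the number of samples, centers, and standard deviation, aren't shown here and are assumptions):

```python
from sklearn.datasets import make_blobs

# Illustrative parameters (assumptions): several dense, partially overlapping blobs
nb_samples = 1000
X, _ = make_blobs(n_samples=nb_samples, n_features=2, centers=8,
                  cluster_std=2.0, random_state=1000)
```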

To test the stability, 20 noisy versions of the dataset are generated, and the process is repeated for a number of clusters between 2 and 15, using the normalized Hamming distance between the label assignments:
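A minimal sketch of this procedure, assuming Gaussian additive noise (the noise scale is an assumption) and reusing X from the snippet above:

```python
import numpy as np
from scipy.spatial.distance import hamming
from sklearn.cluster import KMeans

# Create 20 perturbed copies of the dataset (X comes from the previous snippet)
np.random.seed(1000)
X_noisy = [X + np.random.normal(0.0, 0.25, size=X.shape) for _ in range(20)]

instabilities = []

for k in range(2, 16):
    # Cluster every perturbed copy with the same number of clusters
    labels = [KMeans(n_clusters=k, random_state=1000).fit_predict(Xn)
              for Xn in X_noisy]

    # Average normalized Hamming distance over all pairs of label assignments
    # (scipy's hamming returns the fraction of mismatching components)
    d = [hamming(labels[i], labels[j])
         for i in range(len(labels)) for j in range(len(labels))]
    instabilities.append(np.mean(d))
```

Note that the Hamming distance compares raw label vectors, so it is sensitive to label permutations (the same partition can receive different indices across runs); the resulting index should therefore be read as a pragmatic approximation of the theoretical one.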

The corresponding plot is:
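Continuing the snippet above, the instability profile can be plotted with matplotlib:

```python
import matplotlib.pyplot as plt

# Instability as a function of the number of clusters
plt.plot(range(2, 16), instabilities, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Instability')
plt.grid(True)
plt.show()
```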

If we exclude 2 and 3 clusters, a good solution seems to be n=6, where the instability has a local minimum. Greater values are not acceptable because the instability rises above a reasonable threshold. After all, considering the initial distribution, it's easy to understand that a very large number of clusters (for example, 15) leads to very close groups, where a perturbation can dramatically alter the configuration by pushing too many points into a different cluster.

Picking n=6, a default Scikit-Learn K-Means produces the following result:
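A minimal sketch of this final step (the plotting details are illustrative):

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Fit the final model with the number of clusters suggested by the analysis
km = KMeans(n_clusters=6, random_state=1000)
Y = km.fit_predict(X)

# Visualize the resulting assignment
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap='tab10', s=10)
plt.show()
```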

The clustering appears reasonable, and the clusters show good internal cohesion, even if the inter-cluster separation is not very high (due to the original distribution, which is quite dense).

References:

  1. Von Luxburg U., Clustering stability: an overview, arXiv:1007.1075v1
