Giuseppe Bonaccorso

Artificial Intelligence – Machine Learning – Data Science

ML Algorithms Addendum: Fisher Information

09/02/2017 – Artificial Intelligence, Machine Learning, Machine Learning Algorithms Addenda, Python, Tensorflow

Fisher Information, named after the statistician Ronald Fisher, is a very powerful tool in both Bayesian Statistics and Machine Learning. To understand its “philosophical” meaning, it’s useful to think about a simple classification task, where the goal is to find a model (characterized by a set of parameters) that reproduces the data generating process p(x) (normally represented as a data set) with the highest possible accuracy. This condition can be expressed using the Kullback-Leibler divergence:
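Here p(x|M) denotes the distribution induced by the model M; in discrete form:

$$D_{KL}\big(p(x)\,\|\,p(x|M)\big)=\sum_{x}p(x)\,\log\frac{p(x)}{p(x|M)}$$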

Finding the parameters that minimize the Kullback-Leibler divergence is equivalent to maximizing the log-likelihood log p(x|M), because p(x) doesn’t contain any reference to M. Now, a good (informal) question is: how much information can p(x) provide to the model to simplify its task? In other words, given a log-likelihood, what is its behavior near the MLE point?
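Indeed, the divergence splits into an entropy term that depends only on p(x) and the expected log-likelihood under the model:

$$D_{KL}\big(p(x)\,\|\,p(x|M)\big)=\sum_{x}p(x)\log p(x)-\sum_{x}p(x)\log p(x|M)$$

so minimizing it with respect to the parameters of M is the same as maximizing the second term.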

Considering the following two plots, we can observe two different situations. The first one shows a very peaked likelihood, where the probability of any parameter value far from the MLE is close to zero. In the second chart, instead, the curve is flatter and p(MLE) is very close to many sub-optimal states.

Moreover, as the majority of learning techniques rely on the gradient, in the first case its magnitude is very large when getting close to the MLE, while in the second case the gradient is very small and the learning process can take quite a long time to reach the maximum.

Fisher Information helps in determining how peaked the likelihood function is, and provides a precise measure of the amount of information that the data generating process p(x) can supply to M to reach the MLE. The simplest definition is based on a single parameter ρ. In a discrete form (more common in machine learning), it’s equivalent to:
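Writing the expectation over p(x|ρ) explicitly, it is the expected squared score:

$$\mathcal{F}(\rho)=\sum_{x}p(x|\rho)\left(\frac{\partial \log p(x|\rho)}{\partial \rho}\right)^{2}$$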

The lower bound is 0, meaning that no information is provided, while there’s no upper bound (+∞). If the model, as often happens, has many parameters (ρ1, ρ2, …, ρN), it’s possible to compute the Fisher Information Matrix, defined as:
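Its elements generalize the previous expression to pairs of parameters:

$$\mathcal{F}_{ij}(\rho)=\sum_{x}p(x|\rho)\,\frac{\partial \log p(x|\rho)}{\partial \rho_i}\,\frac{\partial \log p(x|\rho)}{\partial \rho_j}$$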

Under some regularity conditions, it’s possible to transform both expressions using second derivatives (very often the Fisher Information Matrix is computed as the negative Hessian of the log-likelihood); however, I prefer to keep the original expressions because they can always be applied without particular restrictions. Even without explicit second derivatives, the Fisher Information provides a “speed” for the gradient computed with respect to each parameter: the faster it is, the easier the learning process.
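For reference, the second-derivative (negative Hessian) form is:

$$\mathcal{F}_{ij}(\rho)=-\,\mathbb{E}_{x\sim p(x|\rho)}\left[\frac{\partial^{2}\log p(x|\rho)}{\partial \rho_i\,\partial \rho_j}\right]$$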

Now, let’s suppose that we are estimating the value of a parameter ρ through an approximation ρ′. If E[ρ′] = ρ, the estimator is called unbiased, and the only statistical element to take into account is the variance. If the expected value is correct but the variance is very high, the probability of a wrong value is also very high. Therefore, in machine learning, we’re interested in knowing the minimum variance achievable by an unbiased estimator. The Cramér-Rao bound states that:
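For a single parameter:

$$\operatorname{Var}[\rho']\;\geq\;\frac{1}{\mathcal{F}(\rho)}$$

and, in the multi-parameter case, the covariance matrix of any unbiased estimator is bounded below by the inverse of the Fisher Information Matrix.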

Therefore, a high Fisher Information implies a low variance lower bound, while a low Fisher Information limits our model by imposing a quite high minimum achievable variance. Of course, this doesn’t mean that every unbiased estimator can reach the Cramér-Rao bound, but it provides us with a measure of the theoretical maximum accuracy we can achieve if our model is powerful enough.

To show how to compute the Fisher Information Matrix and apply the Cramér-Rao bound, let’s consider a very simple dataset (our data generating process) that can be easily learned by a Logistic Regression:
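The original scatter plot of the dataset is not embedded in this page; as a stand-in, here is a minimal sketch that generates a hypothetical toy dataset with a similar structure (two classes separable along the x2 axis, as discussed later). The generator, sample size, and class centers are illustrative assumptions, not the original data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Hypothetical 2D dataset: two classes separated along the x2 axis only
X, Y = make_blobs(n_samples=500, centers=[[0.0, -1.5], [0.0, 1.5]],
                  cluster_std=1.0, random_state=1000)

plt.scatter(X[:, 0], X[:, 1], c=Y, cmap='winter')
plt.xlabel('x1')
plt.ylabel('x2')
plt.show()
```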

For the sake of simplicity, the 2D data set has been modified by adding a “ones” column, so as to learn the bias like any other weight. The implementation is based on TensorFlow (available in this GIST), which allows numerically stable gradient computation:
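The GIST itself is not embedded here. The following is a minimal sketch of the same idea written for TensorFlow 2, not the original code: it fits the logistic regression by gradient ascent on the average log-likelihood and then estimates the empirical FIM as the average outer product of the per-sample score vectors. The dataset call, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_blobs

# Same hypothetical toy dataset as in the previous snippet
X, Y = make_blobs(n_samples=500, centers=[[0.0, -1.5], [0.0, 1.5]],
                  cluster_std=1.0, random_state=1000)

# Prepend a "ones" column so that the bias is learned like any other weight
Xb = tf.constant(np.hstack([np.ones((X.shape[0], 1)), X]), dtype=tf.float64)
Yf = tf.constant(Y, dtype=tf.float64)

w = tf.Variable(tf.zeros(3, dtype=tf.float64))  # (bias, w1, w2)

def log_likelihood(w):
    # Per-sample Bernoulli log-likelihood of the logistic regression model
    logits = tf.linalg.matvec(Xb, w)
    return Yf * tf.math.log_sigmoid(logits) + (1.0 - Yf) * tf.math.log_sigmoid(-logits)

# Fit the model with plain gradient ascent on the average log-likelihood
for _ in range(5000):
    with tf.GradientTape() as tape:
        avg_ll = tf.reduce_mean(log_likelihood(w))
    w.assign_add(0.5 * tape.gradient(avg_ll, w))

# Empirical Fisher Information Matrix: average outer product of the
# per-sample gradients (scores) of the log-likelihood
with tf.GradientTape() as tape:
    ll = log_likelihood(w)
scores = tape.jacobian(ll, w)                        # shape: (n_samples, 3)
fim = tf.reduce_mean(scores[:, :, None] * scores[:, None, :], axis=0)

print(fim.numpy())
```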

The resulting Fisher Information Matrix is:

The parameters are (bias, w1, w2). The result shows that the original data set provides quite a lot of information when estimating the bias and w2, and less information for the estimation of w1. The Cramér-Rao bound therefore sets a very small minimum variance for the bias and w2, and a relatively higher value for w1. An interpretation of this result can be derived from the structure of the data set: moving along the x2 axis, it’s easy to find a boundary that separates the blue and green dots, while this is not true when exploring the first dimension. Moreover, none of the FIM values is 0 (or very close to 0), meaning that the parameters aren’t orthogonal and can’t be learned separately.
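As a follow-up to the sketch above (and therefore tied to its hypothetical `fim` variable), the variance lower bounds for the single parameters can be read off the diagonal of the inverse FIM:

```python
import numpy as np

# Cramér-Rao lower bounds for (bias, w1, w2) from the inverse FIM,
# continuing from the `fim` computed in the previous sketch; for n i.i.d.
# samples the bound on the estimator variance scales as diag(inv(fim)) / n
crlb = np.diag(np.linalg.inv(fim.numpy()))
print(dict(zip(("bias", "w1", "w2"), crlb)))
```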

If you want to obtain an analytic FIM expression, it’s necessary to consider all possible values of ρ (or a wide set of them) and to compute the expected value by averaging over ρ.

For a deeper understanding of Fisher Information, you can watch this lecture on YouTube: https://www.youtube.com/watch?v=m1X4uC3hQ_o

See also:

ML Algorithms addendum: Mutual information in classification tasks – Giuseppe Bonaccorso

Many classification algorithms, both in machine learning and in deep learning, adopt the cross-entropy as cost function. This is a brief explanation of why minimizing the cross-entropy increases the mutual information between the training and learned distributions.


Tags: information theory, machine learning, statistical learning, tensorflow
