Many classification algorithms, both in machine and in deep learning, adopt the cross-entropy as cost function. This is a brief explanation why minimizing the cross-entropy allows to increase the mutual information between training and learned distributions.
If we call p the training set probability distribution and q, the corresponding learned one, the cross-entropy is:
By manipulating this expression, we get:
Therefore, the cross-entropy is equal to the sum of H(p), which is the entropy of the training distribution (that we can’t control) and the Kullback-Leibler divergence of the learned distribution from the training one. As the first term is a constant, minimizing the cross-entropy is equivalent to minimizing the Kullback-Leibler divergence.
We know that:
Therefore, the training process will “remodel” q(x) in order to minimize its divervenge from p(x). In the following figure, there’s a schematic representation of this process before the initial iteration, at iteraion n and at the end of the training process:
If we consider, the mutual information between p(x) and q(x), we obtain:
The mutual information is the amount of information shared by both distributions and it is expressed as the entropy of the training distribution minus the “unresolved” uncertainty that q provides when chosen instead of p. In other words, if we have modeled q perfectly, it means that q = p, therefore H(q|q) = 0 and I(p;q) = H(p) (that represent the maximum amount of information that we can learn). Therefore, when we minimize the cross-entropy, we implicitly minimize the conditional entropy H(p|q), obtaining a maximization of the mutual information.
A good introductory book on Information Theory:
Fisher Information, named after the statistician Ronald Fisher, is a very powerful tool in both Bayesian Statistics and Machine Learning. To understand its “philosophical” meaning, it’s useful to think about a simple classification task, where our task is to find a model (characterized by a set of parameters) that is able to reproduce the data generating process p(x) (normally represented as a data set) with the highest possible accuracy.