Many classification algorithms, both in machine and in deep learning, adopt the cross-entropy as cost function. This is a brief explanation why minimizing the cross-entropy allows to increase the mutual information between training and learned distributions.

If we call p the training set probability distribution and q, the corresponding learned one, the cross-entropy is: By manipulating this expression, we get: Therefore, the cross-entropy is equal to the sum of H(p), which is the entropy of the training distribution (that we can’t control) and the Kullback-Leibler divergence of the learned distribution from the training one. As the first term is a constant, minimizing the cross-entropy is equivalent to minimizing the Kullback-Leibler divergence.

We know that: Therefore, the training process will “remodel” q(x) in order to minimize its divervenge from p(x). In the following figure, there’s a schematic representation of this process before the initial iteration, at iteraion n and at the end of the training process: If we consider, the mutual information between p(x) and q(x), we obtain: The mutual information is the amount of information shared by both distributions and it is expressed as the entropy of the training distribution minus the “unresolved” uncertainty that q provides when chosen instead of p. In other words, if we have modeled q perfectly, it means that q = p, therefore H(q|q) = 0 and I(p;q) = H(p) (that represent the maximum amount of information that we can learn). Therefore, when we minimize the cross-entropy, we implicitly minimize the conditional entropy H(p|q), obtaining a maximization of the mutual information.

A good introductory book on Information Theory:  