ML Algorithms addendum: Mutual information in classification tasks

Many classification algorithms, both in machine learning and in deep learning, adopt the cross-entropy as a cost function. This is a brief explanation of why minimizing the cross-entropy increases the mutual information between the training and learned distributions. If we call p the training set probability distribution and q the corresponding learned one, the cross-entropy is:

H(p, q) = -\sum_{x} p(x) \log q(x)

By adding and subtracting \sum_{x} p(x) \log p(x), we can manipulate this expression into:

H(p, q) = -\sum_{x} p(x) \log p(x) + \sum_{x} p(x) \log \frac{p(x)}{q(x)} = H(p) + D_{KL}(p \| q)

Therefore, the cross-entropy is equal to the sum of H(p), which is the entropy of the training distribution (that we can't control), and the Kullback-Leibler divergence of the learned distribution from the training one. As the first term is a constant, minimizing the cross-entropy is equivalent to minimizing the Kullback-Leibler divergence. We know that:

D_{KL}(p \| q) \geq 0, \quad \text{with equality if and only if } p(x) = q(x)

Therefore, the training process will "remodel" q(x) in order to minimize its divergence from p(x). In the following figure, there is a schematic representation of this process before the initial iteration, at iteration n, and at the end of the training process.
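Both facts can be checked numerically. Below is a minimal sketch in plain Python: it verifies the identity H(p, q) = H(p) + D_KL(p || q) on an arbitrary example distribution, and then "remodels" q(x) by gradient descent on the cross-entropy, using the standard softmax parameterization (for which the gradient of the cross-entropy with respect to the logits is q - p). The distributions and the learning rate are illustrative choices, not values from the article.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q)."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def softmax(z):
    """Map logits z to a probability distribution q."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical training distribution over three classes
p = [0.7, 0.2, 0.1]

# Identity: cross-entropy = entropy of p + KL divergence
q0 = [0.5, 0.3, 0.2]
assert abs(cross_entropy(p, q0) - (entropy(p) + kl_divergence(p, q0))) < 1e-12

# KL divergence is non-negative and vanishes only when q = p
assert kl_divergence(p, q0) > 0
assert abs(kl_divergence(p, p)) < 1e-12

# "Remodel" q(x): gradient descent on the cross-entropy.
# For q = softmax(z), the gradient w.r.t. the logits is (q - p).
z = [0.0, 0.0, 0.0]          # start from the uniform distribution
for _ in range(500):
    q = softmax(z)
    z = [zi - 0.5 * (qi - pi) for zi, qi, pi in zip(z, q, p)]
q = softmax(z)

# Since H(p) is constant, minimizing H(p, q) drove D_KL(p || q) to ~0
assert kl_divergence(p, q) < 1e-4
```

Because H(p) does not depend on the model, the loop above, which only ever minimizes the cross-entropy, is implicitly driving D_KL(p || q) toward zero, i.e. driving q(x) toward p(x).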