Information Theory for Machine Learning

A unifying framework for representation, compression, and prediction.

Entropy and uncertainty

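Entropy measures the average surprise of a random variable's outcomes, where the surprise of an outcome with probability p is -log p. High entropy means the outcome is hard to predict; low entropy means the probability mass is concentrated on a few outcomes. In ML, entropy appears in objective design, calibration, and active learning.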
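A minimal sketch of the definition for a discrete distribution, using only the Python standard library. The function name `entropy` and the two example distributions are illustrative assumptions, not taken from any particular library.

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete distribution, in bits."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped: p * log(p) tends to 0 as p -> 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical example distributions over four outcomes.
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain
peaked = [0.97, 0.01, 0.01, 0.01]    # mass concentrated on one outcome

print(entropy(uniform))  # 2.0 bits for four equally likely outcomes
print(entropy(peaked))   # much smaller: the outcome is nearly certain
```

The uniform case reaches the maximum possible entropy for four outcomes, while the peaked case falls well below it, matching the "uncertainty versus concentration" reading.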

Cross-entropy and KL divergence

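Cross-entropy measures the average surprise when outcomes drawn from a true distribution p are scored under a predicted distribution q. It is the dominant loss for classification and language modeling. KL divergence quantifies the mismatch between p and q: it equals the cross-entropy minus the entropy of p, is never negative, and is zero only when the two distributions agree. It drives variational inference objectives such as the evidence lower bound (ELBO).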
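The relationship between the two quantities can be checked numerically. The sketch below assumes discrete distributions given as lists, with q strictly positive wherever p is; the function names and the example values are illustrative only.

```python
import math

def cross_entropy(p, q):
    """H(p, q) in nats: expected surprise under q for outcomes drawn from p.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise math.log raises ValueError.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats, under the same assumption on q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true and predicted distributions over three classes.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

entropy_p = cross_entropy(p, p)  # H(p) is the cross-entropy of p with itself
# Identity: H(p, q) = H(p) + KL(p || q)
print(cross_entropy(p, q), entropy_p + kl_divergence(p, q))

# With a one-hot label, cross-entropy reduces to -log of the probability
# assigned to the correct class, the usual classification loss.
one_hot = [0.0, 1.0, 0.0]
print(cross_entropy(one_hot, q), -math.log(q[1]))
```

Because H(p) does not depend on the model, minimizing cross-entropy over q is equivalent to minimizing KL(p || q).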

Mutual information

Mutual information measures how much knowing one variable reduces uncertainty about another; it is zero exactly when the variables are independent. It helps formalize feature relevance, representation quality, and self-supervised objectives that maximize signal about useful targets while discarding nuisance factors.
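For small discrete variables, mutual information can be computed directly from a joint probability table. The following is a sketch under that assumption; the function name and the two toy joint tables are hypothetical, and estimating mutual information for high-dimensional representations in practice requires other techniques.

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]         # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]   # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:  # zero-probability cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

# Two fair binary variables that are independent: no shared information.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Y is an exact copy of X: Y reveals the full one bit of X.
copied = [[0.5, 0.0],
          [0.0, 0.5]]

print(mutual_information(independent))  # 0.0
print(mutual_information(copied))       # 1.0
```

A feature behaving like the second table is fully relevant to its target, while one behaving like the first carries no signal about it.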

Compression and generalization

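The minimum description length principle links compression to learning: the preferred model is the one that minimizes the total cost of describing the model plus describing the data given the model. Models that encode data efficiently often capture structure rather than noise. This idea connects to regularization and complexity control.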
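One way to make the principle concrete is a two-part code for a binary sequence. The sketch below compares a parameter-free fair-coin model against a Bernoulli model with one fitted parameter. Charging (1/2) log2(n) bits for that parameter is a common asymptotic approximation adopted here as an assumption, and all function names and data are illustrative.

```python
import math

def data_bits(bits, theta):
    """Code length in bits of a 0/1 sequence under Bernoulli(theta)."""
    ones = sum(bits)
    zeros = len(bits) - ones
    total = 0.0
    # Only symbols that actually occur are charged, so theta of 0 or 1
    # is safe when it is the fitted frequency of the sequence.
    if ones:
        total -= ones * math.log2(theta)
    if zeros:
        total -= zeros * math.log2(1 - theta)
    return total

def fitted_model_length(bits):
    """Two-part code: parameter cost plus data cost given the parameter."""
    n = len(bits)
    theta = sum(bits) / n               # maximum-likelihood estimate
    param_bits = 0.5 * math.log2(n)     # assumed cost of one real parameter
    return param_bits + data_bits(bits, theta)

def fair_coin_length(bits):
    """No parameters to describe; every symbol costs exactly one bit."""
    return float(len(bits))

# Hypothetical sequences of length 100.
skewed = [1] * 90 + [0] * 10    # strong structure: mostly ones
balanced = [1, 0] * 50          # equal counts of ones and zeros

for name, seq in [("skewed", skewed), ("balanced", balanced)]:
    print(name, fitted_model_length(seq), fair_coin_length(seq))
```

On the skewed sequence the fitted model pays a few bits for its parameter and saves far more on the data, so it yields the shorter description. On the balanced sequence the extra parameter buys nothing and the simpler model wins, which is the complexity-control behavior the principle predicts.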

Takeaway: Information theory explains why many ML objectives work by reframing learning as efficient uncertainty reduction.