Six (and a half) intuitions for KL divergence demystifies a mathy workhorse

A new explainer rounds up different ways to think about Kullback–Leibler (KL) divergence, that oddly behaved but indispensable quantity in information theory and machine learning. The post collects six (and yes, a half) intuitions that people use in practice: surprise, hypothesis testing, maximum likelihood, compression, gambling, and lottery-style gains. The write-up was reportedly linked and discussed on Hacker News, where readers praised its clarity and the tidy summary the author offers.
Six compact metaphors
The piece stitches together conceptually different but mathematically equivalent views. KL measures the extra surprise you incur, on average, when you believe Q but the world is P; it's the expected log-likelihood ratio in favor of P over Q in hypothesis testing when P is true; when P is the empirical distribution of the data, minimizing KL over a model Q gives the maximum likelihood estimate (MLE); it's the extra bits per symbol you waste compressing data from P with a code optimized for Q; and, delightfully, it's the expected logarithmic gain you'd have at a casino or in a lottery if you knew P while everyone else used Q. The post also explains why KL isn't symmetric: the average is taken in the world where P is actually true, so swapping P and Q asks a different question and generally gives a different number.
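To make the compression view and the asymmetry concrete, here is a minimal Python sketch for discrete distributions. The helper names (`kl_divergence_bits`, `entropy_bits`, `cross_entropy_bits`) and the example distributions are illustrative assumptions, not taken from the original post. It relies on the standard definition D(P || Q) = sum over x of P(x) log2(P(x) / Q(x)), measured in bits.

```python
import math

def kl_divergence_bits(p, q):
    """D(P || Q) in bits for two discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: outcomes that never occur under P contribute 0
        if qi == 0.0:
            return math.inf  # P allows an outcome that Q rules out entirely
        total += pi * math.log2(pi / qi)
    return total

def entropy_bits(p):
    """Average code length (bits) of an ideal code built for P, used on data from P."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0.0)

def cross_entropy_bits(p, q):
    """Average code length (bits) of an ideal code built for Q, used on data from P."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total -= pi * math.log2(qi)
    return total

# Hypothetical example: a fair coin (P) versus a biased belief about it (Q).
p = [0.5, 0.5]
q = [0.25, 0.75]

kl_pq = kl_divergence_bits(p, q)  # the world is P, you believe Q
kl_qp = kl_divergence_bits(q, p)  # the world is Q, you believe P
wasted_bits = cross_entropy_bits(p, q) - entropy_bits(p)

print(f"D(P || Q) = {kl_pq:.4f} bits")
print(f"D(Q || P) = {kl_qp:.4f} bits")
print(f"wasted bits per symbol = {wasted_bits:.4f}")
```

The wasted-bits figure matches D(P || Q) up to floating-point error, which is the compression intuition in one line, while the two directions give different values, which is the asymmetry.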
Why readers should care
KL divergence isn’t just an abstract curiosity. It shows up everywhere from variational inference to model selection and generative modeling. The emotional payoff — that satisfying “aha!” when a dozen opaque formulas collapse into one intuitive image, like wasted bits or betting odds — is the heart of the post. The author even suggests the summary may deliver more than half the value, so you can get a lot from a quick skim.
If you’ve wrestled with why KL can blow up when probabilities vanish, or why it’s not a distance in the usual sense, this is the kind of explainer that ties the loose ends together. Read it for the metaphors; keep it for the moments you need to explain KL to a teammate without reaching for integrals.
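The blow-up is easy to reproduce. The sketch below is a hedged illustration with made-up distributions and a hypothetical helper, `kl_divergence_bits`, using the usual conventions that terms with P(x) = 0 contribute nothing and that any outcome with P(x) > 0 but Q(x) = 0 makes the divergence infinite.

```python
import math

def kl_divergence_bits(p, q):
    """D(P || Q) in bits for two discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # Q assigns zero probability to something P produces
        total += pi * math.log2(pi / qi)
    return total

# Hypothetical example: P spreads its mass over two outcomes, Q is certain of one.
p = [0.5, 0.5, 0.0]
q = [1.0, 0.0, 0.0]

blown_up = kl_divergence_bits(p, q)  # infinite: Q rules out an outcome P produces
reverse = kl_divergence_bits(q, p)   # finite: P covers everything Q produces

# As Q's probability for the second outcome shrinks toward zero, D(P || Q) grows.
growing = [
    kl_divergence_bits(p, [1.0 - eps, eps, 0.0])
    for eps in (1e-2, 1e-4, 1e-8)
]

print(blown_up, reverse, growing)
```

One direction is infinite while the other stays finite, which is also a quick way to see why KL cannot be a distance in the usual sense: a true metric would have to give the same value in both directions.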
Sources: perfectlynormal.co.uk, Hacker News