
KL divergence is motivated nicely from an information/coding theory viewpoint. It's very closely related to Shannon-von Neumann entropy [1], and KL(P||Q) measures the inefficiency of a code designed for a model distribution Q when the data are actually drawn from P: it is the expected number of extra bits you pay per symbol for using the wrong model.
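
A minimal numeric sketch of that coding interpretation (the specific distributions are made up for illustration): KL(P||Q) comes out as exactly the cross-entropy under the mismatched code minus the entropy of the true distribution.

    import numpy as np

    # Two discrete distributions over the same 4 symbols (illustrative values).
    p = np.array([0.5, 0.25, 0.125, 0.125])   # "reality"
    q = np.array([0.25, 0.25, 0.25, 0.25])    # model the code was designed for

    # Average code length when symbols come from p but the code is optimal for q:
    cross_entropy = -np.sum(p * np.log2(q))
    # Best achievable average code length for p:
    entropy = -np.sum(p * np.log2(p))

    # KL(P||Q) is exactly the extra bits per symbol paid for using the wrong model.
    kl = np.sum(p * np.log2(p / q))
    print(kl, cross_entropy - entropy)   # both ~0.25 bits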

A lot of recent work focuses on the Wasserstein distance [2] as an alternative. One advantage of Wasserstein over KL is that it compares distributions across their whole support rather than penalizing only the regions where the model puts mass, which helps prevent "mode collapse". This makes it a popular metric for training Generative Adversarial Networks (GANs); a small 1-D illustration is sketched below.
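
A quick 1-D sketch of why Wasserstein behaves better when distributions barely overlap (WGANs actually estimate this via a learned critic in the dual formulation; scipy's closed-form 1-D version is just for illustration):

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)

    # Samples from two 1-D distributions whose supports barely overlap (illustrative only).
    real = rng.normal(loc=0.0, scale=1.0, size=10_000)
    fake = rng.normal(loc=4.0, scale=1.0, size=10_000)

    # 1-D Wasserstein-1 ("earth mover") distance between the empirical samples.
    # It stays finite and shrinks smoothly as the two distributions move together,
    # whereas KL blows up when the model puts ~zero mass where the data lives.
    print(wasserstein_distance(real, fake))   # roughly 4.0, the gap between the means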

For recent work on applying Wasserstein distance to variational inference, see: https://arxiv.org/abs/1805.11284

[1]: https://physics.stackexchange.com/questions/64574/definition...
[2]: https://en.wikipedia.org/wiki/Wasserstein_metric
