# Learning based on Data Compression

Bayesian statistics are used all over the place, but in the shadows lurks another inference method called the Minimum Description Length principle (MDL). Although MDL and Bayesian inference turn out to be quite closely related technically, they are the product of very different sequences of ideas. In this presentation I will describe the background of MDL and how it turns out to be related to Bayesian statistics. To this end I will introduce Kolmogorov complexity, which is a measure of the amount of information in observed data that does not require probabilistic assumptions. Although Kolmogorov complexity, which is itself based on Turing’s notion of effective computation, can theoretically be used to learn just about anything that you might want to learn, unfortunately it is itself uncomputable. MDL was developed as a computable version of Kolmogorov complexity, but later borrowed heavily from information theory and statistics as well. The Kraft inequality provides the crucial connection between methods based on data compression and methods based on probability theory.

This presentation focuses at no one single paper, but below I’ve listed the relevant references, many of which are classics. However, for those interested in compression-based learning I’d recommend looking at the following two textbooks instead:

P.Vitanyi and M. Li 2008. “An Introduction to Kolmogorov Complexity and its Applications”. Third edition, Springer Verlag. See http://homepages.cwi.nl/paulv/kolmogorov.html. P. Grunwald 2007. “The Minimum Description Length principle”. MIT Press. See http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11155

References:

A.M. Turing 1936. “On Computable Numbers, with an Application to the Entscheidungsproblem”. Proceedings of the London Mathematical Society series 2, 42:230-265. Many versions appear online, for example http://www.thocp.net/biographies/papers/turing_oncomputablenumbers_1936.pdf

C.E. Shannon 1948. “A Mathematical Theory of Communication”, Bell Systems Technical Journal 27:379-423,623-656. Available at http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

L.G. Kraft 1949. “A Device for Quantising, Grouping and Coding Amplitude-Modulated Pulses”. Master’s thesis, MIT . Not available online.

B. McMillan 1956. “Two Inequalities implied by Unique Decipherability”. IEEE Transactions on Information Theory, 2(4):115-116. Available at http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=01056818

R. Solomonoff 1964. “A Formal Theory of Inductive Inference” (Parts I,II). Information and Control, 7(1):1-22 (Part I) and 7(2):224-254 (part II). Available at http://world.std.com/rjs/1964pt1.pdf (part I) http://world.std.com/~rjs/1964pt2.pdf (part II)

A.N. Kolmogorov 1965. “Three Approaches to the Quantitative Definition of Information”. Problems of Information Transmission 1(1):1-7 Not available online.

J. Rissanen 1978. “Modeling by the Shortest Data Description”. Automatica 14:465-471

This talk is part of the Statistics Reading Group series.