What is the difference between “abababababababababab” and “4c1j5b2p0cv4w1x8rx2y39”? Apart from the alphabet size (the first sequence uses only 2 letters, while the second draws from a larger pool of letters and numbers), the difference lies in the way of repetition. We can represent the first sequence with a much shorter instruction: repeat “ab” ten times (“ab”x10), whereas the second one does not lend itself to compression. One could argue that the business of physics has always been about compressing the wide range of natural phenomena into mathematical formulae. One of the most celebrated examples is “F=ma”, which describes an apple falling, the motion of the planets, and countless other instances. Compressibility is intimately linked to complexity, which can be defined as “the length of the shortest possible description”. So, the shorter the description of a process, the smaller its complexity. Note that complex does not mean complicated (made of many interconnected parts). So, in order to mathematically define compressibility –which we can show to be proportional to redundancy and inversely proportional to Shannon’s entropy, and thus to uncertainty– we only need to carry out a little heuristic exercise and count our savings at the end. For instance, given the sequence “S->abcdcbcdfbcd”, we can define a new symbol A as “A->bcd” and thus write “S->aAcAfA”. If we count the initial and final number of symbols (the definition of the new rule counts toward the total), we get 12 and 11 respectively, and thus we save 1 of the initial 12; compressibility yields C=1/12 – like a discount in a supermarket, we save a little less than 10%. Bear in mind that the letters we use here are abstractions that can represent anything we write down as a discrete sequence of data: spikes in the brain, birds singing, DNA sequences, mice pressing a lever for sucrose in an operant box, or babies babbling while learning to speak.
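The counting above can be checked with a minimal Python sketch. This is only an illustrative convention, not the authors' exact bookkeeping: we assume each extra rule costs its right-hand side plus 2 symbols for its header (e.g. “A->”), which reproduces the 12-to-11 count.

```python
def grammar_size(start_body, rules):
    """Total symbols in the grammar: the start rule's body, plus, for each
    extra rule, its body plus 2 symbols of assumed header overhead."""
    return len(start_body) + sum(len(body) + 2 for body in rules.values())

def compressibility(original, start_body, rules):
    """C = (symbols saved) / (original length)."""
    return (len(original) - grammar_size(start_body, rules)) / len(original)

original = "abcdcbcdfbcd"                          # 12 symbols
grammar = {"A": "bcd"}                             # the rule A->bcd
print(grammar_size("aAcAfA", grammar))             # 11
print(compressibility(original, "aAcAfA", grammar))  # 1/12, a bit under 10%
```

The same two functions apply unchanged to any discrete sequence, whatever the letters stand for.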
If we now try to compress this other sequence, “S->babaabaabaa”, we realize that there are many candidate repeating motifs: some are longer and seem better candidates, others are shorter but repeated more often. So one should try the same procedure on all of them (that is why the method is called heuristic, and why it has to be implemented on computers in order to deal with sequences whose size can be of the order of millions of characters). When we try the longest motif, “abaa”, we realize that no savings are possible, so compressibility is zero in that case. When we try the most frequent motif, “ba”, we realize we can compress the sequence twice (first defining “A->ba” and then “B->Aa”, obtaining “S->ABBB”), in what is called a depth-2 iteration, but compressibility is still zero. Finally, choosing the grammar “A->aba”, we find “S->bAAAa”, thus saving 1 symbol [Nevill-Manning and Witten 2000]. Our original sequence is then 1/11 compressible. Unlike Shannon’s approach, this is a scale-independent (or resolution-free) method, because we did not need to decide a priori on the n-gram length. However, compressibility is intimately related to Shannon’s entropy (H): a random sequence is incompressible (C=0) and has maximum entropy (H max), whereas a very repetitive sequence is highly compressible (C close to 1) and its entropy is small (H close to zero). Thus we learn how two different approaches, with two different mathematical implementations, address the same question: how to detect order in a sequence of data. In fact, that (plus an initial hypothesis framing the expectation of what is to be found, and some necessary statistical “checks” at the end) is what our work as scientists is all about: discovering repeated patterns in what appears to be noise. Science is the quantitative game of difference and repetition.
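The heuristic search over candidate motifs can be sketched in a few lines of Python. As in the sketch above, the rule-cost convention (motif length plus 2 symbols of header overhead) is an assumption chosen to match the counts in the text, and replacement is greedy and non-overlapping, as in Python's `str.replace`.

```python
def savings(seq, motif):
    """Symbols saved by rewriting `seq` with a single rule A->motif.
    Greedy non-overlapping replacement; the new rule is assumed to cost
    len(motif) + 2 symbols (its body plus a header like "A->")."""
    rewritten = seq.replace(motif, "A")
    return len(seq) - (len(rewritten) + len(motif) + 2)

seq = "babaabaabaa"

# Every substring of length >= 2 that repeats is a candidate motif.
candidates = {seq[i:j]
              for i in range(len(seq))
              for j in range(i + 2, len(seq) + 1)
              if seq.count(seq[i:j]) > 1}

for motif in ("abaa", "ba", "aba"):
    print(motif, savings(seq, motif))   # abaa: 0, ba: 0, aba: 1

best = max(savings(seq, m) for m in candidates)
print(best / len(seq))                  # best compressibility: 1/11
```

Trying every candidate this way is exactly what makes the method heuristic rather than analytic, and why real sequences of millions of characters need smarter on-line algorithms such as the one described by Nevill-Manning and Witten.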
[Nevill-Manning and Witten 2000] Craig G. Nevill-Manning and Ian H. Witten, “On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences”, Proceedings of the IEEE, Vol. 88, No. 11, November 2000.