Models in Compression


After Information Theory showed how to build a perfect code for any specific symbolic system, people found that it was relatively easy to come up with new coding conventions for new types of data, and the digital revolution was on. At one point it was wondered just what couldn't be stored in a computer.

It was about this time, that scientists started looking for more extraordinary ways to reduce storage costs. One scientist hit on a solution. He said, if we model our data, we will find that certain patterns develop in the data. If we can code the longest and most prevalent patterns with the shortest codes and the shorter patterns with less prevalence with longer codes, we can reduce the size of a file without losing any data.

Unfortunately what he found was that the model needed to optimize the length of a file in one type of data was not the same as the model needed to optimize the length of a file in another type.

Another scientist noticed that when he looked at the binary codes of his data, there were strings of repetitive patterns. Why not, he said, abstract out the patterns, and simply count how many of them ran in a row and code the file so that the code efficiently handled statistically significant numbers of repetitions. So for instance consider a letter on a paper. Using this technique all white space that consisted of blanks would be reduced to a single space, and a number to indicate how many were needed to fill out the line.

In each of these cases, the scientist modelled their data, and used something they recognized in their model to significantly reduce storage. The model was what gave them the leverage to reduce the storage size without losing information.

Now a cautionary note, I once worked with a man that had great ideas about how to encode data. He wanted to develop a compression scheme that could be re-entered and compress even further. But his chosen technique didn't include a model of how encoding the data changed the data. He is still looking for a model today.


To do:
Say a few words about grammar-based compression, aka w:grammar-based codes.