Data Compression/The zero-frequency problem

One of the biggest problems in inferring a model is the zero-frequency problem ("Zero frequency problem and escape codes", Arturo San Emeterio Campos, 1999).

While decompressing a file, an adaptive Huffman decoder keeps revising its estimate of which letters are likely and which are unlikely, based on the letters seen so far in the plaintext: it predicts future letter probabilities by counting how often each letter has already occurred. The decoder then decodes the next letter using its current table of codewords, which assigns a short codeword to each letter it predicts is highly likely, and a long codeword to each letter it predicts is possible but unlikely.
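To make the counting concrete, here is a minimal Python sketch of the idea. It is not the real adaptive Huffman algorithm (which updates its code tree incrementally rather than rebuilding it from scratch); the function name huffman_code_lengths and the sample text are illustrative assumptions only.

```python
import heapq
import itertools

def huffman_code_lengths(counts):
    """Build a Huffman code from the running byte counts and return
    {byte: codeword length}. Bytes with a zero count get no codeword at
    all, which is exactly the zero-frequency problem described below."""
    tie = itertools.count()                 # tie-breaker so the heap never compares dicts
    heap = [(c, next(tie), {sym: 0}) for sym, c in counts.items() if c > 0]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                      # a lone symbol still needs a 1-bit codeword
        return {sym: 1 for sym in heap[0][2]}
    while len(heap) > 1:                    # repeatedly merge the two rarest subtrees
        c1, _, depths1 = heapq.heappop(heap)
        c2, _, depths2 = heapq.heappop(heap)
        merged = {sym: d + 1 for sym, d in {**depths1, **depths2}.items()}
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]

# The decoder updates its counts after every decoded byte, so the code table
# always reflects the letter frequencies seen so far in the plaintext.
counts = {}
for byte in b"a quotation from Gadsby, which contains no such symbol at all":
    counts[byte] = counts.get(byte, 0) + 1
lengths = huffman_code_lengths(counts)
print(ord('e') in lengths)                                # False: 'e' has no codeword at all
print(sorted(lengths.items(), key=lambda kv: kv[1])[:3])  # the commonest bytes get the shortest codewords
```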

For example, say a text file begins with an extended quotation from "Gadsby" (1939), a novel written entirely without the letter 'e'. The simplest approach -- the frequency of 'e' so far is zero, so assume it will continue to be zero -- will not work.

If a Huffman decoder ever gets it in its head that some byte has zero frequency, it is not possible to convince that decoder to emit that byte -- not now, not ever. That byte is no longer part of its dictionary. And so it will not be possible for the decoder to correctly emit plaintext that eventually does use the letter 'e'.

So we must give the Huffman decoder some non-zero estimate of the future frequency of 'e', if we are to have any hope of correctly decoding plaintext that begins with a quotation from "Gadsby", then later includes the letter 'e'. Even an inaccurate estimate -- as long as it's not exactly zero -- allows the decoder to correctly emit plaintext without error. An accurate estimate will produce a shorter compressed file than an inaccurate estimate. Alas, it is difficult to accurately estimate the frequency of something that has never occurred yet.[1]
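One simple way to guarantee a non-zero estimate is add-one (Laplace) smoothing: pretend every byte has already occurred once before the file even starts. The sketch below, using the illustrative helper smoothed_probability, shows the idea; practical compressors usually use more refined escape-code schemes, as in the Campos article cited above.

```python
from collections import Counter

def smoothed_probability(symbol, counts, alphabet_size=256):
    """Add-one (Laplace) smoothed estimate: every byte is credited with one
    imaginary prior occurrence, so no estimate is ever exactly zero."""
    total = sum(counts.values())
    return (counts[symbol] + 1) / (total + alphabet_size)

# Counts taken from an e-free stretch of plaintext, as in the "Gadsby" example.
counts = Counter(b"a quotation from Gadsby, which contains no such symbol at all")
print(smoothed_probability(ord('e'), counts))   # small, but not zero: 'e' keeps a codeword
print(smoothed_probability(ord('a'), counts))   # larger, since 'a' has actually been seen
```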

Note:
Static Huffman decompressors never have this problem -- they can be, and often are, told that some byte or another has zero frequency, and everything works out correctly.

Context-mixing adaptive decompressors have a similar problem. To get good compression, the decompressor needs to accurately predict the probability that the next bit is a 1, P(x). After decompressing enough text, the decompressor can see that event A has occurred Ax times, and that the next bit was a one in A1 of those times. So, in the absence of other information, when A occurs again it estimates that the next bit is a one with probability P(x|A) = A1/Ax. If A and B have occurred simultaneously many times before (ABx times, with the next bit a one in AB1 of those cases), then when A and B occur simultaneously again, our best estimate of P(x|A,B) -- the probability that the next bit is a one -- is AB1/ABx, completely ignoring A1, Ax, B1, and Bx.
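A small worked example, using made-up counts, shows how these estimates come straight out of the counters (the variable names follow the text above):

```python
# Hypothetical counts, chosen only for illustration.
A1, Ax = 30, 40        # A occurred 40 times; the next bit was a one in 30 of them
B1, Bx = 10, 100       # B occurred 100 times; the next bit was a one in 10 of them
AB1, ABx = 7, 8        # A and B occurred together 8 times; the next bit was a one in 7

p_given_A  = A1 / Ax       # P(x|A)   = 0.75
p_given_B  = B1 / Bx       # P(x|B)   = 0.10
p_given_AB = AB1 / ABx     # P(x|A,B) = 0.875, ignoring the A-only and B-only counts
print(p_given_A, p_given_B, p_given_AB)
```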

However, the first time A and B occur simultaneously, we can't do that -- we can't calculate 0/0. All we can do is guess that P(x|A,B) is either the same as P(x|A), or the same as P(x|B), or somehow combine the two predictions, possibly using a heuristic, such as the Lincoln index, that takes into account how many times each event occurred. Given the event counts A1, Ax, B1, and Bx, what is a good estimate of P(x|A,B)? Alas,

"Probability theory doesn't help us, even if we assume A and B are independent."

References
