Data Compression/data differencing

data differencing


Data compression can be viewed as a special case of data differencing.[1] Just as data compression involves 2 processes that often run on 2 different machines -- the "data compressor" and the "data expander" -- data differencing involves 2 processes that often run on 2 different machines -- the "differ" and the "patcher", respectively.

Some algorithms and programs used in data differencing are very similar to algorithms used in "small-window compression" (aka traditional data compression) -- such as "large-window compression" (aka "data folding"[2]) -- while others are very different -- such as "data de-duplication"[3][4] and the rsync algorithm.

A few algorithms used in traditional data compression of large files are similar to data differencing algorithms -- such as some techniques for storing many Huffman frequency tables described in Huffman: storing the frequency table.

Some algorithms and programs developed for data differencing include:

  • diff ("w: delta compression")
  • rsync
  • rdiff
  • lzip
  • rzip
  • IZO (Information Zipping Optimizer)[2]
  • xdelta[5]
  • bsdiff 4 and bspatch[6]
  • bsdiff 6 and Percival's universal delta compressor[6][7]

With other forms of compression, more information generally gives better compression. However, experiments by Factor, Sheinwald, and Yassour 2001 seem to indicate that, when using data differencing with LZ77-like compression, trying to use a large collection of shared files that resemble the target file often gives worse compression than using a single reference file from that collection.[8]

pre-loaded dictionary


When compressing very large files, it usually doesn't make much difference exactly what the dictionary is pre-loaded to, so many systems start with an empty dictionary.

However, pre-loading the dictionary can produce significantly smaller files in two cases:

  • when compressing lots of short strings, or
  • when data differencing -- compressing some big file that is almost the same as some file the receiver already has.

Some implementations of data compression algorithms already have support for pre-loading the dictionary.

  • The popular zlib compression library supports deflateSetDictionary() and inflateSetDictionary() functions for pre-loading the dictionary.[9]

One possible approach to data differencing involves tweaking some compression algorithm to "pre-load" its internal data structures (it's "window" or "dictionary") with data from the files we already have, to allow the compression algorithm to use that data to (hopefully) give better compression on the next file.[10][11]

For example, the IPSW algorithm (the in-place sliding window of Shapira and Storer) initializes the source sliding window of a LZ77-like decompressor to some shared file S,[8] allowing common substrings of the target file T to be copied out of S, as well as the (like other LZ77-like compressors) allowing repeated substrings to be copied from previously-decompressed parts of the target file T, or if all else fails, from literal values in the compressed file.

Sometimes such "pre-loading" implementations use unmodified "small-window compression" algorithm, as in US Patent RE41152 2010. Many early compression algorithms pre-loaded the dictionary with a "static dictionary" (Huffword, PalmDoc, etc.). The LZW algorithm gives better compression than the very similar LZ78 algorithm. One way of thinking about LZW is to imagine that the 256 literal byte values are not a separate "special case", but are, in effect, pre-loaded into the dictionary; while the LZ78 algorithm, in effect, starts with an empty dictionary, and so gives worse compression.

With many compression algorithms, such pre-loading is limited by a "window" or "history limit" -- when the window or limit is, say, 32 kByte, it doesn't matter what or how much data you try to pre-load, only the last 32 kByte is going to make any difference.

This leads some people working on data differencing to use a window or history limit much larger than other compression researchers.

window size


The LZ77-style decompressors have a fixed-size "window" of recently-decompressed text, and the "copy references" can only refer to text inside that window.

Many compression algorithms, in theory, have no inherent "history limit" -- such as LZ78-style algorithms and adaptive Huffman algorithms. However, in practice, most implementations of compression algorithms periodically reset their dictionary and start fresh. So they have a block size the length of the maximum text between dictionary resets.

With many early compression algorithms, an easy way to improve performance is to increase the window size. (i.e., either literally increase the "window" buffer for LZ77 algorithms, or simply reset the dictionary less often for LZ78-style algorithms). These early compression algorithms were typically developed on machines that had far less RAM than modern machines, and so the "window size", the "reset frequency", and "internal dictionary size" were constrained by that limit. (Those early machines also ran much slower than modern machines, and many algorithms -- such as the LZRW series of algorithms -- were heavily constrained by these speed limits; it is unclear what effect this had on dictionary size).

For simplicity, we will consider the effect of doubling the window size on a variety of algorithms. The original "small window" algorithm has some fixed window size W, and the widened "larger window" algorithm has some fixed window size 2*W, which can be thought of as the "near window" with all the same stuff in it as used by the small-window algorithm, and the "far window" that has stuff that is inaccessible to the small-window algorithm, but perhaps the larger-window algorithm can exploit that stuff to get better compression.

Alas, there are diminishing returns to increasing the size of the window. In fact, almost always there is some data-dependent "optimum" window size. Increasing the size of the window beyond the optimum leads to worse compression (larger compressed files).

There are 4 reasons that windows larger than that optimum size are counter-productive:

1. Some LZ77-style algorithms use a fixed-length linear offset. Doubling the size of the window necessarily increases the length of each and every copy item by 1 bit. With typical 16-bit copy items, that makes the compressed file longer (worse compression) by a factor of 17/16, unless the compressor can find strings in the far window with matches that are longer than any strings in the near window.

2. Other LZ77-style algorithms use variable-length offsets. Typically more distant offsets are longer. Doubling the size of the window generally leaves the length of "extremely close" copy items exactly the same, increases the size of a few copy items near the outer limit of the near window by 1 bit, and requires even larger copy items to refer to stuff in the far window. So there's hardly any penalty for increasing the window size for this (2) category compared to the (1) category. However, the wins are not quite as good. In many cases, even when there is an excellent long match in the far window, longer than any match in the near window, it doesn't make the compressed file any smaller -- the long copy item needed to refer to that distant matching string may be the same length or even longer than two short copy items that can re-assemble that same string from two places in the near window.

3. As Jeff J. Kato et al point out,[12] It can even be counter-productive to have data from one kind of file in the window/dictionary when trying to compress data with somewhat different characteristics, or when trying to compress a file whose characteristics change ("evolve") from one part to the next. Often it's better to reset the dictionary and start from scratch.

However, sometimes these "large window" compressors have great gains -- in particular, when we're currently trying to compress a file that is not merely the same kind of file, but is bit-for-bit identical to some early file. Some large-window techniques also work well when a file is *mostly* identical with relatively few edits here and there.

The default window size or block size for some compression software is:

  • gzip: 32 kByte sliding window
  • bzip2: blocks of 900 kB
  • DEFLATE (as used in ".zip", ".jar", ".png", etc.): 32 kByte sliding window
  • The GIF standard mentions two kinds of LZW encoders:
    • some GIF encoder implementations reset the dictionary every time it gets full --

i.e., the compressor sends the "clear code" to reset the dictionary after every (max dictionary size - number of hard-wired entries) = 2048-(256+2) compressed symbols.

    • All GIF decoders are required to support "deferred clear". Many images can be stored in a smaller .gif compressed file if the compressor uses such a "deferred clear".[13][14][15]

). The GIF LZW encoder becomes non-adaptive after the dictionary is full, while it is "waiting" for the next clear code.

  1. w:data differencing
  2. a b
  3. Dutch T. Meyer and William J. Bolosky. "A Study of Practical Deduplication".
  4. Wikipedia: data deduplication
  6. a b "By using suffix sorting (specifically, Larsson and Sadakane's qsufsort) and taking advantage of how executable files change, bsdiff routinely produces binary patches 50-80% smaller than those produced by Xdelta, and 15% smaller than those produced by .RTPatch ... A far more sophisticated algorithm, which typically provides roughly 20% smaller patches, is described in my doctoral thesis." -- Colin Percival. "Binary diff/patch utility"
  7. Colin Percival. "Matching with Mismatches and Assorted Applications". thesis. 2006.
  8. a b Dana Shapira and James A. Storer. "In Place Differential File Compression". 2005.
  10. "Can compression algorithm “learn” on set of files and compress them better?"
  11. comp.compression: "Employing a set of data files guaranteed to be available at the time of decompression"
  12. Jeff J. Kato, David W. Ruska, David J. Van Maren. US Patent 4847619 "Performance-based reset of data compression dictionary". 1987.
  13. Attila Tarpai. "Deferred clear code in GIF". 2010.
  14. "Cover Sheet for the GIF89a Specification: Deferred clear code in LZW compression"
  15. GIF file format: Portability Warning 2: The Deferred Clear Code Problem