Data Compression/Dictionary compression

Dictionary CompressionEdit

Remember the wealth of information a soldier gets from a mike click? With certain types of data, it makes sense that if you could encode the data using information that is not stored in the actual datafile you are compressing, you could significantly reduce known forms of storage and the requirements for communication. Some algorithms replicate process by including an internal data table, often referred to as dictionary, hence the term dictionary compression.

Note:
The word dictionary in this case does not relate to the programming structure that holds the data table. Some authors and implementors may refer to this type of concept as library (and then library compression), in that case library does not refer to the code implementation file.
This can become confusing since the algorithm relies on an internal data storage container that references the information that may in really not be what programmers define as a dictionary, also adding to it is the fact most compression algorithms tend to be implemented into as programming object libraries files, so the algorithm can easily be reusable and updated.

Due its intrinsic characteristic of permitting gains when used in communications, as the example of the mike click demonstrates, there is a pre-understood code system agreed and shared between the emitter and the receptor. It is not without a cause that most communication network oriented compression algorithms rely in some form of dictionary compression.

Clipboard

To do:
This should be extended a bit, keeping it inside the scope by inclusion of some mentions of communication protocols V.42bis using Lempel Ziv dynamic dictionary compression, V.44 uses LZJH (Lempel-Ziv-Jeff-Heath) adaptive data compression.

An example is a simple dictionary built into the compression algorithm scheme. If you already know the most compact storage scheme for a specific type of data, then you can simply substitute the more compact state for the actual data. Consider the word "supercalifragiliciousexpialadocious" Now suppose that in your dictionary there are only 11 words starting with super, and this is the eleventh. Well its obvious that Super11 would be unique enough that given the same dictionary at the time of decompression, you could recover the full word with no data loss. With A and An, however it is not quite so obvious that you could store information about the word in less space than everything but the one letter word.

Now consider that you have the Internet, and you want to store an article by LaVogue. You could store the file on your own computer, or, alternately you could store a link to the file similar to the hypertext links you have on your homepage. Is it compression? Maybe not, but you can recover the original text with a slight delay while it downloads.

The mind boggles at the efficiency of a compression system that contains all knowledge in the world on a single server-farm, and uses it to reduce the storage of information on the client computer. Even more fantastic would be the legal snarl such a system would create and the privacy concerns and financial costs associated with recovery of compressed information that was proprietary in the first place. But a net-aware compression server, might be able to achieve truly radical compression rates. Essentially all that the home file would have to contain was the key sequence to trigger the decompression of a specific data cloud. The home user could even compress his compression files because multiple files each containing a link to a specific compression file structure, could be themselves turned into a single file, that contained the compression file structure of all the composite files. Infinite compression at the user level, but truly massive storage at the server level.

non-adaptive dictionary compressionEdit

Text compression using 4 bit codingEdit

Pike's 1981 compression algorithm, like several early text compression algorithms, used a more or less literal dictionary of English words.

Pike's algorithm used a fixed dictionary composed of 205 of the most popular English words, and the ASCII characters. By using a variable-length coding of 4, 8, or 12 bits to represent these words and the remaining ASCII characters, and some clever handling of uppercase letters and the space between words, Pike's algorithm achieved a compression ratio of 3.87 bits per character for typical English text. [1]

Adding more English words to the dictionary helps compression, but with diminishing returns. Adding certain non-words to the dictionary -- common letter pairs and triplets, common prefixes, common suffixes, etc. -- helps compression, since they can be used to at least partially compress words that aren't in the dictionary, but adding such non-words also gives diminishing returns.

Later compression algorithms kept the idea of a short codeword representing something longer, but generalized it: a dictionary coder expands codewords to a sequence of bytes, a sequence that may or may not line up on English word boundaries.

SmazEdit

Smaz is a simple compression library suitable for compressing very short strings, such as URLs and single English sentences and short HTML files.[2]

Much like Pike's algorithm, Smaz has a hard-wired constant built-in codebook of 254 common English words, word fragments, bigrams, and the lowercase letters (except j, k, q). The inner loop of the Smaz decoder is very simple:

  • Fetch the next byte X from the compressed file.
    • Is X == 254? Single byte literal: fetch the next byte L, and pass it straight through to the decoded text.
    • Is X == 255? Literal string: fetch the next byte L, then pass the following L+1 bytes straight through to the decoded text.
    • Any other value of X: lookup the X'th "word" in the codebook (that "word" can be from 1 to 5 letters), and copy that word to the decoded text.
  • Repeat until there are no more compressed bytes left in the compressed file.

Because the codebook is constant, the Smaz decoder is unable to "learn" new words and compress them, no matter how often they appear in the original text.

Adaptive dictionary algorithmsEdit

Most dictionary algorithms are adaptive: rather than a fixed, hard-wired dictionary like Pike's algorithm, as the decoder compresses the text, when it encounters letter triplets or entire words that are not yet in the dictionary, the decoder adds them to the dictionary.

If there are too many rare triplets, syllables, or words in the dictionary, it increases the codeword size of the more common ones, hurting compression ratio. If some triplet, syllable, or word is left out of the dictionary, then when it does occur, it cannot be represented by a simple dictionary reference and must be spelled out using more bits, also hurting compression ratio. This balancing act makes it difficult to build an optimal dictionary.[3]

The hard-wired dictionary used to initialize the data compression algorithm has a big effect on the compression ratio in the early phase of the compression. An adaptive dictionary algorithm eventually adds to the dictionary all the syllables and common words used in that specific text, and so the initial hard-wired dictionary has little effect on the compression ratio in later phases of the compression of large files.[3]

LZ77 algorithmsEdit

LZ77 type decoders expand the compressed file by expanding "copy items" in that compressed file into larger strings. This enables "duplicate string elimination": only the first copy of a string is stored in the compressed file. All other copies refer to a previous copy.

These decoders have the general algorithm:

  • Is the next item a literal? Then copy it directly from the compressed file to the end of the decoded text.
  • Is the next item a copy item? Then break it into a "length" and a "distance". Find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

While all LZ77 style algorithms have this same decoder structure, there is a huge variety of ways that decoders decide whether an item is a literal or a copy items, and also a variety of ways that they extract the "length" and the "distance" from the copy item.

Many decoder algorithms combine LZ77 with other compression ideas. As of 2008, the most popular LZ77 based compression method is called DEFLATE; it combines LZ77 with Huffman. We will discuss it later.

LZSS and LZRW1-AEdit

Lempel-Ziv-Storer-Szymanski (LZSS) was created in 1982 by James Storer and Thomas Szymanski. LZRW1-A was published in 1991 by Ross Williams.

The LZSS and LZRW1-A decompressors have a very similar form:

For each copy item, fetch a "literal/copy" bit from the compressed file.

  • 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
  • 1: copy item: the decoder grabs the next 2 bytes from the compressed file, breaks it into a 4 bit "length" and a 12 bit "distance". The 4 "length" bits are decoded into a length from 3 to 18 characters. Then find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

LZJBEdit

The LZJB[4] decoder fetches a "literal/copy" bit from the compressed file.

  • 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
  • 1: copy item: the decoder grabs the next 2 bytes from the compressed file, breaks it into a 6 bit "length" and a 10 bit "distance". The 6 "length" bits are decoded into a length from 3 to 66 characters. Then find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

Note:
At first look, a 2-byte "copy" in LZRW1A and LZJB algorithms may seem to not save any space, because the 16 bits to indicate "length, distance" are the same size as 2 literal bytes. However, 2 literal bytes actually require 18 bits including the "literal/copy" bits, while a 2 byte copy requires 17 bits. Even so, the 2 byte copy was either overlooked or not considered worth the effort by the authors of those algorithms.

After pulling a 1 bit from the control word, a copy item is in only one format:

  • LLLLLLdd dddddddd : copy L+3 bytes from d bytes before the most-recently-decoded byte.

LZ77Edit

LZ77[5] The original LZ77 decoder fetches a "distance/length/literal" block from the compressed file.

The LZ77 decoder algorithm is:

  • Find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text. (Sometimes the "length" is 0, so this "copy" does nothing).
  • Then copy the literal character from the compressed file to the end of the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

BARFEdit

The "BARF" archiver[6] (when it's not cheating) uses a simple byte-oriented LZ77-like code. The inner loop is:

  • Fetch a byte X.
    • X in the range 0...31? Literal: copy the next X literal bytes from the compressed stream to the end of the decoded text.
    • X in the range 32...255? Copy item: copy the two bytes from X-32 places back in the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

LZFEdit

LZF is a fast compression algorithm that takes very little code space and working memory (assuming we have access to the most recent 8 kByte of decoded text).

LibLZF by Marc Lehmann is designed to be a very small, very fast, very portable data compression library for the LZF compression algorithm.

Clipboard

To do:
Is LZF also developed by Marc Lehmann ?.

[3] LibLZF source code [4] [5] [6] [7] [8] [9] [10] [11] [12]

LZF is used in TuxOnIce and many other applications.

The LZF decoder inner loop is: [13] [14] [15] For each copy item, the LZF decoder first fetches a byte X from the compressed file, and breaks the byte X into a 3 bit length and a 5 bit high_distance. The decoder has 3 cases: length==0, length==7, and any other length.

    • length == 0? (i.e., X in the range 0...31?) Literal: copy the next X+1 literal bytes from the compressed stream and pass them straight through to the current location in the decompressed stream.
    • length in 1...6 ? short copy: Use that value as length (later interpreted as a length of 3 to 8 bytes).
    • length == 7? long copy: fetch a new length byte and add 7. The new byte could have any value 0 to 255, so the sum is 7 to 262 (which is later interpreted as 9 to 264 bytes).
    • Either kind of copy: Fetch a low_distance byte, and combine it with the 5 high_distance bits to get a 13-bit distance. (This implies a 2^13 = 8KByte window).
    • Either kind of copy: Find the text that starts that "distance" back from the current end of decoded text, and copy "length+2" characters from that previously-decoded text to end of the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

Each LZF "item" is one of these 3 formats:

  • 000LLLLL <L+1>  ; literal reference
  • LLLddddd dddddddd  ; copy L+2 from d bytes before most recently decoded byte
  • 111ddddd LLLLLLLL dddddddd  ; copy L+2+7 from d bytes before most recently decoded byte

FastLZEdit

FastLZ by Ariya Hidayat is a free, open-source, portable real-time compression library for the FastLZ algorithm. (Ariya Hidayat apparently developed the FastLZ algorithm based on the LZF algorithm?) FastLZ is a drop-in replacement for Marc Lehmann's LibLZF. It has pretty much the same compressed size and compression time as LibLZF, and significantly faster decompression -- about 2/3 the time of LibLZF. [7][8]

The FastLZ decoder inner loop is very similar to the LZF inner loop:

  • Fetch a byte X. Divide the byte X into the 3 bit length and a 5 bit high_distance.
    • length == 0? (i.e., X in the range 0...31?) Literal: copy the next X+1 literal bytes from the compressed stream to the end of the decoded text.
    • length in 1...6 ? Copy: Use that value as length (later interpreted as a length of 3 to 8 bytes).
    • length == 7? Copy: fetch a new length byte and add 7. The new byte could have any value 0 to 254, so the sum is 7 to 261 (which is later interpreted as 9 to 263 bytes).
    • (When the new byte is all-ones (255), it serves as an escape to even longer lengths ...)
    • Fetch a low_distance byte, and combine it with the 5 high_distance bits to get a 13-bit distance. (This implies a 2^13 = 8KByte window).
    • If the distance is all-ones (0x1FFF): fetch 2 new distance bytes to give a full 16 bit distance, and add it to the 0x1FFF all-ones distance. (This implies a 2^16 + 2^13 = 64 KByte + 8 KByte window)
    • find the text that starts that "distance" back from the current end of decoded text, and copy "length+2" characters from that previously-decoded text to end of the decoded text.
  • Repeat from the beginning until there is no more items in the compressed file.

If a compressed file only uses distances within the 2^13-1 window, and lengths less than 2^8+7, both the FastLZ and the LZF decoders will produce the same decompressed plaintext.

Each FastLZ "item" is one of these 5 formats (the LZF formats, plus a few extensions):

  • 000LLLLL <L+1>  ; literal reference
  • LLLddddd dddddddd  ; copy L+2 bytes from d bytes before most recently decoded byte
  • 111ddddd LLLLLLLL dddddddd  ; copy L+2+7 bytes from d bytes before most recently decoded byte
  • LLL11111 11111111 dddddddd dddddddd ; copy L+2 bytes from d+0x1FFF bytes before most recently decoded byte
  • 11111111 LLLLLLLL 11111111 dddddddd dddddddd ; copy L+2+7 bytes from d+0x1FFF bytes before most recently decoded byte

It's not clear why FastLZ is so much faster than LZF. It's "extensions" are apparently not being used, since the compressed file sizes on the benchmark files are not significantly different. If the same implementation techniques -- 16-bit copies whenever possible, special-case for "run" (pseudo-RLE), etc. -- were applied to LZF, would it be just as fast? Or is it the case that these "extensions" actually are being used a lot, and somehow they are faster (but roughly the same size) as the LZF approach?

miniLZOEdit

Clipboard

To do:
Complete miniLZO.

LZO is a compressed data format and portable lossless data compression library written in ANSI C developed and maintained by Markus F.X.J. Oberhumer.[9] LZO is slightly slower and produces somewhat better compressed file sizes than LZF. LZO is much faster than DEFLATE/gzip -- it runs in about half the time -- and produces somewhat worst file sizes than DEFLATE/gzip. [16]

miniLZO is a lightweight subset of the LZO library.

LZSEdit

LZS: Lempel–Ziv–Stac compression. LZS compression is specified for various Internet protocols, including RFC 1967, RFC 1974, RFC 2395, and RFC 3943.[10] Stac Electronics sued Microsoft for infringement of LZS patents. Since then, several of those patents have expired.

The LZS decoder fetches a "literal/copy" bit from the compressed file.

  • 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
  • 1: copy item: the decoder grabs the next bit from the compressed file.
    • That bit selects how many bits of "distance" to pull from the file:
      • 1: the decoder grabs a 7 bit "distance".
        • If this distance is the nonsensical value "0", it really represents the end-of-message. Exit the loop.
      • 0: the decoder grabs a 11 bit "distance" (this implies a 2 kibiByte sliding window).
    • Then the decoder grabs a "length" value from the compressed file.
      • However, the length is not encoded in a fixed-length field, but in a variable-length pattern:
      • 00 represents length 2
      • 01 represents length 3
      • 10 represents length 4
      • 1100 represents length 5
      • 1101 represents length 6
      • 1110 represents length 7
      • 1111 0000 represents length 8
      • 1111 0001 represents length 9
      • 1111 1111 0010 represents length 25
      • In general, except for the first 6 special cases listed above, "1111" repeated N times followed by 4 more bits that represent a value Y in the range 0...14, are decoded into a length value (N*15 + Y - 7).
    • Then find the text that starts that "distance" back from the current end of decoded text, and copy "length" characters from that previously-decoded text to end of the decoded text.
  • Repeat from the beginning, until the special "zero distance" value was decoded.

The complex variable-length encoding technique used in LZS allows "recent" byte pairs to be compressed into 11 bits, and even "less recent" byte pairs, as long as they are in the 2 kibiByte sliding window, to be compressed in 15 bits. This variable-length encoding technique is one of several kinds of "universal code", one of many kinds of "fixed Huffman encoding" tables, which we discuss in the Entropy chapter of this Wikibook.

In text files, repeated pairs are far more common than longer repeated strings. LZS gives at least 1 bit of compression on this common form of redundancy. This gives LZS an advantage over compression algorithms which are unable to compress repeated strings shorter than 3 characters, and so must store such short repeats as-is, or even expand them by a bit.

SnappyEdit

The "Snappy" algorithm is tuned for very high speeds (even at the cost of mediocre compression). On benchmark files, Snappy is an order of magnitude faster than zlib but compressed files 20% to 100% bigger than zlib.[11] Snappy is faster than LZF.[12]

The inner loop of the "Snappy" decompressor decodes "elements" that always begin with a tag byte in one of these formats:

  • xxxxxx00 ; literal follows
  • xxxxxx01 ; copy with 1-byte offset
  • xxxxxx10 ; copy with 2-byte offset
  • xxxxxx11 ; copy with 4-byte offset

Details:

  • LLLLLL00 <L+1> ; where L in 0..59: literal reference for 1...60 (inclusive) literal bytes.
  • LLLLLL00 <L-59> <length> ; where L in 60..63: these special "lengths" indicate 1, 2, 3, or 4 bytes, respectively, of actual length bytes follow the tag byte. Those length bytes are followed by the literal itself.
  • dddLLL01 dddddddd ; copy with 1-byte offset: copy L+4 bytes from d bytes before the most recently decoded byte. (Why L+4, giving lengths 4..11 ? Would L+3, giving lengths 3..10, give better compression? Perhaps even L+2 allowing us to copy the occasional byte pair?)
  • LLLLLL10 dddddddd dddddddd ; copy with 2-byte offset: copy L+1 bytes from d bytes before the most recently decoded byte.
  • LLLLLL11 dddddddd dddddddd dddddddd dddddddd ; copy with 4-byte offset: copy L+1 bytes from d bytes before the most recently decoded byte.

PalmDocEdit

PalmDoc combines LZ77 with a simple kind of byte pair compression. PalmDoc Algorithm: [13]

PalmDoc files are decoded as follows:

  • Read a byte from the compressed stream. If the byte is
    • 0x00: "1 literal" copy that byte unmodified to the decompressed stream.
    • 0x09 to 0x7f: "1 literal" copy that byte unmodified to the decompressed stream.
    • 0x01 to 0x08: "literals": the byte is interpreted as a count from 1 to 8, and that many literals are copied unmodified from the compressed stream to the decompressed stream.
    • 0x80 to 0xbf: "length, distance" pair: the 2 leftmost bits of this byte ('10') are discarded, and the following 6 bits are combined with the 8 bits of the next byte to make a 14 bit "distance, length" item. Those 14 bits are broken into 11 bits of distance backwards from the current location in the uncompressed text, and 3 bits of length to copy from that point (copying n+3 bytes, 3 to 10 bytes).
    • 0xc0 to 0xff: "byte pair": this byte is decoded into 2 characters: a space character, and a letter formed from this byte XORed with 0x80.
  • Repeat from the beginning until there is no more bytes in the compressed file.

dictionary algorithmsEdit

fixed byte pair encodingEdit

The simplest dictionary algorithm is byte pair encoding with a fixed dictionary of byte pairs (a bigram dictionary).

Simple byte-pair encoding, also called dual-tile encoding or DTE, is often used to compress text in video game ROMs.[14]

A byte pair decoder assumes that English text uses only a relatively small "alphabet" of letters and symbols. A byte pair decoder assumes that certain bytes never occur in English language text (the "high" bytes 0x80 to 0xFF and a few other bytes). When the "impossible" bytes occur in the compressed text, the decoder looks that byte up in a fixed dictionary, and emits the corresponding pair of bytes. Otherwise, the decoder copies the compressed byte to the plaintext more or less as-is. Often the 0x00 byte marks the end of the compressed string, so standard C string-handling routines can be re-used to handle compressed text. The dictionary should include all the most-frequent letter pairs used in the plaintext ("th", "he", " t", etc.). There are several different implementations of this algorithm, each one using a different (incompatible) dictionary.[15][16]

Clipboard

To do:
... should I say something about "adaptive coding" here? ...


byte pair encodingEdit

DTE can be decoded in a single pass -- each byte of compressed text is either a "normal" printable character that stands for itself, or else it is a "unprintable" byte that stands for exactly two normal printable characters. Recursive byte pair encoding loosens that restriction: With recursive byte pair encoding, a "unprintable" byte stands for two bytes. Sometimes both of those two bytes are normal printable characters, just like DTE. Sometimes one or both bytes may be (some other) "unprintable" byte, requiring repeated decoding passes until only printable characters remain.[17]


[18] [19]

SCZ byte pair encodingEdit

SCZ is a dynamic byte pair algorithm and compressed format developed by Carl Kindman, who released one implementation and its test suite under a LGPL license.[20]

The SCZ compressed format is a series of segments. Each segment has a few bytes of header and footer, including a iteration count and a checksum, surrounding a repeatedly-compressed block.

The inner loop of the SCZ decoder is given a dictionary of 256 entries, and a special escape symbol. The inner loop scans through a block of "compressed segment data" and generates a new uncompressed block something like this:

  • Read the next byte from the compressed block.
  • If the byte read is the special escape byte, discard the escape byte and read the following compressed byte as a literal byte, and write that literal byte to the decompressed block.
  • Otherwise look up that byte in the dictionary.
    • If the byte is a "marker character", the dictionary has a byte pair to replace it with -- write those 2 bytes to the decompressed block.
    • Otherwise the dictionary indicates the byte is a literal byte -- write the byte read to the decompressed block.
  • Repeat until no more data in the compressed block.

The "escape byte" makes it possible to compress a file that has a byte pair repeated many times, even when the file already uses all 256 possible byte values -- the compressor deliberately expands some rare byte into two bytes (the escape byte and the rare byte), in order to re-use that rare byte as a "marker byte" to represent that byte pair (compressing each instance of those two bytes into one byte).

The middle loop of the SCZ decoder is given an iteration count and a block of data. The middle loop and generates the corresponding plaintext something like this:

  • If the iteration count is zero, the rest of the block is fully decompressed data -- done!
  • Read the header from the compressed block
    • read a byte from the compressed block; this byte becomes the new escape symbol.
    • Read the symbol count N, and read N dictionary entries into a dictionary. (Each entry is 3 bytes: the "marker" byte as stored in the compressed text, and the pair of bytes it represents in the uncompressed text.
  • Pass the escape symbol, the dictionary, and the rest of the compressed block (the compressed segment data) to the inner loop, decompressing it to an uncompressed block.
  • swap the (now-empty) compressed block with the uncompressed block.
  • Decrement the iteration count and repeat from the beginning.

The outer loop of the SCZ decoder breaks the file up into a series of segments, breaks each segment up into a header (which contains the iteration count, among other things), a block of compressed data (which the outer loop passes to the inner loop for decoding), and a footer (which contains a checksum used to detect problems).

Note:
A common misconception is that data compression algorithms can compress practically any block of data. That leads to the common misconception that repeated applications of a compression algorithm will keep shrinking the data further and further.

The vast majority of compression algorithms squeeze as much as they can in a single iteration. Applying a second iteration (attempting to produce a "double-compressed" file) almost invariably produces a file that is larger than the first compressed file; further iterations only make it worse (larger). SCZ is one of the few compression algorithms that takes multiple iterations. But even a SCZ compressor eventually reaches minimum size; further iterations only make the "compressed" file worse (larger).

ISSDC: digram codingEdit

Clipboard

To do:
fill in details

LZ78 algorithmsEdit

LZ78

LZWEdit

A well-known example of the dictionary based technique is the LZW data compression, since it operates by replacing strings of essentially unlimited length with codes that usually range in size from 9 to 16 bits.

The LZW algorithm, as used in the GIF file format, is perhaps the most famous and controversial compression algorithm.

While decoding a codeword to a "word" of 1 or more letters and emitting those letters to the plaintext stream, a LZW decompressor adds a new "word" to the dictionary. The new word is the concatenation of the previously-decoded word and the first letter of the word represented by the current codeword.

Larger codewords allow larger dictionaries, which typically improves compression.[21]

[22]

GIFEdit

The GIF file format typically starts with 9 bit codewords and gradually increases the width of the codeword, as necessary, up to a maximum codeword width of 12 bits. Many GIF implementations statically allocate a dictionary with 2^12 = 4096 dictionary entries, one for each possible codeword.

LZMWEdit

LZMW (1985) by V. Miller, M. Wegman is a modification of LZW.

While decoding a codeword to a "word" of 1 or more letters and emitting those letters to the plaintext stream, a LZMW decompressor adds a new "word" to the dictionary. The new word is the concatenation of the previously-decoded word and the entire word (possibly the same word) represented by the current codeword. This allows dictionary entries to grow in length much more rapidly than LZW.

LZAPEdit

LZAP (1988) by James Storer) is a modification of LZMW.

While decoding a codeword to a "word" of 1 or more letters and emitting those letters to the plaintext stream, a LZAP decompressor adds at least 1 and often more new "words" to the dictionary. The new words are the concatenation of the previous match with every initial substring of the current match. ("AP" refers to "all prefixes"). For example, if the previous match is "com" and the current match is "press", then the LZAP decompressor adds 5 new "words" to the dictionary: "comp", "compr", "compre", "compres", and "compress", where LZW would only add "comp", and LZMW would only add "compress".

This allows the number of words in the dictionary to grow much more rapidly than LZW or LZMW.

LZWL and word-based LZWEdit

In 2005, Jan Lánský introduced syllable-based compression. .[3]

The LZWL compressor uses a syllable detector to divide the plain text into syllables. [23][24] When the LZWL decompressor pulls a "phrase index" from the text, it adds a new phrase to the dictionary -- formed from the entire previous phrase concatenated with the first syllable of the current phrase.

A similar "word-based LZW" divides the plain text into English words (in general, alternate maximal strings of alphanumeric characters and non-alphanumeric characters).[25]

Character-based compression seems to give better compression with smaller files; word-based compression seems to give better compression with larger files. Some researchers suggest that syllable-based compression methods might give better compression with intermediate-sized files -- such as HTML pages.[3]

Clipboard

To do:
Does the decompressor need to use the syllable detector?
Is there any limit on the number of plaintext bytes in a syllable?

LZW implementation tipsEdit

The dictionary used by the decompressor doesn't really need to store a literal string for each dictionary entry. Since each dictionary entry is composed of some previous dictionary entry concatenated with a single character, all the decompressor needs to store for each entry is an integer and a character: the index of some previous dictionary entry, and a literal suffix character. [21]

This same compact dictionary also works with LZAP. With LZMW, rather than each entry storing a single suffix character, it will instead store a second integer: the index to some, possibly the same, previous dictionary entry.

It's usually simpler and faster for the decoder to special-case decode any codeword with a value less than 256, and immediately output the literal byte it represents, rather than go through the motions of looking up such a codeword in the dictionary. [21]

Some LZW compressors use a hash table to rapidly map the next few plain text characters to a dictionary entry. This works about the same as the hash table used to speed up LZ77-style compressors.

Some implementations of LZW and other LZ78 variants use a special search tree that takes advantage of the dictionary structure.[26]

LZXEdit

(We discuss LZX in a later chapter, Data Compression/Order/Entropy#LZX)

Reduced offset Limpel Ziv (ROLZ)Edit

The above LZ77 variations of LZ77 encode a linear, uncoded, fixed-length offset. The possible values of that offset at any particular instant of decoding often include many values that will *never* be used. For example, once the decompressor pulls a copy length of 4, it may see dozens of copies of "the " and "and " in the text, but it knows that the compressor will always pick the most recent one -- all the values of the offset that select some other copy of "the " or "and " will never occur.

If we could somehow "skip over" those impossible offsets, we could encode the offset in fewer bits, and still be able to use the same (offset,length) phrase that the above LZ77 encoders would use.

Also, many possible values of the offset are highly unlikely to occur. For example, once the decoder finishes emitting a phrase that ends in "ch", in normal English text the next character is almost never another consonant. So all the offsets that point to a phrase starting with some consonant will almost never be used. By skipping over those highly unlikely offsets, we could encode the offset of the next phrase in fewer bits, and almost always be able to copy the same (offset, length) phrase that the above LZ77 encoders would use. (In some rare cases, the above LZ77 encoder picks some phrase that a ROLZ algorithm deems "unlikely" and skips over, forcing us to copy some other, shorter phrase; but hopefully sacrificing a few bits in those situations is paid back in saving a few bits in almost every other situation).

By "reduced offset", we mean that we have reduced the number of bits in the compressed file used to select a particular offset, when compared to using a linear, uncoded, fixed-length offset. We do *not* mean that the maximum possible offset is reduced -- in fact, most ROLZ compressors can copy a phrase from a much more distant offset than the typical 2^10 or 2^12 byte limit of linear, uncoded offsets.

Using something other than a linear, uncoded, fixed-length offset necessarily requires more computation at decode-time to translate that compressed selection value to the actual position in the decompressed-so-far text. This is yet another tradeoff between speed and compression ratio.

Ross Williams points out that all LZ77 and LZ78 algorithms can be cast into the form of Markov algorithms.[27]

Malcolm Taylor wrote one (the first?) implementation of ROLZ as part of the WinRK compression suite. Ilia Muraviev wrote the first open-source implementation of ROLZ compression.[28]

LZRW3-a Of all the compressors in the LZRW series of "fast" data compressors, the LZRW3-a(8) gives the best compression on the large text benchmark. Unlike most other ROLZ compressors, it completely ignores the most-recently-decoded byte of plaintext context. The inner loop of the LZRW3-a decoder first fetches a "literal/copy" bit from the compressed file.

  • 0: literal: the decoder grabs the next byte from the compressed file, which represents itself. Pass it straight through to the decompressed text. Every literal byte is "interesting", so a pointer to this byte is (eventually) added to the "table of 2^12 interesting locations in the previously-decoded text". Hopefully next time this byte can be copied as part of a long phrase via that pointer, rather than laboriously spelled out with literals again.
  • 1: copy item: the decoder grabs the next 2 bytes from the compressed file, which represents a phrase. The decoder breaks it into a 4 bit "length" and a 12 bit selector.
  • The 4 "length" bits are decoded into a length from 3 to 18 characters.
    • It uses that 12 bit selector to pick a location from the "table of 2^12 interesting locations in the previously-decoded text", and copies length characters from that location to the end of the decoded text. Also: add to the table a pointer to the beginning of this phrase.
  • Repeat from the beginning until there is no more items in the compressed file.

LZRW3-a(8) tries to (approximately) preserve the 8 most-recent instances of every 3-byte phrase that occurs in the plaintext. (The specific details of which partition the decompressor adds the pointer to, and the pseudo-random replacement used to discard old pointers when a partition is full, are beyond the scope of this document). (In effect, the decompressor expands a 12 bit code to a 20 bit uncoded offset into the 1 MB plaintext buffer). (too much detail)


LZRW4 (Ross Williams, 1991) can be seen as a kind of ROLZ algorithm. Rather than a linear, uncoded offset for each phrase, the offset is represented by a 5 bit value. At the start of each phrase, the LZRW4 decoder hashes the two previous characters to obtain a partition number. The 5 bit value selects one of the 32 pointers in the context-selected partition. That points to some previously-decoded location in the 1 MB plaintext buffer. (In effect, the decompressor has expanded the 5 coded bits to a 20 bit uncoded offset). Then the decompressor pulls 3 bits from the compressed text and decodes that into 8 possible length values (2,3,4,5,6,7,8,16), then copies the selected (offset,length) phrase to the plaintext. Then the context partitions are updated. Because a "copy item" is only 9 bits (flag bit + offset selector + length), it does a good job compressing common byte pairs. Literal items are also 9 bits (flag bit + literal octet). (To increase speed, Williams doesn't store a pointer to *every* time a 2-byte context occurs in the partition table, but only the 2 byte contexts that occur at the *end* of some phrase). (To reduce memory use, Williams doesn't keep track of the *exact* 2-byte context using 2^16 partitions, but instead hashes the 2 previous bytes of context into a 12 bite context with 2^12 partitions). (too much detail ...)


Perhaps the most extreme form of "reduced offset" is to reduce the offset to zero bits. This special case of ROLZ is called LZP.

LZPEdit

LZP compression was invented by Charles Bloom.[29] A LZP decompressor examines the current context (perhaps previous 3 bytes), finds the most recent occurrence of that context in the plain text (or some approximation), and uses only that one fixed offset. That gives us an extremely short compressed phrase (a flag bit + a match length). In some cases, that one fixed offset is the same one that LZ77 would have used -- in these cases, we've saved a lot of bits because we didn't store the offset. In other cases, that one fixed offset is not exactly the same one that LZ77 would have used, but a larger offset that gives us exactly the same copied phrase -- likewise, we've saved lots of bits. Inevitably, sometimes that one fixed offset gives us fewer useful bytes than the offset LZ77 would have chosen -- in the worst case, zero useful bytes. LZP requires more bits to encode those bytes than LZ77 would have done in that situation. But overall, LZP typically compresses better than LZ77.

A few implementations make sure the context table always points exactly to the most recent occurrence of every 3 byte context -- this requires updating the context table on every byte the decompressor emits. Many LZP implementations approximate this by only updating the table at the start of a phrase or literal, skipping the updates for all the bytes copied during that phrase. That speeds up decompression, and Arturo San Emeterio Campos[29] claims it actually improves the compression ratio on longer files, although it hurts the compression ratio on shorter files.

A LZP decoder works something like this:

  • Look up the current context in the context table to get the Position. If this context has *never* occurred before, then we have a literal. Otherwise, fetch a "literal/copy" bit from the compressed file.
  • (In either case, remember that Position, but update the context table to point to the current location in the decompressed text. The next time we have the same context, we get this location, rather than that less recent Position)
  • 0: literal: the decoder grabs the next byte from the compressed file and passes it straight through to the decompressed text.
  • 1: copy item: the decoder grabs a few bits from the compressed file and decodes them into a length. Starting at the Position where this context occurred before, copy "length" characters from that previously-decoded text to end of the decoded text. Then grabs the next literal byte from the compressed file and passes it straight through to the decompressed text.
  • Repeat from the beginning until there is no more items in the compressed file.


Some documents[30] describe a byte-oriented version of LZP:

  • Look up the current context in the context table to get the Position. (Remember that Position, but update the context table to point to the current location in the decompressed text. The next time we have the same context, we get this location, rather than that less recent Position).
  • Fetch the next byte from the compressed text.
  • If this context has *never* occurred before, *or* the fetched byte is not the escape byte, then we have a literal; otherwise, we have a copy item. (The compressor should have picked a value for the escape byte that occurs rarely in the plain text).
  • literal: pass it straight through to the decompressed text.
  • copy item: Discard the escape byte, and grab the length byte. Starting at the Position where this context occurred before, copy "length" characters from that previously-decoded text to end of the decoded text. Then grab the next literal byte from the compressed file and pass it straight through to the decompressed text, even if it happens to be the escape byte. (Occasionally the length will be 0, such as when the decompressed text *needs* to contain the special byte we use as the escape byte).
  • Repeat from the beginning until there are no more items in the compressed file.


Clipboard

To do:
Should this WikiBook describe LZ77 and LZP first, and then later describe ROLZ as a generalization between them? Yes - Charles came up with LZP first, and ROLZ was developed by Malcolm as a generalisation of LZP

Statistical Lempel ZivEdit

Statistical Lempel-Ziv was first proposed by Yu Fan Ho as the research topic of his Master degree in Computer Science, supervised by Dr. Sam Kwong. They first published it as an application to personal digital assistants.[31] Later Yu Fan Ho applied to it ring tones for mobile phones, where it outperformed other available lossless compressors on compression ratio and decompression speed. The application of statistical Lempel Ziv to melody data is patented.[32]

ACBEdit

Associative Coding by Buyanovsky (ACB) starts with an LZ77-like compressor, and does many things to squeeze out a lot of the redundancy in the (offset, length) items. [33]

Tunstall codeEdit

The Huffman algorithm gives a fixed-to-variable-length code that (under certain constraints) is optimal.

The Tunstall algorithm gives a variable-to-fixed-length code that (under certain constraints) is optimal.[34][35][36]

Implementation tips and tricksEdit

compressed format tips and tricksEdit

Most data compression decompression algorithms are most naturally described in terms of reading the compressed file as an undifferentiated stream of bits (a bitstream). Most programming languages do file I/O a byte at a time.

So decompression implementations typically keep a separate "bit buffer" (initially empty). When you want to read a bit, if the bit-buffer is empty, read a (byte-aligned) byte from the compressed bytestream, store the last not-yet-needed 7 bits in the bit-buffer, and use the first bit. Otherwise read the next bit from the bit-buffer.

Many file formats are designed to maintain "8 bit alignment" (some even try to maintain "16 bit alignment") wherever possible, in order to speed things up by reducing the amount of bit-shifting.

Often when decompressing, you've read some odd number of bits, and the abstract decompression algorithm mentions something like "read the next 8 bits and pass that literal byte directly to the output". There are three possible ways to design a file format to handle "read the next 8 bits":

  • "undifferentiated bitstream": get the remaining bits in the bit buffer, reload the bit buffer, and get a few more bits, then shift-and-align to re-assemble the byte.
  • Throw away the remaining "padding bits" in the bit buffer, and simply read the next (byte-aligned) byte from the compressed bytestream. Faster than the bitstream approach, but the unused bits hurt the compression ratio, and so we will speak no more of this approach.
  • "byte aligned": Temporarily ignore the bit-buffer; read the (byte-aligned) byte directly from the compressed bytestream. (Only use the bit-buffer when the decompression algorithm reads a bit or two or seven).

Compared to an undifferentiated bitstream format, a byte-aligned format for compressed text is precisely the same number of bytes long and has exactly the same bits; the bits are merely arranged in a slightly different order. So picking a byte-aligned format gives precisely the same compression ratio, and it often makes things faster.

compressor implementation tips and tricksEdit

hash tablesEdit

The obvious way to implement a LZ77-style (offset, length) compressor involves scanning the entire window for a match to the following plaintext, for each and every phrase we eventually write into the compressed file. Then picking the longest (and nearest) match.

The obvious way to implement a LZ78-style (dictionary entry) compressor involves scanning the entire dictionary for a match to the following plaintext, for each and every dictionary index we eventually write into the compressed file.

For a file of length N this is O(N) for each compressed phrase. This requires O(N^2) to compress a file of length N. (Schlemiel the Painter's algorithm). (I'm assuming the LZ77 window covers at least half of the file, and assuming a roughly constant compression ratio). This is excruciatingly slow.

Most LZ77-style and LZ78-style compressor implementations use a speedup trick: Whenever they output a compressed phrase, they do a little O(1) "extra" work updating a hash table. Then, instead of sequentially scanning the entire window for a match, they look up the following plaintext (typically the next 3 bytes) in the hash table. In a hash table, searches immediately succeed or fail in O(1), and therefore it only takes O(N) to compress the entire file. This is far faster.

Many LZ77-style compressors deliberately sacrifice the maximum possible compression in two ways: (a) Some compressors only update the hash table once per phrase, rather than once per every byte. (b) Some compressors use a direct-indexed cache -- a very simple hash table without collision detection -- the table remembers only the single most-recent item with that hash index -- rather than a standard hash table that remembers *every* item (so far) with that hash index. Typically each of (a) or (b) hurts compression by a few percent, but can roughly double or triple the speed of the compressor. Many LZ77 style compressors do both (a) and (b), hurting compression by a few percent, but getting roughly four or ten times the speed.

pseudo-RLEEdit

RLE can be emulated with overlapping copies in most LZ77-like decompressors ("pseudo-RLE mode"). The decompressor must use a method that handles such overlapping copies properly -- in particular, the decompressor must not use memcpy() or memmove() on such overlapping copies. On some machines, it's faster to simply use the same "copy one byte at a time" copy_bytes() routine for all copies. On other machines, programmers find that memcpy() is so much faster than that it reduces the total decompression time if the decompressor includes extra code that, for each copy item, decides whether it overlaps or not, and uses the slower copy_bytes() only when memcpy() won't work.

Clipboard

To do:
(Neither memcpy() nor memmove() are quite what we want here. See the comments of the "IncrementalCopy" function[17].)

ReferencesEdit

  1. J. Pike. "Text compression using a 4 bit coding scheme". 1981.
  2. https://github.com/antirez/smaz#readme
  3. a b c d Tomáš Kuthan and Jan Lánský "Genetic algorithms in syllable based text compression". 2007.
  4. LZJB source code
  5. Jacob Ziv and Abraham Lempel; A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, 23(3), pp.337-343, May 1977.
  6. "Better Archiver with Recursive Functionality (BARF)" by Matt Mahoney
  7. "replacing liblzf with FastLZ"
  8. Hadoop: "Implement FastLZCodec for fastlz/lzo algorithm"
  9. Wikipedia: Lempel–Ziv–Oberhumer
  10. http://www.ietf.org/ietf-ftp/IPR/hifn-ipr-draft-friend-tls-lzs-compression.txt
  11. Snappy in C++
  12. Snappy in pure Java
  13. PalmDoc Algorithm [1] [2]
  14. RedComet. "Dual-Tile Encoding: NES/Famicom Implementation". "Dual-Tile Encoding (commonly referred to as DTE in w:ROM hacking#Hex editing romhacking circles)"
  15. "short data compression"
  16. "Pattern-Based String Compression Function" by Frank Cox 2010
  17. "Alice’s bubbles (Part 2)"
  18. stackoverflow: optimizing byte-pair encoding
  19. Wikipedia: byte pair encoding
  20. Carl Kindman. "SCZ - Simple Compression Utilities and Library".
  21. a b c "Lempel-Ziv-Welch (LZW) Encoding: Discussion and Implementation" by Michael Dipperstein
  22. Wikipedia: Lempel–Ziv–Welch
  23. Wikipedia: LZWL
  24. Katsiaryna Chernik, Jan Lánský, and Leo Galamboš. "Syllable-based Compression for XML Documents".
  25. "Constructing Word-Based Text Compression Algorithms (1992)" by Nigel Horspool and Gordon Cormack.
  26. Juha Nieminen. "An efficient LZW implementation: Fast LZW dictionary search". 2007.
  27. "LZRW4: ZIV AND LEMPEL MEET MARKOV" by Ross Williams 1991
  28. "Anatomy of ROLZ data archiver"
  29. a b "Lzp" by Arturo San Emeterio Campos 1999
  30. http://mattmahoney.net/dc/text.html
  31. "A Statistical Lempel-Ziv compression algorithm for personal digital assistant (PDA)". IEEE Transactions on Consumer Electronics. 2001-Feb.
  32. Ho; Yu Fan. United States Patent 7,507,897. "Dictionary-based compression of melody data and compressor/decompressor for the same".
  33. Encode: "ACB" discussion
  34. Michael Drmota, Yuriy A. Reznik, and Wojciech Szpankowski. "Tunstall Code, Khodak Variations, and Random Walks". 2010. "Tunstall Code, Khodak Variations, and Random Walks" 2008.
  35. David Salomon. [http://books.google.com/books?id=ujnQogzx_2EC&pg=PA61&lpg=PA61&dq=tunstall+code&source=bl&ots=FolApG2roS&sig=J0d3UB0Y5xbikfzE9fYtLjBB3-E&hl=en#v=onepage&q=tunstall%20code&f=false "Data Compression. 2.4 Tunstall Code"] p. 61. 2007.
  36. "Signal Compression: Variable to Fix Encoding"
Last modified on 21 October 2012, at 01:52