Data Compression/References

Benchmark files

The Canterbury Corpus (1997) is the main benchmark for comparing compression methods. It contains 11 files, the largest of which is roughly 1 MByte. Its web page also links to a few other test files that are useful for debugging common errors in compression algorithms.
The Silesia Corpus (2003) contains 12 files between 6 MB and 51 MB, including two medical images, the SAO star catalog, a few executable files, etc.
Matt Mahoney has published a large benchmark text file used in the "Large Text Compression Benchmark".
- The Large Text Compression Benchmark uses the file "enwik9" (1,000,000,000 bytes), the first 10^9 bytes of the English Wikipedia dump of Mar. 3, 2006.
- The Hutter Prize involves compressing the file "enwik8" (100,000,000 bytes), the first 10^8 bytes of enwik9, ultimately from Wikipedia.
A large-file text compression corpus, maintained by Andrew Tridgell, is oriented towards relatively large, highly redundant files. It contains 5 files between 27 MB and 85 MB (uncompressed), mostly English text and Lisp, assembly, and C source code. It helps test (implementations of) compression algorithms designed to detect and compress very long-range redundancies, such as lzip[2] and rzip[3].
"The Calgary corpus"[4][5] is a series of 14 files, most of them ASCII text, and was the de facto standard for comparing lossless compressors before the Canterbury Corpus.
- "The Calgary corpus Compression & SHA-1 crack Challenge" (formerly known as "The Calgary Corpus Compression challenge") by Leonid A. Broukhis has paid out several prizes of around $100 each for "significantly better" compression of all 14 files in the Calgary Corpus.
"The Data Compression News Blog" is edited by Sachin Garg, who has also published benchmark images and image compression benchmark results.
Lasse Collin uses open-source software in his executable compression benchmark.
Elephants Dream: Original lossless video and audio available: Matt suggests "It would be great to see Elephants Dream become the new standard source footage for video and audio compression testing!"
Alex Ratushnyak maintains the Lossless Photo Compression Benchmark.
"Xiph.org Video Test Media (derf's collection)" -- it includes the "SVT High Definition Multi Format Test Set".
Waterloo BragZone repository (Where?) (Some (all?) of its images are available at http://links.uwaterloo.ca/Repository.html )

To do:
Are there any benchmarks for evaluating differential compression?

To do:
Should we have a list of desired features for future benchmark sets, something like "Some Data Compression Corpora We Need Badly"?

open-source example code

Most creators of data compression algorithms release them with open-source implementations (mostly under BSD-compatible licenses, not the GPL). One benefit of being open source is that the release serves as an open review and a call for participation, which makes it much easier to evolve the algorithm by combining ideas from many sources (all the more so when the licenses are compatible). An open-source algorithm can also be adopted rapidly and gain market share and predominance, even achieving standardization in itself or in niche implementations. This, of course, can also be the reason why some algorithms remain closed source, especially when they provide a clear commercial advantage over competitors (commercial or not).

To do:
Should we link to a good, open-source, well-commented implementation of, for example, LZW *here*, or in the section of the book that discusses LZW?

To do:
Point to or provide some basic information on the difference between choosing the GPL and choosing an MIT/BSD license. I made the relevant distinction but did not explain it... Note that the new GPL now also has implications regarding patents.

Compression Interface Standard by Ross Williams. Is there a better interface standard for compression algorithms?

jvm-compressor-benchmark is a benchmark suite for comparing the time and space performance of open-source compression codecs on the JVM platform. It currently includes the Canterbury corpus and a few other benchmark file sets, and compares LZF, Snappy, LZO-java, gzip, bzip2, and a few other codecs. (Is the API used by the jvm-compressor-benchmark to talk to these codecs a good interface standard for compression algorithms?)

inikep has put together a benchmark for comparing the time and space performance of open-source compression codecs that can be compiled with C++. It currently includes 100 MB of benchmark files (bmp, dct_coeffs, english_dic, ENWIK, exe, etc.), and compares snappy, lzrw1-a, fastlz, tornado, lzo, and a few other codecs.
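As a rough illustration of what such time-and-space benchmarks measure, here is a minimal sketch in Python. It is not the code of either suite above: it assumes only the codecs in the Python standard library (zlib, bz2, lzma) as stand-ins for the codecs those suites compare, and the sample input is made up.

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs; they stand in for the third-party codecs
# (Snappy, LZO, ...) that the real benchmark suites compare.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data):
    """Return {codec name: (compressed size, compress s, decompress s)}."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        mid = time.perf_counter()
        unpacked = decompress(packed)
        end = time.perf_counter()
        # A benchmark is meaningless if the codec loses data.
        if unpacked != data:
            raise ValueError(f"{name} did not round-trip the input")
        results[name] = (len(packed), mid - start, end - mid)
    return results


# Hypothetical sample input; a real run would read corpus files instead.
sample = b"the quick brown fox jumps over the lazy dog\n" * 2000
results = benchmark(sample)
for name, (size, c_time, d_time) in results.items():
    print(f"{name:5s} {size:7d} bytes  {c_time:.4f}s  {d_time:.4f}s")
```

A single timing like this is noisy; a more careful harness would repeat each measurement several times and use real corpus files rather than a synthetic string.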

"Compression the easy way": a simple C/C++ implementation of variable-bit-length LZW in one .h file and one .c file, with no dependencies.
BALZ by Ilia Muraviev - the first open-source implementation of ROLZ compression^[1]
QUAD - an open-source ROLZ-based compressor from Ilia Muraviev
LZ4 "the world's fastest compression library" (BSD license)
QuickLZ "the world's fastest compression library" (GPL and commercial licenses)
FastLZ "free, open-source, portable real-time compression library" (MIT license)
The .xz file format (one of the compressed file formats supported by 7-Zip and LZMA SDK) supports "Multiple filters (algorithms): ... Developers can use a developer-specific filter ID space for experimental filters." and "Filter chaining: Up to four filters can be chained, which is very similar to piping on the UN*X command line."
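The filter chaining described for .xz can be tried from Python's standard `lzma` module, which accepts a filter chain when writing .xz data. The sketch below is only an illustration: the choice of a delta filter in front of LZMA2, the distance of 4, and the sample data are all assumptions made for this example.

```python
import lzma

# A delta filter followed by LZMA2. The chain may hold up to four
# filters, and the last one must be a compression filter.
filters = [
    {"id": lzma.FILTER_DELTA, "dist": 4},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

# Made-up sample: 32-bit little-endian counters. Subtracting the byte
# four positions earlier turns them into highly repetitive bytes.
sample = b"".join(i.to_bytes(4, "little") for i in range(0, 40000, 4))

chained = lzma.compress(sample, format=lzma.FORMAT_XZ, filters=filters)
plain = lzma.compress(sample, format=lzma.FORMAT_XZ)

# The chain is recorded in the .xz stream, so decompression needs no hints.
restored = lzma.decompress(chained)

print("with delta filter:", len(chained), "bytes")
print("default settings: ", len(plain), "bytes")
```

Whether a pre-filter helps depends entirely on the data; the delta filter suits fixed-width numeric records like the ones fabricated here and usually does nothing useful for plain text.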

"Puff -- A Simple Inflate" by Mark Adler. Written to be very easy to read, to help readers understand the deflate data format. Uses less RAM and has a smaller code size than zlib.
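To get a raw deflate stream to study alongside an inflate implementation such as Puff, one option is Python's `zlib` module with a negative `wbits` value. This is a small sketch that has nothing to do with Puff's own code; the sample input is made up, and only the first block header is inspected.

```python
import zlib

sample = b"abracadabra " * 50  # made-up input for illustration

# wbits=-15 asks zlib for a raw deflate stream: no zlib header or checksum.
compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = compressor.compress(sample) + compressor.flush()

# Deflate packs header bits starting from the least significant bit:
# bit 0 is BFINAL, bits 1-2 are BTYPE.
bfinal = raw[0] & 1
btype = (raw[0] >> 1) & 3
names = {0: "stored", 1: "fixed Huffman", 2: "dynamic Huffman"}
print("first block:", names.get(btype, "reserved"),
      "(final)" if bfinal else "(not final)")

# Inflate the raw stream again to confirm it round-trips.
restored = zlib.decompress(raw, -15)
```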

libarchive (win32 LibArchive): a library for reading and writing streaming archives. The bsdtar archiving program is based on libarchive. libarchive is highly modular, "designed ... to make it relatively easy to add new archive formats and compression algorithms". It can read and write (including compression and decompression) archive files in a variety of archive formats, including ".tgz" and ".zip". BSD license. libarchive WishList.

WebP is a new image format that provides lossless and lossy compression for images on the web. "WebP lossless images are 26% smaller in size compared to PNGs. WebP lossy images are 25-34% smaller in size compared to JPEG images at equivalent SSIM index." WebP is apparently the *only* format supported by web browsers that supports both lossy compression and an alpha channel in the same image. When the experimental "data compression proxy" is enabled in Chrome for Android, all images are transcoded to WebP format.^[2] BSD license.

VP8 and WebM video compression ...

The Ogg container format, often containing compressed audio in Vorbis, Speex, or FLAC format, and sometimes containing compressed video in Theora or Dirac format, etc.

libPGF: a library for reading and writing files in the Progressive Graphics File (PGF) format. Uses a fast wavelet transform; supports lossless and lossy compression and alpha transparency. LGPL.

OptiPNG: Advanced PNG Optimizer. zlib license.
- PNGtech: Technical articles on the PNG file format and the related compression technologies.

Further reading

Guide to Unix/Commands/File Compression has some practical information on how to use data compression
Fedora And Red Hat System Administration/Archives And Compression has some practical information on how to use compression
JPEG - Idea and Practice has more detailed information on the specific details of how compression techniques are applied to JPEG image compression.
Data Coding Theory/Data Compression
Kdenlive/Video codecs briefly mentions the most popular video codecs
Movie Making Manual/Post-production/Video codecs discusses the most popular video codecs used in making movies and video, in a little more detail.
Movie Making Manual/Cinematography/Cameras and Formats/Table of Formats lists the most popular compressed and uncompressed video formats
Probability
The hydrogenaudio wiki has a comparison of popular lossless audio compression codecs.
a data compression wiki

non-wiki resources

"comp.compression" newsgroup
- "Comp.compression FAQ"
- comp.compression Frequently Asked Questions by Jean-loup Gailly, 1999. (Is there a more recent FAQ?)
http://data-compression.info/ has some information on several compression algorithms, several "data compression corpora" (benchmark files for data compression), and the results from running a variety of data compression programs on those benchmarks (measuring compressed size, compression time, and decompression time).
"Data Compression Explained" by Matt Mahoney. It discusses many things neglected in most other discussions of data compression, such as the practical features of a typical archive format (the stuff in the thin wrapper around your precious compressed data), the close relation between data compression and artificial intelligence, etc.
Mark Nelson writes about data compression
- Mark Nelson and Jean-loup Gailly. "The Data Compression Book". 1995. ISBN 1-55851-434-1.
The Encode's Forum claims to be "probably the biggest forum about the data compression software and algorithms on the web".
"The LZW controversy" by Stuart Caie. (LZ78, LZW, GIF, PNG, Unisys, patents, etc.)
"Understanding gzip" by Zachary Vance (za3k). A very detailed, bit-by-bit analysis of three gzip files (and the deflate data format).
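For readers who want to follow along with that kind of byte-level analysis, here is a minimal sketch that pulls apart the fixed header and the trailer of a gzip member, following the field layout in RFC 1952. The sample data is made up, and the sketch deliberately ignores the optional header fields (file name, comment, extra data) that the FLG byte can announce.

```python
import gzip
import struct
import zlib

sample = b"hello, hello, hello, gzip\n"  # made-up input for illustration
blob = gzip.compress(sample)

# Fixed 10-byte header: ID1, ID2, CM, FLG, MTIME (4 bytes), XFL, OS.
id1, id2, method, flags, mtime, xfl, os_code = struct.unpack(
    "<BBBBIBB", blob[:10]
)

# 8-byte trailer: CRC-32 of the uncompressed data, then its length
# modulo 2**32, both little-endian.
crc, isize = struct.unpack("<II", blob[-8:])

print(f"magic: {id1:#04x} {id2:#04x}, method: {method} (8 = deflate)")
print(f"flags: {flags:#04x}, mtime: {mtime}, xfl: {xfl}, os: {os_code}")
print(f"crc32 matches: {crc == zlib.crc32(sample)}, isize: {isize}")
```

Everything between the header and the trailer is the deflate stream itself, which is where the bit-by-bit work begins.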

References


[1] "Anatomy of ROLZ data archiver"

[2] [1]
