Data Compression/Executable Compression

executable software compression

edit

In this chapter, we are careful to distinguish between 3 kinds of software:

  • "data compression software", the implementation of algorithms designed to compress and decompress various kinds of files -- this is what we referred to as simply "the software" in previous chapters.
  • "executable files" -- files on the disk that are not merely some kind of data to be viewed or edited by some other program, but are the programs themselves -- text editors, word processors, image viewers, music players, web browsers, compilers, interpreters, etc.
  • "source code" -- human-readable files on the disk that are intended to be fed into a compiler (to produce executable files) or fed into an interpreter.

Executable files are, in some ways, similar to English text -- and source code is even more similar -- and so data compression software that was designed and intended to be used on English text files often works fairly well with executable files and source code.

However, some techniques and concepts that apply to executable software compression that don't really make sense with any other kind of file, such as:

  • self-decompressing executables, including boot-time kernel decompression[1][2][3][4]
  • dead-code elimination and unreachable code elimination (has some analogy to lossy compression)
  • refactoring redundant code, and the opposite process: inlining (has some analogy to dictionary compression)
  • some kinds of code size reduction may locally make a subroutine, when measured in isolation, appear run slower, but improve the net performance of a system. In particular, loop unrolling, inlining, "complete jump tables" vs "sparse tests", and static linking, all may make a subroutine appear -- when measured in isolation -- to run faster, but may increase instruction cache misses, TLB cache misses, and virtual memory paging activity enough to reduce the net performance of the whole system.[5]
    • procedural abstraction[6]
    • cross-jumping, also called tail merging[6]
  • pcode and threaded code
  • compressed abstract syntax trees are, as of 1998, the most dense known executable format, and yet execute faster than bytecode.[7]
  • JavaScript minifiers (sometimes called "JavaScript compression tools")[8] convert JavaScript source into "minified" JavaScript source that runs the same, but is smaller. CSS minifiers and HTML minifiers work similarly.[9]
    • Simple minifiers strip out comments and unnecessary whitespace.
    • Closure Compiler does more aggressive dead-code elimination and single-use-function inlining.
  • code compression[10]
  • run-time decompression[11]
  • demand paging, also called lazy loading, a way to reduce RAM usage
  • shared library
  • PIC shared library and other techniques for squeezing a Linux distribution onto a single floppy[12][13] or a single floppy X Window System thin client.[14]
  • copy-on-write and other technologies for reducing RAM usage[15]
  • various technologies for reducing disk storage, such as storing only the source code (possibly compressed) and a just-in-time in-memory compiler like Wikipedia: Tiny C Compiler (or an interpreter), rather than storing only the native executable or both the source code and the executable.
  • Selecting a machine language or higher-level language with high code density[16]
  • various ways to "optimize for space" (including the "-Os" compiler option)[17]
  • using newlib, uClibc, or sglibc instead of glibc[17][18][19]
  • code compression for reducing power[20]
  • Multiple "levels" or "waves" of unpacking: a file starts with a very small (machine-language) decompressor -- but instead of decompressing the rest of the file directly into machine language, the decompressor decompresses the rest of the file into a large, sophisticated decompressor (or interpreter or JIT compiler) (in machine language) and further data; then the large decompressor (or interpreter or JIT compiler) converts the rest of the file into machine language.[21]

compress then compile vs compile then compress

edit

Which stage of compilation gives the best compression:

    • compress the final machine-specific binary executable code?
    • compress the original machine-independent text source code? For example, JavaScript minifiers.
    • compress some partially compiled machine-independent intermediate code? For example, "Slim Binaries" or the "JAR format"[22]

Some very preliminary early experiments[23] give the surprising result that compressed high-level source code is about the same size as compressed executable machine code, but compressing a partially-compiled intermediate representation gives a larger file than either one.

filtering

edit

Many data compression algorithms use "filter" or "preprocess" or "decorrelate" raw data before feeding it into an entropy coder. Filters for images and video typically have a geometric interpretation. Filters specialized for executable software include:

  • "detect "CALL" instructions, converting their operands from relative addressing to absolute addressing, thus calls to the same location resulted in repeated strings that the compressor could match, improving compression of 80x86 binary code." (Wikipedia: LZX (algorithm)#Microsoft Cabinet files)
  • recoding branches into a PC-relative form[6]
  • Instead of decompressing the latest version of an application in isolation, starting from nothing, start from the old version of an application, and patch it up until it is identical to the new latest version. That enables much smaller update files that contain only the patches -- the differences between the old version of an application and the latest version. This can be seen as a very specific kind of data differencing.
    • The algorithm used by BSDiff 4 uses using suffix sorting to build relatively short patch files[24]
    • Colin Percival, for his doctoral thesis, has developed an even more sophisticated algorithm for building short patch files for executable files.[24]
  • "disassemble" the code, converting all absolute addresses and offsets into symbols; then patch the disassembled code; then "reassemble" the patched code. This makes the compressed update files for converting the old version of an application to the new version of an application much smaller.[25]

Several programmers believe that a hastily written program will be at least 10 times as large as it "needs" to be. [26] [27]

A few programmers believe that 64 lines of source code is more than adequate for many useful tools.[28]

The aim of the STEPS project is "to reduce the amount of code needed to make systems by a factor of 100, 1000, 10,000, or more." [29]


Most other applications of compression -- and even most of these executable compression techniques -- are intended to give results that appear the same to human users, while improving things in the "back end" that most users don't notice. However, some of these program compression ideas (refactoring, shared libraries, using higher-level languages, using domain-specific languages, etc.) reduce the amount of source that a human must read to understand a program, resulting in a significantly different experience for some people (programmers). That time savings can lead to significant cost reduction.[30] Such "compressed" source code is arguably better than the original; in contrast to image compression and other fields where compression gives, at best, something identical to the original, and often much worse.

References

edit
  1. Embedded Linux wiki: "Fast Kernel Decompression"
  2. Smallcode: "Self-extracting executables " by Peter Kankowski, with a wiki discussion at strchr: "creating_self-extracting_executables"
  3. UPX is a portable executable packer for several different executable formats including linux/elf386, vmlinuz/386 and win32/pe.
  4. ficl "Ficl ... LZ77 compressed ... with the runtime decompressor, the resulting Ficl executable is over 13k smaller"
  5. "Managing code size"
  6. a b c "Enhanced Code Compression for Embedded RISC Processors" by Keith D. Cooper and Nathaniel McIntosh
  7. "Java: Trees Versus Bytes" thesis by Kade Hansson. "A Compact Representation For Trees ... is the most dense executable form that he knows of. ... The creation of compact tree representations ... achieving better compression and developing faster codecs will no doubt be a growth area of research in coming years. One such project in this area may lead to an eventual replacement for the Java class file format."
  8. Javascript Compression tools
  9. w: minification (programming)
  10. "Code compression under the microscope" by Jim Turley 2004
  11. Charles Lefurgy, Eva Piccininni, and Trevor Mudge. "Reducing Code Size with Run-time Decompression".
  12. "Brian Writes about His BOEL (Part 1)" (a Linux distribution that fits on a single floppy)
  13. "BOEL, Part 2: Kernel Configuration and Booting"
  14. "2-Disk Xwindow embedded Linux": "a single floppy X Window System thin client."
  15. embedded Linux wiki: "Technologies for decreasing system size"
  16. Microprocessor Design/Code Density
  17. a b "Optimizing for Space : Measurements and Possibilities for Improvement" by Árpád Beszédes, Tamás Gergely, Tibor Gyimóthy, Gábor Lóki, and László Vidács 2003 from the GCC ARM Improvement Project at the University of Szeged
  18. uClibc FAQ
  19. "Embedding with GNU: Newlib" by Bill Gatliff 2001; "Embedding GNU: Newlib, Part 2" by Bill Gatliff 2002
  20. "COMPASS - A tool for evaluation of compression strategies for embedded processors" by Menon and Shankar, which cites "A Class of Code Compression Schemes for Reducing Power Consumption in Embedded Microprocessor Systems" by Benini, Menichelli, and Olivieri
  21. "A new visualization for packed and self-modifying programs".
  22. "Slim Binaries"
  23. "Code Size When Compiling to JavaScript" http://mozakai.blogspot.com/2011/11/code-size-when-compiling-to-javascript.html
  24. a b Colin Percival, Naive differences of executable code, http://www.daemonology.net/bsdiff/, 2003. [http://www.daemonology.net/bsdiff/
  25. "How Courgette works" by Stephen Adams 2009 at The Chromium Projects.
  26. "Thoughtful Programming and Forth" by Jeff Fox
  27. "1x Forth" by Charles Moore 1999
  28. "useful source code in 64 lines or less"
  29. "STEPS Toward The Reinvention of Programming" 2007 apparently (?) written by Alan Kay, Ted Kaehler, Ian Piumarta, Yoshiki Ohshima, Alex Warth, Takashi Yamamiya, Dan Amelang, Scott Wallace, Dan Ingalls, Andreas Raab, Jim Clune, John Maloney, Dave Smith, Kim Rose, Chuck Thacker, Vishal Sikka, and David Reed.
  30. "Code Reduction is Job #1"; "Should Code Reduction be Job #1?"