Hacking NEC PC Games/Dealing With Compressed Data

Game data may be compressed for a couple of reasons. The first is storage space: compressed data is smaller, which means more and better content fits on the storage media. The second is speed. Loading from disk takes a while, and the less of it a game does the better, especially if decompression runs faster than the disk read it replaces. Disk loads are also somewhat risky... it's not uncommon for floppy drives to "eat" diskettes, so the fewer loads, the better.

A few methods for compressing data were in common use in the late 80s and early 90s:

  • LZ encoding
  • Huffman encoding
  • Run-length encoding

Run-Length Encoding

Run-length encoding (RLE) is very simple: recognize a run of repeated data and replace it with a short token giving the repeated value and the number of repetitions. It is the fastest of these methods, easy to understand, and is often used to encode level data and sprites.
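
As a concrete sketch, here is a minimal RLE decoder in C. The stream format it assumes, a 0x90 marker byte followed by the value to repeat and a repeat count, is purely illustrative; every game tends to use its own variant.

    #include <stddef.h>

    /* Decode a hypothetical RLE stream: the byte 0x90 marks a run and is
       followed by the value to repeat and a repeat count.  Any other byte
       is a literal and is copied through unchanged.  Returns the number
       of bytes written to dst. */
    size_t rle_decode(const unsigned char *src, size_t src_len,
                      unsigned char *dst, size_t dst_cap)
    {
        size_t in = 0, out = 0;
        while (in < src_len) {
            if (src[in] == 0x90 && in + 2 < src_len) {
                unsigned char value = src[in + 1];
                unsigned char count = src[in + 2];
                in += 3;
                while (count-- && out < dst_cap)
                    dst[out++] = value;
            } else {
                if (out < dst_cap)
                    dst[out++] = src[in];
                in++;
            }
        }
        return out;
    }

Note that a matching encoder must escape any literal 0x90 byte in the input, typically by emitting it as a run of length one.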

LZ Encoding

Run-length encoding has its applications, but in many cases it isn't particularly useful: how many words in a book are repeated one right after another? When data recurs throughout a work but rarely back to back, a dictionary approach may be more effective.

In dictionary compression, every unique piece of data in a work is given its own number and added to a list. The data in the work is replaced with these numeric references, and the dictionary is tacked on to the beginning. A given book is highly unlikely to contain more than 65,536 distinct words, so every word can be replaced by a 2-byte reference into the dictionary. Assuming words average 6 or 7 letters, this works out to a savings of around 70% once the work is long enough to amortize the dictionary itself.
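
The decoding side of such a scheme is trivial. This sketch assumes a made-up layout in which the dictionary is already in memory as an array of strings and the body is a stream of little-endian 16-bit indices into it:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* An illustrative dictionary; a real file would store the word list
       at the start of the compressed data. */
    static const char *dict[] = { "the", "quick", "brown", "fox" };

    /* Expand a stream of 2-byte little-endian dictionary references. */
    static void dict_decode(const uint8_t *refs, size_t n_refs)
    {
        for (size_t i = 0; i < n_refs; i++) {
            uint16_t index = refs[2 * i] | (refs[2 * i + 1] << 8);
            printf("%s ", dict[index]);
        }
        putchar('\n');
    }

    int main(void)
    {
        /* Eight bytes of "compressed" text: indices 0, 1, 2, 3. */
        const uint8_t body[] = { 0, 0, 1, 0, 2, 0, 3, 0 };
        dict_decode(body, sizeof body / 2);  /* prints "the quick brown fox" */
        return 0;
    }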

LZ encoding is dictionary compression with a twist. Instead of a single fixed dictionary at the beginning of the work, the LZ method builds a dictionary that "evolves", populating itself with only the most recently processed data. The dictionary "remembers" the data it encountered most recently, up to a point, and discards older entries. To see why this is useful, consider a novel in which a certain character appears for only one chapter. Why should the decompressor remember that character while decoding later chapters? The character's name and unique attributes can be forgotten and need no longer be stored in memory. LZ encoding strikes a balance between dictionary size and compression ratio, which usually lands in the neighborhood of 50%. For dithered bitmap images in particular, the LZ method is highly effective.
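
A decoder for a typical LZ77-style stream might look like the sketch below. The token format, one flag bit per item with back-references packed as a 12-bit distance and a 4-bit length, is a common arrangement but is assumed here rather than taken from any particular game:

    #include <stdint.h>
    #include <stddef.h>

    /* Decode an assumed LZ77-style stream.  A control byte supplies eight
       flag bits, LSB first: 1 means a literal byte follows, 0 means a
       2-byte back-reference follows (12-bit distance, 4-bit length + 3).
       The "dictionary" is simply the last 4096 bytes already written. */
    size_t lz_decode(const uint8_t *src, size_t src_len,
                     uint8_t *dst, size_t dst_cap)
    {
        size_t in = 0, out = 0;
        while (in < src_len) {
            uint8_t flags = src[in++];
            for (int bit = 0; bit < 8 && in < src_len; bit++) {
                if (flags & (1u << bit)) {            /* literal byte */
                    if (out < dst_cap)
                        dst[out++] = src[in];
                    in++;
                } else {                              /* back-reference */
                    if (in + 1 >= src_len)
                        break;
                    uint16_t pair = src[in] | (src[in + 1] << 8);
                    in += 2;
                    size_t dist = (pair & 0x0FFF) + 1;  /* 1..4096 back */
                    size_t len  = (pair >> 12) + 3;     /* 3..18 bytes  */
                    /* Copy byte by byte so overlapping runs work. */
                    while (len-- && out < dst_cap && dist <= out) {
                        dst[out] = dst[out - dist];
                        out++;
                    }
                }
            }
        }
        return out;
    }

Copying byte by byte matters: when the distance is smaller than the length, the reference overlaps the bytes it is producing, which is how LZ schemes get RLE-like behavior for free.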