Data Coding Theory/Data Compression

Data Coding Theory

Data Compression

When transmitting digital data, we find that frequently we can't send our information as quickly as we would like. Specific limits, such as Shannon's Channel Capacity, restrict the amount of digital information that can be transmitted over a given channel. Therefore, if we want to transmit more data, we need to find a way to make our data smaller.

To make data smaller, we implement one of a number of techniques known as data compression. There are a number of different compression algorithms, but they can all be broken down into two categories: Lossless algorithms, and lossy algorithms.

The fundamental function of a compression is to remove redundancy, where redundancy is all that could be removed or expressed in a different way, whilst not removing it's meaning.

Lossy Algorithms

Lossy algorithms are techniques that can be used to transmit a "pretty good" version of the data. With a lossy algorithm, there is always going to be a certain amount of data lost in the conversion. Lossy algorithms provide much higher compression rates then lossless algorithms, but the downfall is that information must be lost to attain those high rates. An example of a lossy compression algorithm is .jpeg image files, or mp3 music files. If the lossy algorithm is good enough, the loss might not be noticeable by the recipient.

Lossless Algorithms

Lossless algorithms decrease the size of a given signal, while at the same time not losing any information from the original. For this reason, Lossless compression algorithms are preferable to lossy algorithms, especially when the data needs to arrive at the recipient intact. Examples of lossless compression algorithms are ZIP files, and GIF images.

Run-Length Encoding

Run-length encoding (RLE) is probably one of the best known compression techniques. Here is how it works: Let's assume we have some input data: 'aaaabbbc' Now RLE compresses this by expressing the amount of times each symbol occurs, if it occurs more than 2 times. This 'compresses' to '4a3bc' which means as much as 4 x a, 3 x b, 1 x c. We don't express a data item explicitly if it occurs twice or just once. We would just lose space.

Huffman Coding

Huffman coding is a very powerful compression technique that can be used as an optimal lossless encoding technique. It tries to assign most recurring words(data) with fewer number bits and least recurring words with a greater number of bits.

Huffman at wikipedia: w:Huffman coding

Variable-Length Encoding

Notes

Data compression, while a related field to coding theory, is not strictly in the scope of this book, and so we will not cover it any further here.