The Invent with Python Blog

Fri 17 August 2012

How Does Compression Work?

Posted by Al Sweigart in misc   

"Zip" programs that can compress multiple files into one smaller .zip file are fairly popular for downloads since the fewer bytes you have to download the faster it will download. But how do you compress files? Files are made up of ones and zeros, which can't be squished like clothes into a tight suitcase.

Think about file compression working like this: If someone asked you what the lyrics to "99 bottles of beer on the wall" were, you would tell them:

"The song goes '99 bottles of beer on the wall, 99 bottles of beer, take one down, pass it around, 98 bottles of beer on the wall', and then you repeat the lyrics with a number one less than the previous time."

The above sentence has 42 words in it. You wouldn't tell the person literally the entire song (which has 2,376 words in it). The reason you don't have to recite the full song is because of the song's repetition. Repetition makes compression possible. We've achieved a 98% compression rate (counting by the word amount) with the description of the lyrics over the literal lyrics (42 words vs. 2,376 words), even though both detail exactly the same song.

Say you are sending a telegram that charges by the letter. You notice that the trigram "the" is very frequent, not only for the word "the" but for words like "there", "they", or even "mathematics". You also notice that you almost never use the letter Q. What you can do is substitute "q" for "the" like this: "Q TEXTBOOK FOR QIR MAQMATICS CLASS IS QN RETURNED TO Q LIBRARY." (which has 63 letters) as opposed to "THE TEXTBOOK FOR THEIR MATHEMATICS CLASS IS THEN RETURNED TO THE LIBRARY." (which has 73 letters).

Sometimes you may need a literal Q itself rather than as a "THE" shortcut. In that case, you can just add the convention of using QQ instead: "Q QQUEEN'S MAQMATICS TEXTBOOK." Using the double-q makes the message longer (as does added explanation of how Q's replace "the" and the double-q convention, called the compressed file's dictionary), but since Q's are so rare, the savings from using Q in place of "the" results in net compression of the length of the message.

(If you ever need to literally use "QQ", you can still use the double-q convention to make it "QQQQ". For example, "COME TO MY BBQQ. THE EXTRA Q IS A TYPO." would become "COME TO MY BBQQQQ. Q EXTRA QQ IS A TYPO.")

File compression works the same way. A compression program will search through a file and find long series of bytes that are repeatedly used and replace them with a single byte (or as short series of bytes as possible) that rarely occurs in the file.

Morse Code Has Built-In Compression

Morse code works the same way. English letters are encoded as a series of dots (i.e. short beeps) and dashes (i.e. long beeps), much like computer files are encoded with ones and zeros. But not all letters are the same length. Since E and T are very common in English, a single dot and dash (respectively) are used to encode them. Dot-dot is used for I, dot-dash for A, dash-dot for N, and dash-dash for M. The longer combinations of dots and dashes are reserved for the rare letters like Q and Z and J. This way it takes fewer dots or dashes to send an English message.

Why Can't You Compress a Compressed File?

Compression is possible because of repetition. The more repetition you have, the more you can compress. Our compressed form of the "99 bottles of beer" lyrics was 2% the size of the original, literal lyrics because there is a lot of repetition. But files that don't have a lot of repetition can't be compressed that much, if at all.

Random numbers are an example of incompressible data. Since there aren't that many repeated patterns in random numbers nor does any number come up much more frequently than the others, trying to compress it often won't achieve enough compression to make up for the dictionary that needs to be added to the file.

Compressing a file reduces the amount of repetition in the file, so a compressed file itself isn't very compressible. If you try to put a compressed zip file inside another compressed zip file, it will be larger in size than the original zip file (due to the dictionary being added to it).

Encryption algorithms attempt to make a file's content secret by making the file indistinguishable to meaningless random numbers (aka garbage data), so encrypted files are often not very compressible either.

English text has a lot of repetition in it, such as the common trigrams "the", "and", "tha", "ent", and so on. English text is usually very compressible.

For an example, I've created two files: one with the words "Don't panic!" repeated over and over again, and another with randomly generated bytes. Both of these uncompressed files are 50,000 bytes. When I compress both files into a zip file, the "Don't panic!" file is 276 bytes in size, while the random file is 53,248 (the "compressed" zip file is larger than the original file!)

A 50,000 byte file of the first part of the Frankenstein novel (which is a good sample of English text) compresses down to 20,755 bytes (it's not nearly as repetitive as the "Don't panic!" file, but still has a lot of repetition in it.)




Why Don't We Compress Everything?

If we can reduce the space a file takes up, why don't we compress everything all the time? The reason is because compressing and decompressing a file takes time. If you have to decompress a file each time a program wants to read it (and re-compress it whenever a program modifies a file), then your programs would run slower. Often this savings on disk space isn't worth the slow down.

So there is a trade-off between disk space savings and time. Some compression programs let you choose how hard they will work to find repeated patterns in the file. The WinRAR program lets you choose the "Best" compression method (which will offer the most disk space savings, but takes the longest amount of time) down to the "Fastest" compression method (which can compress the file quickly, but doesn't compress it that much.)

However, since transferring a file over the Internet is a lot slower than reading a file on your hard drive, the time savings from downloading a compressed file make up for the small amount of time you have to decompress the file after it is downloaded.

Why Do We Need More Than One Compression Algorithm? (re: Lossy and Lossless)

It seems like compression is pretty straightforward, so why do we need more than one compression algorithm? The reason is that different algorithms can work with different types of data. The Lempel-Ziv-Welch (LZW) algorithm is used for zip files gif images (zip files use the DEFLATE algorithm), and is an all-around good compression algorithm. It is a lossless compression algorithm, meaning that when you decompress a file compressed with LWZ, you get the exact same file as the one that was compressed. This seems obviously necessary, otherwise compressing a file will corrupt it and make it unusable.

However, there are times when we don't need perfect, lossless compression. Lossy compression can be used for things like images, video, and audio.

The compression algorithm used in JPEG images uses lossy compression. You don't necessarily need a photo to have every pixel be completely accurate, because most humans won't be able to tell the difference. So a compression algorithm that has a side effect of lessening the image quality (but not noticeably) can achieve greater compression.

Here is a JPEG image that has high quality (and is 89kb in size):

Here is a JPEG image that has low quality (but is a much smaller 7kb in size):

If you look closely, you can see the low-quality side effects of the heavy JPEG compression. These are called compression artifacts, and all lossy compression algorithms have them. But if the artifacts aren't too noticeable or a high level of quality isn't needed, then you can achieve a large amount of compression.

MP3 files are also compressed. The MP3 compression algorithm, like JPEGs, can vary in trade-off between file size and quality. Larger MP3 files have a high bit rate and high sound quality. MP3 files with a low bit rate have lower quality (at the lowest quality, it sounds like a song is being played over a phone) but the file sizes are smaller.

Learning Compression Algorithms

Most programming languages have different compression algorithms already implemented in modules and libraries. Python comes with several modules for compression, such as the zlib, gzip, and zipfile modules. (The Python Module of the Week tutorial for zipfile is pretty good.)

If you want to learn how to implement these compression algorithms yourself, you should probably start with the relatively simple Huffman coding algorithm. Here and here are some example Python code for Huffman coding.

I hope this clears up the questions you have about file compression!