How Does Compression Work?

“Zip” programs that can compress multiple files into one smaller .zip file are fairly popular for downloads since the fewer bytes you have to download the faster it will download. But how do you compress files? Files are made up of ones and zeros, which can’t be squished like clothes into a tight suitcase.

Think about file compression working like this: If someone asked you what the lyrics to “99 bottles of beer on the wall” were, you would tell them:

“The song goes ’99 bottles of beer on the wall, 99 bottles of beer, take one down, pass it around, 98 bottles of beer on the wall’, and then you repeat the lyrics with a number one less than the previous time.”

The above sentence has 42 words in it. You wouldn’t tell the person literally the entire song (which has 2,376 words in it). The reason you don’t have to recite the full song is because of the song’s repetition. Repetition makes compression possible. We’ve achieved a 98% compression rate (counting by the word amount) with the description of the lyrics over the literal lyrics (42 words vs. 2,376 words), even though both detail exactly the same song.

Say you are sending a telegram that charges by the letter. You notice that the trigram “the” is very frequent, not only for the word “the” but for words like “there”, “they”, or even “mathematics”. You also notice that you almost never use the letter Q. What you can do is substitute “q” for “the” like this: “Q TEXTBOOK FOR QIR MAQMATICS CLASS IS QN RETURNED TO Q LIBRARY.” (which has 63 letters) as opposed to “THE TEXTBOOK FOR THEIR MATHEMATICS CLASS IS THEN RETURNED TO THE LIBRARY.” (which has 73 letters).

Sometimes you may need a literal Q itself rather than as a “THE” shortcut. In that case, you can just add the convention of using QQ instead: “Q QQUEEN’S MAQMATICS TEXTBOOK.” Using the double-q makes the message longer (as does added explanation of how Q’s replace “the” and the double-q convention, called the compressed file’s dictionary), but since Q’s are so rare, the savings from using Q in place of “the” results in net compression of the length of the message.

(If you ever need to literally use “QQ”, you can still use the double-q convention to make it “QQQQ”. For example, “COME TO MY BBQQ. THE EXTRA Q IS A TYPO.” would become “COME TO MY BBQQQQ. Q EXTRA QQ IS A TYPO.”)

File compression works the same way. A compression program will search through a file and find long series of bytes that are repeatedly used and replace them with a single byte (or as short series of bytes as possible) that rarely occurs in the file.

Morse Code Has Built-In Compression

Morse code works the same way. English letters are encoded as a series of dots (i.e. short beeps) and dashes (i.e. long beeps), much like computer files are encoded with ones and zeros. But not all letters are the same length. Since E and T are very common in English, a single dot and dash (respectively) are used to encode them. Dot-dot is used for I, dot-dash for A, dash-dot for N, and dash-dash for M. The longer combinations of dots and dashes are reserved for the rare letters like Q and Z and J. This way it takes fewer dots or dashes to send an English message.

Why Can’t You Compress a Compressed File?

Compression is possible because of repetition. The more repetition you have, the more you can compress. Our compressed form of the “99 bottles of beer” lyrics was 2% the size of the original, literal lyrics because there is a lot of repetition. But files that don’t have a lot of repetition can’t be compressed that much, if at all.

Random numbers are an example of incompressible data. Since there aren’t that many repeated patterns in random numbers nor does any number come up much more frequently than the others, trying to compress it often won’t achieve enough compression to make up for the dictionary that needs to be added to the file.

Compressing a file reduces the amount of repetition in the file, so a compressed file itself isn’t very compressible. If you try to put a compressed zip file inside another compressed zip file, it will be larger in size than the original zip file (due to the dictionary being added to it).

Page 1 of 3 | Next page