How Does Compression Work?

Encryption algorithms attempt to keep a file’s content secret by making the file indistinguishable from meaningless random numbers (aka garbage data), so encrypted files are usually not very compressible either.

English text has a lot of repetition in it, such as the common trigrams “the”, “and”, “tha”, “ent”, and so on. English text is usually very compressible.
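To make the idea of repeated trigrams concrete, here is a small Python sketch that counts the three-letter sequences inside the words of a piece of text. The function name `top_trigrams` and the sample sentence are made up for illustration; real English text would give different counts.

```python
import re
from collections import Counter

def top_trigrams(text, n=5):
    """Return the n most common three-letter sequences found inside words."""
    counts = Counter()
    # Split into lowercase words so trigrams never span a space or punctuation.
    for word in re.findall(r"[a-z]+", text.lower()):
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts.most_common(n)

# Hypothetical sample sentence, chosen only to show the counting.
sample = "the cat and the dog went to the other garden and then rested there"
print(top_trigrams(sample))
```

The more often the same short sequences turn up, the more a compression program has to work with.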

For example, I’ve created two files: one with the words “Don’t panic!” repeated over and over again, and another with randomly generated bytes. Both of these uncompressed files are 50,000 bytes. When I compress each file into its own zip file, the “Don’t panic!” zip is 276 bytes in size, while the random zip is 53,248 bytes. (The “compressed” zip file is larger than the original file!)

A 50,000-byte file containing the first part of the novel Frankenstein (which is a good sample of English text) compresses down to 20,755 bytes. It’s not nearly as repetitive as the “Don’t panic!” file, but it still has a lot of repetition in it.
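You can try a rough version of the two-file experiment yourself. The sketch below uses Python’s `zlib` module, which implements the same DEFLATE compression that zip files normally use, but it does not add the zip container’s own headers, so the byte counts will not match the zip file sizes quoted above. The variable names and the fixed random seed are my own choices for illustration.

```python
import random
import zlib

SIZE = 50_000

# "Don't panic! " repeated until we have exactly 50,000 bytes.
phrase = b"Don't panic! "
repeated = (phrase * (SIZE // len(phrase) + 1))[:SIZE]

# 50,000 pseudo-random bytes (seeded so every run produces the same data).
rng = random.Random(42)
random_bytes = bytes(rng.getrandbits(8) for _ in range(SIZE))

for name, data in [("repeated", repeated), ("random", random_bytes)]:
    compressed = zlib.compress(data)
    print(f"{name}: {len(data)} bytes -> {len(compressed)} bytes")
```

The repeated text should shrink to a tiny fraction of its size, while the random bytes should come out about as large as they went in.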

Download compressed_dontpanic.zip

Download compressed_random.zip

Download compressed_frankenstein.zip

Why Don’t We Compress Everything?

If we can reduce the space a file takes up, why don’t we compress everything all the time? The reason is that compressing and decompressing a file takes time. If you had to decompress a file each time a program wanted to read it (and re-compress it whenever a program modified it), then your programs would run slower. Often the savings in disk space aren’t worth the slowdown.

So there is a trade-off between disk space savings and time. Some compression programs let you choose how hard they will work to find repeated patterns in the file. The WinRAR program lets you choose anything from the “Best” compression method (which offers the most disk space savings, but takes the longest amount of time) down to the “Fastest” compression method (which compresses the file quickly, but doesn’t compress it that much).
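The same kind of speed-versus-size setting can be seen in Python’s `zlib` module, where the compression level runs from 1 (fastest) to 9 (most effort). This is only a stand-in for WinRAR’s presets, which use WinRAR’s own algorithm. The word list below is invented to produce some English-like test data, and the timings will vary from computer to computer.

```python
import random
import time
import zlib

# Build some English-like test data from a small, made-up word list.
rng = random.Random(1)
words = ["the", "and", "that", "entire", "monster", "creature", "night",
         "father", "letter", "feelings", "which", "would"]
data = " ".join(rng.choice(words) for _ in range(200_000)).encode("ascii")

results = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    results[level] = compressed
    print(f"level {level}: {len(compressed)} bytes in {elapsed * 1000:.1f} ms")
```

On most data, the higher levels take longer and produce somewhat smaller output, though how much you gain depends on the file.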

LZW is a lossless compression algorithm, meaning that when you decompress a file compressed with LZW, you get the exact same file as the one that was compressed. This seems obviously necessary; otherwise, compressing a file would corrupt it and make it unusable.

However, there are times when we don’t need perfect, lossless compression. Lossy compression can be used for things like images, video, and audio.

JPEG images use a lossy compression algorithm. You don’t necessarily need every pixel in a photo to be completely accurate, because most people won’t be able to tell the difference. So a compression algorithm that lowers the image quality as a side effect (but not noticeably) can achieve greater compression.
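Here is a toy illustration of the difference between lossless and lossy compression. It is not how JPEG works (JPEG’s actual method is considerably more involved); it only shows the general idea that throwing away a little detail before compressing can make the result much smaller. The “image row” is fake data I generate for the example, and `STEP` is an arbitrary choice.

```python
import math
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "image row": a smooth brightness wave plus a little noise.
samples = bytes(
    min(255, max(0, int(128 + 100 * math.sin(i / 200) + rng.randint(-4, 4))))
    for i in range(50_000)
)

# Lossless: decompressing gives back exactly the original bytes.
lossless = zlib.compress(samples)
assert zlib.decompress(lossless) == samples

# Lossy step: round every value down to a multiple of STEP, discarding detail.
STEP = 16
quantized = bytes((value // STEP) * STEP for value in samples)
lossy = zlib.compress(quantized)

# The discarded detail is gone for good; this is the price of the smaller size.
max_error = max(abs(a - b) for a, b in zip(samples, quantized))
print("lossless size:", len(lossless), "bytes")
print("lossy size:   ", len(lossy), "bytes")
print("largest per-sample error:", max_error)
```

Each value in the lossy version is off by less than `STEP`, which a viewer might never notice, but the original values can no longer be recovered.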

Here is a JPEG image that has high quality (and is 89 KB in size):

Here is a JPEG image that has low quality (but is a much smaller 7 KB in size):

If you look closely, you can see the low-quality side effects of the heavy JPEG compression. These are called compression artifacts, and all lossy compression algorithms have them. But if the artifacts aren’t too noticeable or a high level of quality isn’t needed, then you can achieve a large amount of compression.
