How Does Compression Work?
MP3 files are also compressed. The MP3 compression algorithm, like JPEGs, can vary in trade-off between file size and quality. Larger MP3 files have a high bit rate and high sound quality. MP3 files with a low bit rate have lower quality (at the lowest quality, it sounds like a song is being played over a phone) but the file sizes are smaller.
Learning Compression Algorithms
Most programming languages have different compression algorithms already implemented in modules and libraries. Python comes with several modules for compression, such as the zlib, gzip, and zipfile modules. (The Python Module of the Week tutorial for zipfile is pretty good.)
If you want to learn how to implement these compression algorithms yourself, you should probably start with the relatively simple Huffman coding algorithm. Here and here are some example Python code for Huffman coding.
I hope this clears up the questions you have about file compression!
Jeremy Lawler:
August 17th, 2012 at 8:42 pm
The only thing I would add is that you can compress a compressed file and get advantageous results. This is only true with extremely repetitive data and is dependent on the algorithm. This is usually only useful with things like log files. Just as a proof of concept, I took a massive postgresql log I had, and gzipped it twice. It was originally 105MB, was compressed to 30MB, and then again to 27MB.
Obviously, the the second compression isn’t as useful as the first, but it is worth noting that it is still useful.
Jay:
August 17th, 2012 at 9:13 pm
Nice article!
I just want to point out that, while this article presents the basics of compression well, in certain applications it IS possible to compress files more than once since the methods of one compression library may compliment the methods of another library, resulting in a file that is much smaller than one that has only been compressed once. There’s an interesting discussion about it over here https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ%5B1-25%5D
Jay:
August 17th, 2012 at 9:16 pm
sorry, bad link in the last post
fixed: https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ
Daniel:
August 17th, 2012 at 11:22 pm
Really great explanation on a topic that can become easily confusing.
Felice Pollano:
August 18th, 2012 at 5:41 am
You did not mention fractal compression http://en.wikipedia.org/wiki/Fractal_compression very interesting a a lossy but very effcient algorithm
Comet:
September 12th, 2012 at 9:25 am
’0 bottles of beer on the wall, 0 bottles of beer, take one down, pass it around, -1 bottles of beer on the wall’, and then you repeat the lyrics with a number one less than the previous time–infinite compression. :P
Govind Lahoti:
October 8th, 2012 at 1:47 am
Wow,great explanation.
Thank you AI
you are world best programmer :)
Karl Geiger:
November 12th, 2012 at 11:02 am
Lovely explanation using the “99 Bottles of Beer on the Wall” song.
Many think of compression as something to do to data. “99 Bottles” is a good example of algorithmic compression, wherein the program itself encodes the information and is the smallest (most compressed) way to represent the data.
Ross Patterson:
December 11th, 2012 at 6:52 am
You write “English text has a lot of repetition in it, such as the common trigrams “the”, “and”, “tha”, “ent”, and so on. English text is usually very compressible.”, which is certainly correct. But English text isn’t very compressible because is highly repetitive, rather it’s that text (specifically ASCII text) is highly wasteful. ASCII text uses only 37% of the possible values of every byte, so any tolerable compression algorithm ought to at least get close to that reduction in size.
Any decent Huffman-encoding or run-length-encoding (e.g., LZW) scheme will pick up on the common multi-character groups and compress even tighter, but the bulk of the win comes just from the fact that 8-bit ASCII is inefficient.