How Does Compression Work?

MP3 files are also compressed. Like JPEG, the MP3 compression algorithm lets you trade off file size against quality. MP3 files with a high bit rate have high sound quality but larger file sizes. MP3 files with a low bit rate have smaller file sizes but lower quality (at the lowest bit rates, a song sounds like it's being played over a phone).

Learning Compression Algorithms

Most programming languages have different compression algorithms already implemented in modules and libraries. Python comes with several modules for compression, such as the zlib, gzip, and zipfile modules. (The Python Module of the Week tutorial for zipfile is pretty good.)
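As a quick sketch (not from the original post), here's one way the standard-library zlib module might be used to compress and decompress some repetitive text; the sample string is just an assumption for illustration:

```python
import zlib

# Highly repetitive text, like song lyrics or log files, compresses very well.
text = b"99 bottles of beer on the wall, 99 bottles of beer. " * 1000

compressed = zlib.compress(text, level=9)  # level 9 = best compression
restored = zlib.decompress(compressed)

print(len(text), "->", len(compressed))  # compressed is a tiny fraction of the original
assert restored == text  # lossless: decompression recovers every byte
```

The gzip and zipfile modules work similarly but also produce .gz and .zip files that other programs can open.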

If you want to learn how to implement these compression algorithms yourself, you should probably start with the relatively simple Huffman coding algorithm. There are several example Python implementations of Huffman coding available online.
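To give a feel for the algorithm, here is a minimal Huffman coding sketch (my own illustration, not code from this post): it counts symbol frequencies, repeatedly merges the two least-frequent nodes, and reads off one code per symbol by walking the resulting tree.

```python
import heapq
from collections import Counter

def huffman_codes(data):
    """Build a prefix-free code table from symbol frequencies (minimal sketch)."""
    freq = Counter(data)
    # Heap entries: (frequency, unique tiebreaker, node). Leaves are symbols;
    # internal nodes are (left, right) pairs.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct symbol
        return {heap[0][2]: "0"}
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # two least-frequent nodes...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))  # ...merge them
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: recurse
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: record the symbol's code
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("abracadabra")
encoded = "".join(codes[ch] for ch in "abracadabra")
print(codes, encoded)
```

Frequent symbols (like "a" here) get short codes and rare ones get long codes, which is why the encoded bit string ends up shorter than a fixed-width encoding of the same text.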

I hope this clears up the questions you have about file compression!


9 comments on this post.
  1. Jeremy Lawler:

    The only thing I would add is that you can compress a compressed file and get advantageous results. This is only true with extremely repetitive data and is dependent on the algorithm. This is usually only useful with things like log files. Just as a proof of concept, I took a massive postgresql log I had, and gzipped it twice. It was originally 105MB, was compressed to 30MB, and then again to 27MB.

    Obviously, the second compression isn’t as useful as the first, but it is worth noting that it is still useful.
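    The commenter's double-compression experiment is easy to try yourself. This sketch (my own, using synthetic data as a stand-in for the commenter's real PostgreSQL log) compresses a very repetitive byte string twice with the gzip module and prints the three sizes:

    ```python
    import gzip

    # Stand-in for a very repetitive log file (the commenter used a real
    # PostgreSQL log; this synthetic data just mimics the repetition).
    log = b"2024-01-01 12:00:00 UTC [12345] LOG: connection received\n" * 50000

    once = gzip.compress(log)
    twice = gzip.compress(once)
    print(len(log), len(once), len(twice))
    ```

    Whether the second pass helps at all depends on the data and the algorithm; on most already-compressed files it actually adds a few bytes of overhead instead.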

  2. Jay:

    Nice article!
    I just want to point out that, while this article presents the basics of compression well, in certain applications it IS possible to compress files more than once, since the methods of one compression library may complement the methods of another library, resulting in a file that is much smaller than one that has only been compressed once. There’s an interesting discussion about it over here https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ%5B1-25%5D

  3. Jay:

    sorry, bad link in the last post
    fixed: https://groups.google.com/forum/#!msg/lz4c/DcN5SgFywwk/AVMOPri0O3gJ

  4. Daniel:

    Really great explanation on a topic that can become easily confusing.

  5. Felice Pollano:

    You did not mention fractal compression http://en.wikipedia.org/wiki/Fractal_compression very interesting as a lossy but very efficient algorithm

  6. Comet:

    “0 bottles of beer on the wall, 0 bottles of beer, take one down, pass it around, -1 bottles of beer on the wall”, and then you repeat the lyrics with a number one less than the previous time–infinite compression. :P

  7. Govind Lahoti:

    Wow, great explanation.
    Thank you Al,
    you are the world's best programmer :)

  8. Karl Geiger:

    Lovely explanation using the “99 Bottles of Beer on the Wall” song.

    Many think of compression as something to do to data. “99 Bottles” is a good example of algorithmic compression, wherein the program itself encodes the information and is the smallest (most compressed) way to represent the data.

  9. Ross Patterson:

    You write “English text has a lot of repetition in it, such as the common trigrams “the”, “and”, “tha”, “ent”, and so on. English text is usually very compressible.”, which is certainly correct. But English text isn’t very compressible because it is highly repetitive; rather, it’s that text (specifically ASCII text) is highly wasteful. ASCII text uses only 37% of the possible values of every byte, so any tolerable compression algorithm ought to at least get close to that reduction in size.

    Any decent Huffman-encoding or dictionary-based scheme (e.g., LZW) will pick up on the common multi-character groups and compress even tighter, but the bulk of the win comes just from the fact that 8-bit ASCII is inefficient.
