from Guide to Machine Learning on Feb 3, 2024
How lossless compression works
There are all sorts of compressed files floating around the web. Most of these take on familiar file extensions, such as .zip or .gz. But how do these compressed formats actually work?
Formalities: Compress vs. Archive, Lossless vs. Lossy
Before we discuss compressors in detail though, we need to distinguish compressors from archivers.
 Compressors reduce the storage size for a single file.

Archivers reduce the storage size for a group of files.
 Archivers simply bundle files together, then run a third-party compressor on the bundle. Technically speaking, then, archivers don't reduce storage size on their own.
 For example, .tar files fall in this category; since tar on its own isn't compressing the data, you'll often find a second file extension after tar, such as .tar.gz, indicating tar was used to archive and gzip was used to compress.
We're going to focus on the first one — compressors. Within compression algorithms, there are mainly two types:
 Lossless compression algorithms exactly reconstruct the original data without any loss in quality.
 Lossy compression algorithms sacrifice some fidelity in the reconstruction. For images for example, we may sacrifice reconstruction accuracy if we can compress further while preserving perceived visual quality.
Again, we're going to focus on the first one — lossless compression. Specifically, in this post, we'll discuss how compression algorithms work in popular tools such as zip, then work our way up to more modern alternatives such as zstd.
Remove redundancies
Say you were building a compressor from scratch. One of your first pieces of intuition is likely to remove redundancies. At a high level, scan the data for repeating sequences, then replace each occurrence with a shorter "code". This is usually referred to as table-based compression.
Example. Let's run through an example to get a rough idea of table-based compression's effectiveness. Say we have the following sequence of values, which is just ABC repeated 10 times. Ignore the spaces, as I inserted them just for visual clarity.
ABC ABC ABC ABC ABC ABC ABC ABC ABC ABC
This takes 30 bytes to represent^{1}. Let's now use a dictionary, and replace ABC with the 🔥 emoji.
🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥
🔥 = ABC
This could take about 15 bytes to represent^{2} — 5 bytes for the dictionary mapping 🔥 to ABC and 1 byte for every 🔥. In short, a global dictionary reduced our file size by 2x. There are several inherent problems with this approach however, which impede both compression and decompression.
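Before digging into those problems, here's a minimal sketch of the substitution idea in Python. The function names are mine, and I use a 1-byte placeholder instead of an emoji, since emojis actually take several bytes in UTF-8:

```python
# Toy table-based compression: replace a repeated pattern with a
# short code, and ship the dictionary alongside the compressed data.
def compress_with_table(data: str, pattern: str, code: str):
    table = {code: pattern}
    return data.replace(pattern, code), table

def decompress_with_table(data: str, table: dict) -> str:
    # Expand each code back into the pattern it stands for.
    for code, pattern in table.items():
        data = data.replace(code, pattern)
    return data

compressed, table = compress_with_table("ABC" * 10, "ABC", "\x01")
# The 30-character input shrinks to 10 codes plus the small dictionary.
assert decompress_with_table(compressed, table) == "ABC" * 10
```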
Problem: A global approach is slow and memory-hungry. Both compression and decompression would consume prohibitively more resources as files grow — ironically, the more we need a compressor, the less effectively it'd run.

For compression, we need to scan the entire file looking for repeats:
 We could store a running count of all possible repeating patterns, but this quickly consumes orders of magnitude more memory than the original file size^{3}.
 Alternatively, scan the file multiple times looking for repeats, meaning the compressor runs really slowly.

Decompression would likewise run slowly.
 A global dictionary would need to store all possible patterns, across the entire file. This means one really large dictionary stored somewhere in memory.
 Due to its large size, the dictionary would at some point exceed the cache size and require expensive DRAM ↔ SRAM swaps. As a result, a global dictionary is intuitive but inefficient.
In short, exhaust memory or exhaust time. Neither are desirable.
Insight: Instead, look for repeats in a sliding window of the last several thousand values. This now means we only need to store, build, and use a small, local dictionary:
 Now, compression only considers a running count of repeating patterns in a small, fixed-size window — much more tractable than an ever-growing file.
 Decompression likewise only needs to maintain a smaller, fixed-size dictionary for lookups.
To summarize then, our first compression tool is to remove redundancies in a sliding window.
Encode local repeats
Previously, we compressed a 30-byte sequence of 10 "ABC"s to the following, where we replaced all "ABC"s with 🔥:
🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥
🔥 = ABC
However, the resulting representation is still wasteful. Instead of storing the emoji once for every occurrence, let's just store the number of times the pattern occurs:
ABC10
This takes about 4 bytes to represent — 3 bytes for the sequence ABC and 1 byte for the count. This is called a "run-length encoding" (RLE), and it compressed the original 30 bytes to just 4 bytes. All in all, run-length encoding reduced our file size by roughly 7x.
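The idea above can be sketched with a character-level run-length encoder. This is a toy version (function names mine) that works on runs of single characters rather than multi-character patterns like ABC:

```python
from itertools import groupby

# Run-length encoding: store each value once, along with how many
# times it repeats consecutively.
def rle_encode(data: str):
    return [(ch, len(list(run))) for ch, run in groupby(data)]

def rle_decode(runs) -> str:
    # Expand each (value, count) pair back into a run of values.
    return "".join(ch * count for ch, count in runs)

# A long run compresses well; "AAAAAAAAAA" becomes a single pair.
assert rle_encode("AAAAAAAAAA") == [("A", 10)]
assert rle_decode(rle_encode("AAABCC")) == "AAABCC"
```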
Problem: Run-length encodings (RLE) only capture consecutive repeats, but we need a flexible way to represent repeats in different locations. For example, consider the following sequence of values.
ABCDEF
ABCDE
ABCD
ABC
Here, since repeats don't occur back-to-back, RLE has no way to compress the data. As a result, we need a slightly more flexible variant.
Insight: Encode the number of positions to go backwards and the number of positions to read. This is called a length-distance pair. Say we want to encode our second line ABCDE. To do this using a length-distance pair, we can say "go backwards 6 positions and read the next 5 characters".
In this way, we can grab the first 5 letters from "ABCDEF" in the first line to encode the second line's "ABCDE". Furthermore, we can encode every line above by going backwards by X positions and reading X−1 letters. Here's what that would look like.
ABCDEF
back 6, read 5
back 5, read 4
back 4, read 3
This compresses our original 18-byte sequence into roughly 12 bytes. As a result, our second compression tool is to encode repeats as a function of what we've decoded so far, without explicitly storing a table and instead using length-distance pairs.
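Decoding length-distance pairs is simple enough to sketch directly. Here's a toy decoder (name `lz_decode` is mine) where each token is either a literal string or a (distance, length) pair:

```python
# Decode a stream of literals and length-distance pairs: a pair
# (distance, length) means "go back `distance` positions in the
# output so far and copy the next `length` values".
def lz_decode(tokens) -> str:
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.extend(token)          # literal: copy as-is
        else:
            distance, length = token
            for _ in range(length):
                # Copy one value at a time, so a pair may even
                # reference output it is itself producing.
                out.append(out[-distance])
    return "".join(out)

# The example from the text: ABCDEF, then back 6 read 5,
# back 5 read 4, back 4 read 3.
assert lz_decode(["ABCDEF", (6, 5), (5, 4), (4, 3)]) == "ABCDEFABCDEABCDABC"
```

Note the copy-one-value-at-a-time loop: it lets a pair's length exceed its distance, which is how real LZ77 decoders express "repeat this short pattern many times".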
At this point, we've now motivated an algorithm introduced by computer scientists Abraham Lempel and Jacob Ziv in 1977, called LZ77. More specifically, LZ77 is the combination of the two ideas so far: length-distance pairs over a sliding window. This algorithm is widely used today, and in the next sections, we'll build on top of the ideas introduced by LZ77.
Reducing bit representation
Previously, we reduced the number of values. After reducing the total number of values, we can apply another trick: We can now reduce the number of bits per value. Let's start with an intuition for how this is possible. Say we have the following sequence of values.
ABCDEFGHIJKKKKKKKKKK
We assign A to 0, B to 1, C to 2, and so on. This now gives us a sequence like the following, where K = 10. I've added spaces for clarity.
0123456789 10 10 10 10 10 10 10 10 10 10
Let's now compute how many total digits we needed to represent this sequence; in this case, "10" counts as two digits. Ignoring spaces, we needed 30 total digits.
However, we can make a simple change to reduce the size: Assign the most common letter, K, to a value with a shorter length — namely, any of the single-digit numbers 0 through 9. We can map A to 10, K to 0, B to 1, C to 2, and so on.
10 1234567890000000000
And there we go! The total number of digits we needed is now just 21 digits. That's a roughly 30% reduction in size, just by assigning letters to numbers more judiciously. This is our next intuition: Allocate fewer digits to the most common values.
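This frequency-first assignment is easy to verify in code. Here's a sketch (function name mine) that sorts letters by frequency before handing out digits:

```python
from collections import Counter

# Assign the shortest representations to the most frequent values:
# sort letters by frequency, then hand out "0", "1", ..., "9", "10", ...
def assign_digits(data: str):
    by_freq = [ch for ch, _ in Counter(data).most_common()]
    return {ch: str(i) for i, ch in enumerate(by_freq)}

data = "ABCDEFGHIJ" + "K" * 10
codes = assign_digits(data)
encoded = "".join(codes[ch] for ch in data)
# K, the most frequent letter, gets the single digit "0", and the
# encoded sequence needs 21 digits instead of the naive 30.
assert codes["K"] == "0"
assert len(encoded) == 21
```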
Said another way, we can further compress the data by applying a Huffman coding, which maps every value to a variable number of bits. Just like before, the more frequent values get shorter, unique sequences of bits. We're going to skip over the technical details of how to construct a Huffman coding, because we need time to dig into a more modern and more optimal alternative.
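While we're skipping the details in prose, the standard construction is compact enough to sketch for the curious (function name `huffman_codes` is mine): repeatedly merge the two least frequent subtrees, prepending a bit to every code inside each merged subtree.

```python
import heapq
from collections import Counter

# Build Huffman codes: pop the two least frequent subtrees, merge
# them, and prepend "0"/"1" to the codes on each side of the merge.
def huffman_codes(data: str):
    # Each heap entry: (frequency, tie-breaker, {symbol: code-so-far}).
    heap = [(count, i, {ch: ""}) for i, (ch, count) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, i, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (c1 + c2, i, merged))
    return heap[0][2]

codes = huffman_codes("ABCDEFGHIJ" + "K" * 10)
# As with the digit example, the most frequent symbol gets a code
# no longer than any other symbol's.
assert len(codes["K"]) <= min(len(c) for ch, c in codes.items() if ch != "K")
```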
So far, what we've covered constitutes the compression algorithm known as DEFLATE. DEFLATE at a high level is simply LZ77 followed by Huffman coding, and this combination powers widely used, popular compressors such as zip and PNG. With that said, zip isn't optimal, and there are even more efficient compressors. Let's explore one of these, called arithmetic coding.
Represent data with just a single number
Huffman coding in the last section compresses data quickly but it doesn't do so optimally. Here's how to improve on Huffman coding's compression rates: Let's say we have the following sequence. At a high level, our goal will be to encode this sequence in a single number.
ABA
For our first meta step, our goal is to map this sequence into a range of values, such as (.44, .58). Here's how to do that: In the above sequence, there are 2 As and 1 B — in other words, 66% As and 33% Bs. At every step, we will divide our current range according to this probability distribution. Our starting range is (0,1), so the range (0, .66) corresponds to A and the range (.66, 1) corresponds to B.
Since our first letter is A, we pick the first 66%, making our new range now (0, .66).
We now subdivide this range using the same ratios: the first 66% for A and the last 33% for B.
Since our second letter is B, we pick the last 33%, making our current range now (.44, .66).
We again subdivide up this range using the same ratios: the first 66% for A and the last 33% for B.
Since our last letter is A, we pick the first 66%, making our current range now (.44, .58).
In summary, our sequence ABA now corresponds to the range (.44, .58). For our final step, pick any number in this range to represent the sequence, such as 0.45. In this way, we can now encode any sequence of characters into a single number!
To decode a sequence from a number such as 0.45, we do the reverse.
 Starting from the range (0,1), ask: Does 0.45 fall in the first 66% in (0, .66) or the last 33% in (.66, 1)? We notice that 0.45 falls in range (0, .66), so the first letter is A.
 Now, our range is (0, .66) and we repeat the same process. Ask: Does 0.45 fall in the first 66% in (0, .44) or the last 33% in (.44, .66)? 0.45 falls in the range (.44, .66), so the second letter is B.
 Now, our range is (.44, .66), and we repeat for the last letter. Ask: Does 0.45 fall in the first 66% in (.44, .58) or the last 33% in (.58, .66)? 0.45 falls in the range (.44, .58), so the third letter is A.
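The encode and decode walkthroughs above translate almost line-for-line into code. Here's a float-based sketch for our two-symbol alphabet (P(A) = 2/3, P(B) = 1/3); function names are mine, and note that real arithmetic coders use integer arithmetic to sidestep the precision limits of floats:

```python
# Cumulative probability ranges for each symbol within (0, 1).
PROBS = {"A": (0.0, 2 / 3), "B": (2 / 3, 1.0)}

def arith_encode(sequence: str) -> float:
    low, high = 0.0, 1.0
    for ch in sequence:
        # Narrow the current range to the sub-range for this symbol.
        p_low, p_high = PROBS[ch]
        low, high = low + (high - low) * p_low, low + (high - low) * p_high
    return (low + high) / 2  # any number in the final range works

def arith_decode(number: float, length: int) -> str:
    low, high = 0.0, 1.0
    out = []
    for _ in range(length):
        # Ask which symbol's sub-range the number falls into.
        for ch, (p_low, p_high) in PROBS.items():
            lo = low + (high - low) * p_low
            hi = low + (high - low) * p_high
            if lo <= number < hi:
                out.append(ch)
                low, high = lo, hi
                break
    return "".join(out)

# The worked example from the text: ABA lands in (.44, .58).
assert arith_decode(arith_encode("ABA"), 3) == "ABA"
```

One detail worth noticing: the decoder needs to know the sequence length (or, in practice, a special end-of-stream symbol), since every prefix of the number's range also decodes to a valid shorter sequence.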
Knowing this, we can successfully encode and decode any sequence of characters using this process, an algorithm known as arithmetic coding. However, there is a notable issue with arithmetic coding: it runs relatively slowly, in exchange for better compression rates. Funnily enough, this is the exact opposite of Huffman coding's trade-off — Huffman compresses fairly quickly but with lower compression rates.
Compression in the modern day
A variant called asymmetric numeral systems (ANS) combines the best of both worlds, compressing at speeds comparable to Huffman coding's while achieving compression rates close to arithmetic coding's. To do so, ANS approximates arithmetic coding's probability distribution using natural numbers; this allows us to encode entire sequences in a single natural number and perform computations more efficiently.
A fast-growing compressor today, zstd, combines both LZ77 and a variant of ANS. As a testament — and perhaps a boon — to its popularity, zstd is supported by default on a number of open-source operating systems, including Linux distributions such as Ubuntu. Needless to say, ANS is likewise a far-reaching and widely used compression algorithm.
This is a wrap on lossless compression algorithms, widely used to compress files containing all sorts of data — from text to images to meshes and more. With that said, there are plenty of biases in data to exploit, leading to a whole new category of compression methods: lossy compression for specific data types, such as images: How image compression works.

This calculation assumes we ignore spaces and encode this file using UTF-8, which allocates 1 byte to each of the first 128 codepoints (i.e., the first 128 possible values), per IBM's UTF documentation. This initial set of codepoints includes all English alphanumeric characters, such as A, B and C. ↩

Technically, emojis would take more than 1 byte to represent in UTF-8. We can leave this detail out for now. ↩

UTF-8 for example supports about 1.1 million unique characters, with each character taking anywhere from 1 to 4 bytes. This is at worst 4.5 MB of single-character patterns. Let's say we now consider every consecutive pair of characters, which is just double the size at 9 MB. In short, to search for repeats of length $k$, we're looking at $4.5 \sum_{i=1}^{k} i$ MB of total memory consumption. With $k=5$, we're already looking at $15\times$ more memory than the original file! ↩