How lossless compression works
There are all sorts of compressed files floating around the web. Most of these take on familiar file extensions, such as .zip or .gz. However, how do these compressed formats actually work?
Formalities: Compress vs. Archive, Lossless vs. Lossy
Before we discuss compressors in detail though, we need to distinguish compressors from archivers.
- Compressors reduce the storage size for a single file.
- Archivers reduce the storage size for a group of files.
  - Archivers simply bundle files together, then run a third-party compressor on the bundle. Technically speaking then, archivers don't reduce storage size on their own.
  - For example, .tar files fall in this category, but since tar on its own isn't compressing the data, you'll often find a second file extension after tar, such as .tar.gz, indicating tar was used to archive and gzip was used to compress.
We're going to focus on the first of these: compressors. Within compression algorithms, there are two main types:
- Lossless compression algorithms exactly reconstruct the original data without any loss in quality.
- Lossy compression algorithms sacrifice some fidelity in the reconstruction. For images for example, we may sacrifice reconstruction accuracy if we can compress further while preserving perceived visual quality.
Again, we're going to focus on the first one — lossless compression. Specifically, in this post, we'll discuss how compression algorithms work in popular tools such as zip, then work our way up to more modern alternatives such as zstd.
Remove redundancies
Say you were building a compressor from scratch. Your first intuition would likely be to remove redundancies: at a high level, scan the data for repeating sequences, then replace each occurrence with a shorter "code". This is usually referred to as table-based compression.
Example. Let's run through an example to get a rough idea of table-based compression's effectiveness. Say we have the following sequence of values, which is just ABC repeated 10 times. Ignore the spaces, as I inserted them just for visual clarity.
ABC ABC ABC ABC ABC ABC ABC ABC ABC ABC
This takes 30 bytes to represent¹. Let's now use a dictionary, and replace ABC with the 🔥 emoji.
🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥
🔥 = ABC
This could take about 15 bytes to represent²: 5 bytes for the dictionary mapping 🔥 to ABC and 1 byte for every 🔥. In short, a global dictionary reduced our file size by 2x. There are several inherent problems with this approach, however, which impede both compression and decompression.
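To make this concrete, here's a minimal sketch of table-based substitution in Python. The ABC pattern and 🔥 code are hard-coded from the example above; a real compressor would discover patterns on its own rather than being told what to look for.

```python
def table_compress(data: str, pattern: str, code: str) -> tuple[str, dict[str, str]]:
    """Replace every occurrence of `pattern` with the shorter `code`."""
    table = {code: pattern}  # the dictionary is shipped alongside the compressed data
    return data.replace(pattern, code), table

def table_decompress(data: str, table: dict[str, str]) -> str:
    """Expand every code back into its original pattern."""
    for code, pattern in table.items():
        data = data.replace(code, pattern)
    return data

compressed, table = table_compress("ABC" * 10, "ABC", "🔥")
print(compressed)  # 🔥🔥🔥🔥🔥🔥🔥🔥🔥🔥
assert table_decompress(compressed, table) == "ABC" * 10
```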
Problem: A global approach is slow and memory-hungry. Both compression and decompression would consume prohibitively more resources as files get bigger, which means that, ironically, the more we need a compressor, the less effectively it runs.
- For compression, we need to scan the entire file looking for repeats:
  - We could store a running count of all possible repeating patterns, but this quickly consumes orders of magnitude more memory than the original file size³.
  - Alternatively, scan the file multiple times looking for repeats, meaning the compressor runs really slowly.
- Decompression would likewise run slowly:
  - A global dictionary would need to store all possible patterns, across the entire file. This means one really large dictionary stored somewhere in memory.
  - Due to its large size, the dictionary would at some point exceed the cache size and require expensive DRAM ↔ SRAM swaps. As a result, a global dictionary is intuitive but inefficient.
In short, we either exhaust memory or exhaust time. Neither is desirable.
Insight: Instead, look for repeats in a sliding window of the last several thousand values. This now means we only need to store, build, and use a small, local dictionary:
- Now, compression only considers a running count of repeating patterns in a small, fixed-size window — much more tractable than an ever-growing file.
- Decompression likewise only needs to maintain a fixed-size, smaller dictionary for lookups.
To summarize then, our first compression tool is to remove redundancies in a sliding window.
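As a rough sketch of why the sliding window keeps things tractable, here's a Python snippet that counts fixed-length patterns only within the most recent window. The window size and pattern length below are arbitrary illustrative choices, not values required by any particular format.

```python
from collections import Counter

def window_pattern_counts(data: str, window: int = 32 * 1024, length: int = 3) -> Counter:
    """Count `length`-character patterns seen in the current sliding window only.

    The table can never hold more than `window` entries, no matter how large
    `data` grows -- that bounded memory is the key win over a global table.
    """
    counts: Counter = Counter()
    for i in range(len(data) - length + 1):
        counts[data[i : i + length]] += 1  # pattern entering the window
        expired = i - window               # index of the pattern leaving the window
        if expired >= 0:
            old = data[expired : expired + length]
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
    return counts

print(window_pattern_counts("ABC" * 10).most_common(3))
# [('ABC', 10), ('BCA', 9), ('CAB', 9)]
```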
Encode local repeats
Previously, we compressed a 30-byte sequence of 10 "ABCs" to the following, where we replaced all "ABC"s with 🔥:
🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥 🔥
🔥 = ABC
However, the resulting representation is still wasteful. Instead of storing the emoji once for every occurrence, let's just store the pattern along with the number of times it occurs:
ABC10
This takes about 4 bytes to represent — 3 bytes for the sequence ABC and 1 byte for the count. This is called a "run-length encoding" (RLE), and it compressed the original 30 bytes to just 4 bytes. All in all, run-length encoding reduced our file size by 7x.
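Here's a minimal run-length encoding sketch along these lines. Representing runs as (pattern, count) tuples is my own simplification for illustration; real formats pack the count into a compact binary field.

```python
def rle_encode(data: str, pattern_len: int = 3) -> list[tuple[str, int]]:
    """Collapse consecutive repeats of fixed-length patterns into (pattern, count) pairs."""
    encoded: list[tuple[str, int]] = []
    i = 0
    while i < len(data):
        pattern = data[i : i + pattern_len]
        count = 1
        while data[i + count * pattern_len : i + (count + 1) * pattern_len] == pattern:
            count += 1
        encoded.append((pattern, count))
        i += count * pattern_len
    return encoded

def rle_decode(encoded: list[tuple[str, int]]) -> str:
    return "".join(pattern * count for pattern, count in encoded)

print(rle_encode("ABC" * 10))  # [('ABC', 10)]
assert rle_decode(rle_encode("ABC" * 10)) == "ABC" * 10
```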
Problem: Run-length encodings (RLE) only capture consecutive repeats, but we need a flexible way to represent repeats in different locations. For example, consider the following sequence of values.
ABCDEF
ABCDE
ABCD
ABC
Here, since repeats don't occur back-to-back, RLE has no way to compress the data. As a result, we need a slightly more flexible variant.
Insight: Encode the number of positions to go backwards and the number of positions to read. This is called a length-distance pair. Say we want to encode our second line ABCDE. To do this using a length-distance pair, we can say "go backwards 6 positions and read the next 5 characters".
In this way, we can grab the first 5 letters from "ABCDEF" in the first line, to encode the second line's "ABCDE". Furthermore, using this, we can encode every line above, by going backwards by X positions and reading X-1 letters. Here's what that would look like.
ABCDEF
back 6, read 5
back 5, read 4
back 4, read 3
This compresses our original 18-byte sequence into roughly 12 bytes. As a result, our second compression tool is to encode repeats as a function of what we've decoded so far, without explicitly storing a table and instead using length-distance pairs.
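Here's a tiny decoder for this scheme, reproducing the example above. Representing tokens as either literal strings or (distance, length) tuples is just an illustrative layout.

```python
def decode(tokens: list) -> str:
    """Decode a mix of literal strings and (distance, length) back-references."""
    out: list[str] = []
    for token in tokens:
        if isinstance(token, str):  # literal text: copy it through
            out.extend(token)
        else:                       # go back `distance` positions, read `length` characters
            distance, length = token
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)

# "ABCDEF", then: back 6 read 5, back 5 read 4, back 4 read 3
print(decode(["ABCDEF", (6, 5), (5, 4), (4, 3)]))  # ABCDEFABCDEABCDABC
```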
At this point, we've now motivated an algorithm introduced by a pair of computer scientists, Lempel and Ziv, in 1977, called LZ77. More specifically, LZ77 is the combination of the two ideas so far: length-distance pairs over a sliding window. This algorithm is still widely used today, and in the next sections, we'll build on top of the ideas introduced by LZ77.
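Putting the two ideas together, here's a sketch of a greedy LZ77-style encoder: it searches a sliding window for the longest match and falls back to literals otherwise. The token layout, window size, and brute-force search are simplifications of mine, not the actual LZ77 bit format; real implementations speed up the match search with hash tables.

```python
def lz77_encode(data: str, window: int = 4096, min_match: int = 3) -> list:
    """Emit literal characters and (distance, length) pairs over a sliding window."""
    tokens: list = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):  # only search inside the sliding window
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:               # long enough to be worth a back-reference
            tokens.append((best_dist, best_len))
            i += best_len
        else:                                   # too short: emit a literal character
            tokens.append(data[i])
            i += 1
    return tokens

print(lz77_encode("ABCDEFABCDEABCDABC"))
# ['A', 'B', 'C', 'D', 'E', 'F', (6, 5), (11, 4), (15, 3)]
```

These tokens round-trip through the `decode` sketch above. Note that the encoder picks different (but equally valid) back-references than the hand-worked example, because it always points at the earliest longest match it finds.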
Reducing bit representation
Previously, we reduced the total number of values. We can now apply a second trick: reducing the number of bits per value. Let's start with an intuition for how this is possible. Say we have the following sequence of values.
ABCDEFGHIJKKKKKKKKKK
We assign A to 0, B to 1, C to 2 etc. This now gives us a sequence like the following, where K = 10. I've added spaces for clarity.
0123456789 10 10 10 10 10 10 10 10 10 10
Let's now compute how many total digits we needed to represent this sequence; in this case, "10" counts as two digits. Ignoring spaces, we needed 30 total digits.
However, we can make a simple change to reduce the size: Assign the most common letter K to a value with shorter length — namely, any of the single digit numbers 0 through 9. We can map A to 10, K to 0, B to 1, C to 2 etc.
10 1234567890000000000
And there we go! The total number of digits we needed is now just 21 digits. That's a roughly 30% reduction in size, just by assigning letters to numbers more judiciously. This is our next intuition: Allocate fewer digits to the most common values.
Said another way, we can further compress the data by applying a Huffman coding, which maps every value to a variable number of bits. Just like before, the more frequent values get shorter, unique sequences of bits. We're going to skip over the technical details of how to construct a Huffman coding, because we need time to dig into a more modern and more optimal alternative.
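That said, for the curious, here's a compact sketch of the standard construction: repeatedly merge the two least frequent symbols and prepend a bit to each side. It builds only the code table; a full compressor would also pack the resulting bit strings into bytes.

```python
import heapq
from collections import Counter

def huffman_code(data: str) -> dict[str, str]:
    """Map each symbol to a prefix-free bit string; frequent symbols get shorter ones."""
    counts = Counter(data)
    # Each heap entry: (total weight, tie-breaker, {symbol: code so far}).
    heap = [(count, i, {symbol: ""}) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)              # two least frequent subtrees...
        w2, tie, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))   # ...merged into one subtree
    return heap[0][2]

codes = huffman_code("ABCDEFGHIJ" + "K" * 10)
print(len(codes["K"]), len(codes["A"]))  # 1 5 -- the common K gets a 1-bit code
```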
So far, what we've covered constitutes the compression algorithm known as DEFLATE. DEFLATE at a high level is simply LZ77 followed by Huffman coding, and this combination powers widely used formats such as zip, gzip, and PNG. With that said, DEFLATE isn't optimal, and there are even more efficient approaches. Let's explore one of these, called arithmetic coding.
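DEFLATE happens to be what Python's built-in zlib module implements, so you can try the whole pipeline directly:

```python
import zlib

data = b"ABC" * 10_000
compressed = zlib.compress(data, 9)      # LZ77-style matching followed by Huffman coding
print(len(data), "->", len(compressed))  # the repetitive input shrinks to a tiny fraction of its size
assert zlib.decompress(compressed) == data
```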
Represent data with just a single number
Huffman coding in the last section compresses data quickly but it doesn't do so optimally. Here's how to improve on Huffman coding's compression rates: Let's say we have the following sequence. At a high level, our goal will be to encode this sequence in a single number.
ABA
Our first, larger step is to map this sequence into a range of values, such as (.44, .58). Here's how to do that: In the above sequence, there are 2 As and 1 B, which is to say 66% As and 33% Bs. At every step, we will divide our current range according to this probability distribution. Our starting range is (0, 1), so the range (0, .66) corresponds to A and the range (.66, 1) corresponds to B.
Since our first letter is A, we pick the first 66%, making our new range now (0, .66).
We now sub-divide this range using the same ratios: the first 66% for A and the last 33% for B.
Since our second letter is B, we pick the last 33%, making our current range now (.44, .66).
We again sub-divide up this range using the same ratios: the first 66% for A and the last 33% for B.
Since our last letter is A, we pick the first 66%, making our current range now (.44, .58).
In summary, our sequence ABA now corresponds to the range (.44, .58). For our final step, pick any number in this range to represent the sequence, such as 0.45. In this way, we can now encode any sequence of characters into a single number!
Note that we should pick a number that takes fewer bits to represent. For example, say our range was (0, 1e-6). We could choose about 6.0e-8 (that is, 2^-24), which is exactly representable in half precision and so needs only 16 bits. If we instead picked a value like 1e-7, which half precision cannot represent accurately, we'd be pushed to single precision, which uses 32 bits. As a result, picking the right number will further ensure the maximum compression ratio.
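Here's a sketch of the range-narrowing step, using exact fractions to sidestep floating-point rounding. The 66%/33% table is the one from the example; the exact upper end of the range is 16/27 ≈ 0.59, and the post's .58 comes from rounding 2/3 down to 0.66 at each step.

```python
from fractions import Fraction

def arithmetic_encode(sequence: str, probs: dict[str, Fraction]) -> tuple[Fraction, Fraction]:
    """Narrow the range (0, 1) once per symbol; the final range identifies the whole sequence."""
    low, high = Fraction(0), Fraction(1)
    for symbol in sequence:
        width = high - low
        cumulative = Fraction(0)
        for s, p in probs.items():  # walk the sub-ranges in a fixed symbol order
            if s == symbol:
                high = low + (cumulative + p) * width
                low = low + cumulative * width
                break
            cumulative += p
    return low, high

probs = {"A": Fraction(2, 3), "B": Fraction(1, 3)}
low, high = arithmetic_encode("ABA", probs)
print(float(low), float(high))  # 0.4444... 0.5925..., i.e. roughly (.44, .59)
```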
To decode a sequence from a number such as 0.45, we do the reverse (sketched in code after the steps below).
- Starting from the range (0,1), ask: Does 0.45 fall in the first 66% in (0, .66) or the last 33% in (.66, 1)? We notice that 0.45 falls in range (0, .66), so the first letter is A.
- Now, our range is (0, .66) and we repeat the same process. Ask: Does 0.45 fall in the first 66% in (0, .44) or the last 33% in (.44, .66)? 0.45 falls in the range (.44, .66), so the second letter is B.
- Now, our range is (.44, .66), and we repeat for the last letter. Ask: Does 0.45 fall in the first 66% in (.44, .58) or the last 33% in (.58, .66)? 0.45 falls in the range (.44, .58), so the third letter is A.
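Mirroring those steps, here's the matching decoder sketch. It needs to be told how many symbols to emit; real coders instead reserve an end-of-stream symbol or store the length, a detail omitted here.

```python
from fractions import Fraction

def arithmetic_decode(value: Fraction, probs: dict[str, Fraction], length: int) -> str:
    """Repeatedly ask which sub-range `value` falls into; each answer reveals one symbol."""
    low, high = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(length):
        width = high - low
        cumulative = Fraction(0)
        for symbol, p in probs.items():
            sub_low = low + cumulative * width
            sub_high = low + (cumulative + p) * width
            if sub_low <= value < sub_high:  # the sub-range containing `value` names the symbol
                decoded.append(symbol)
                low, high = sub_low, sub_high
                break
            cumulative += p
    return "".join(decoded)

probs = {"A": Fraction(2, 3), "B": Fraction(1, 3)}  # same table as the encoder sketch
print(arithmetic_decode(Fraction(45, 100), probs, length=3))  # ABA
```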
Knowing this, we can successfully encode and decode any sequence of characters using this process, an algorithm known as arithmetic coding. However, there is a notable issue with arithmetic coding: it runs relatively slowly in exchange for better compression rates. Funnily enough, this is the exact opposite of Huffman coding's trade-off, which compresses fairly quickly but with lower compression rates.
Compression in the modern day
A variant called asymmetric numeral systems (ANS) combines the best of both worlds, compressing at roughly Huffman coding's speed while achieving compression rates close to arithmetic coding's. To do so, ANS approximates arithmetic coding's probability distribution using natural numbers; this allows us to encode entire sequences in a single natural number and perform computations more efficiently.
A fast-growing compressor today, zstd, combines LZ77 with a variant of ANS. As a testament to its popularity (and perhaps a boon to it), zstd is supported by default across the open-source world, including the Linux kernel and distributions such as Ubuntu. Needless to say, ANS is likewise a far-reaching and widely used compression algorithm.
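If you want to try it, the third-party zstandard package (one of several Python bindings, assumed here to be installed via pip install zstandard) exposes the same compress/decompress round trip:

```python
import zstandard  # third-party binding for the zstd library

data = b"ABC" * 10_000
compressed = zstandard.ZstdCompressor(level=3).compress(data)
restored = zstandard.ZstdDecompressor().decompress(compressed)
assert restored == data
print(len(data), "->", len(compressed))  # again, a tiny fraction of the original size
```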
This is a wrap on lossless compression algorithms, which are widely used to compress files containing all sorts of data, from text to images to meshes and more. With that said, there are plenty of biases in data to exploit, leading to a whole new category of compression methods, namely lossy compression for specific data types such as images: How image compression works.
1. This calculation assumes we ignore spaces and encode this file using UTF-8, which allocates 1 byte to each of the first 128 codepoints (e.g., the first 128 possible values), per IBM's UTF documentation. This initial set of codepoints includes all English alphanumeric characters, such as A, B and C. ↩
2. Technically, emojis would take more than 1 byte to represent in UTF-8. We can leave this detail out for now. ↩
3. UTF-8 for example supports about 1.1 million unique characters, with each character taking anywhere from 1 to 4 bytes. This is at worst 4.5 MB of single-character patterns. Let's say we now consider every consecutive pair of characters, which is just double the size at 9 MB. In short, to search for repeats of up to length $k$, we're looking at roughly $4.5 \sum_{i=1}^{k} i$ MB of total memory consumption. With $k=5$, we're already looking at $15\times$ the memory of the single-character table alone! ↩