posted on Jun 2, 2024
How to compress images with language models
By "language model", we specifically mean the latest generation of Large Language Models (LLMs). Regardless of the type of language model though, this should sound like an impossible claim: A model used to generate text can compress images?
In fact, how would you even use a model that generates text to compress anything? To understand this idea, we need to revisit a few ideas from How lossless compression works, as well as Language Intuition for Transformers.
Data compression uses token distributions.
There are a number of different lossless compression methods, as we discussed in How lossless compression works. These methods leverage a wide range of insights: remove character-level redundancies, remove "word"-level redundancies, and reduce bit widths for frequent characters.
Let's focus on this last one — specifically, the variant we introduced called "arithmetic coding". Recall that in arithmetic coding, we:
- Subdivide the range (0, 1) according to the probability of each character. If the data comprises 75% As and 25% Bs, we narrow the range to (0, .75) if the first character is A and to (.75, 1) if the first character is B.
- Let's say the first character is A, so our range is now (0, .75). We subdivide this range again using the same probability distribution. Now, we narrow down the range to (0, .56) if the second character is A and (.56, .75) if the second character is B.
- We continue this until the end of the sequence, iteratively and recursively subdividing the range. Once we've processed every character, we select a real number in the final range to represent our data — ideally, the one that takes the fewest bits to represent. (See the sketch below.)
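To make the subdivision concrete, here's a minimal Python sketch of the range-narrowing loop, using the 75%/25% two-character alphabet from above. The final step of picking a cheap-to-represent number is omitted.

```python
def narrow(data, probs):
    """Narrow (0, 1) one character at a time, using a fixed probability distribution."""
    low, high = 0.0, 1.0
    for char in data:
        width = high - low
        cumulative = 0.0
        for symbol, p in probs.items():
            if symbol == char:
                # Keep only the slice of the current range assigned to this character.
                low, high = low + cumulative * width, low + (cumulative + p) * width
                break
            cumulative += p
    return low, high

print(narrow("AA", {"A": 0.75, "B": 0.25}))  # (0.0, 0.5625), matching the (0, .56) range above
```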
There are two takeaways to note here.
- Arithmetic coding compresses data using a probability distribution. In practice, this probability distribution is simply how frequently each character appears. In theory, however, we could use any probability distribution of our choosing to subdivide ranges.
- In particular, arithmetic coding uses one probability distribution to subdivide every range. However, in theory, we could also change which probability distribution we use, each time we subdivide.
With these takeaways in mind, let's now combine data compression with language models.
Language models produce token distributions.
Recall that a Large Language Model specifically predicts a distribution over tokens: every possible token in its vocabulary is assigned a probability. Notice the natural connection to our previous discussion on data compression: Arithmetic coding uses a probability distribution. An LLM generates a probability distribution.
Naturally, then, let's ask the LLM to provide us with the probability distribution to subdivide a range! Here's how that works in more detail.
- Ask the LLM for the probability of each token. For the first token, we pass in the start-of-sequence `<sos>` token, and the LLM predicts a token distribution for the first token.
- Using this token distribution, divide the range (0, 1) into subranges.
- Then, depending on the first character of our data, we pick a subrange. This subrange becomes our new range. Let's say this new range is (0, .5).
- Now, pass the `<sos>` token and the first character of our data into the LLM. The LLM predicts a token distribution for the second token.
- Again, using this token distribution, subdivide the new range (0, .5) into subranges. Pick a subrange based on the second character, and this process continues.
- Once we have our final range, pick a real number in the range that takes the fewest bits to represent.
As we alluded to above, this new compression method makes two changes: (a) We substituted the frequency-based probability distribution with an LLM's predicted probability distribution. (b) Rather than use a fixed distribution for subdividing all ranges, we iteratively query the LLM for each token's probability distribution¹. In this way, as Deletang et al claim, "Language Modeling is Compression".
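In code, the change is small. Here's a minimal sketch, not the paper's implementation; `next_token_distribution` is a stand-in for an LLM forward pass followed by a softmax, returning a mapping from each token to its probability:

```python
def llm_narrow(tokens, next_token_distribution):
    """Narrow (0, 1) one token at a time, re-querying the model at every step."""
    low, high = 0.0, 1.0
    prefix = []  # starts empty, i.e. just the <sos> token
    for token in tokens:
        probs = next_token_distribution(prefix)  # a different distribution each step
        width = high - low
        cumulative = 0.0
        for candidate, p in probs.items():
            if candidate == token:
                low, high = low + cumulative * width, low + (cumulative + p) * width
                break
            cumulative += p
        prefix.append(token)  # condition the next prediction on everything seen so far
    return low, high
```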
Can language models really compress images?
However, there's a particularly odd result in Deletang's paper, which is mentioned in the abstract: The language model compressor can compress images, even outperforming image-specific compressors like PNG! This claim is thought-provoking, but the baseline is weaker than it should be, in three ways:
- As we discussed in How lossless compression works, PNG compresses using LZ77, which is already known to underperform arithmetic coding. As a result, a tougher — and possibly more appropriate — baseline would be Zstandard, which uses an enhanced variant of arithmetic coding called asymmetric numeral systems.
- Image compressors normally compress, well, whole images. Most algorithms leverage the inherent redundancy in the entire image, whereas Deletang et al compress patch by patch, specifically on 488,821 grayscale patches of size 32x64. Granted, the LLM compressor will similarly be handicapped, so this doesn't make the comparison invalid per se.
- Arithmetic coding also usually uses a fixed distribution for subdividing ranges. How much of the improvement comes simply from adjusting the distribution at every step? To test this, we could enhance arithmetic coding to update the frequency distribution based on the remaining data after encoding each character, as sketched below.
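Concretely, the adaptive baseline recomputes the character distribution from whatever data remains to be encoded, then reuses the same range-narrowing loop from earlier with these per-step probabilities. A minimal sketch of the idea:

```python
from collections import Counter

def remaining_distribution(counts, total):
    """Probability of each character among the not-yet-encoded data."""
    return {symbol: count / total for symbol, count in counts.items()}

data = "ABAA"  # toy example
counts, total = Counter(data), len(data)
for char in data:
    probs = remaining_distribution(counts, total)  # changes after every character
    # ... subdivide the current range using `probs`, exactly as in the earlier sketch ...
    counts[char] -= 1
    total -= 1
```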
To better understand the performance of LLM-based compression, let's run these updated baselines on ImageNet. Most papers, including this one, really mean ILSVRC 2012 specifically when they say "ImageNet". Use Huggingface to download the dataset.
```python
from datasets import load_dataset

dataset = load_dataset("ILSVRC/imagenet-1k", split="validation")
```
On my machine, the download took about 3 hours, and split generation took about 25 minutes. Let's start by getting a sense of scale for the dataset. Here are statistics for the original, untouched dataset. I calculate bytes assuming a raw bitmap storage format that stores RGB as uint8.
Split | # Samples | Size (B) | Size (GB) |
---|---|---|---|
Train | 1,281,167 | 843,986,170,606 | 843.98 |
Validation | 50,000 | 34,347,852,248 | 34.35 |
Test | 100,000 | 69,918,923,730 | 69.92 |
Total | 1,431,167 | 948,252,946,584 | 948.25 |
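For reference, here's roughly how those raw sizes can be computed, assuming every image is stored as RGB with one uint8 per channel, i.e. 3 bytes per pixel:

```python
from datasets import load_dataset

def raw_split_bytes(split="validation"):
    """Total raw bitmap size of a split, at 3 bytes (uint8 R, G, B) per pixel."""
    dataset = load_dataset("ILSVRC/imagenet-1k", split=split)
    return sum(sample['image'].width * sample['image'].height * 3 for sample in dataset)
```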
Simply traversing the entire training set using Huggingface requires 90 minutes², so I only ran on 10% and extrapolated³. The entire Huggingface dataset download took up 166.01 GB on disk, and according to the official ImageNet website, the ILSVRC 2012 dataset took 157.3 GB, meaning that the entire dataset was somehow just 17.5% of its theoretical size!
To understand how this was possible, take a closer look at the Huggingface dataset processor at https://huggingface.co/datasets/ILSVRC/imagenet-1k/blob/main/imagenet-1k.py#L101. It turns out that all of the images are JPEGs to begin with. This makes sense, given they're random images downloaded from the web. However, as we'll see below, none of our lossless compressors are able to fully exploit this redundancy in the data.
Was PNG a strawman? No.
Let's now run a few baseline compressors on the ImageNet dataset. Instead of compressing patch by patch, I'll run the baseline compressors on entire images, as you might normally do for a compressor like PNG or JPEG.
```python
from collections import defaultdict

from tqdm import tqdm

def get_total_bytes(dataset):
    """Sum the raw and compressed sizes of every image, per method."""
    results = defaultdict(int)
    with tqdm(dataset) as pbar:
        for sample in pbar:
            raw = sample['image'].tobytes(encoder_name='raw')
            results['raw'] += len(raw)
            results['zstd'] += len(compress(raw))        # default compression level
            results['zstd22'] += len(compress(raw, 22))  # maximum compression level
            results['png'] += len(png(sample['image']))
    return results
```
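For completeness, here are the two helpers the loop above assumes. `compress` wraps Zstandard's one-shot API from the python `zstandard` package, and `png` is assumed to be a small helper that re-encodes a PIL image as PNG in memory, something like the following:

```python
import io

import zstandard

def compress(data, level=3):
    """One-shot Zstandard compression at the given level."""
    return zstandard.compress(data, level)

def png(image):
    """Re-encode a PIL image as PNG in memory and return the encoded bytes."""
    buffer = io.BytesIO()
    image.save(buffer, format='PNG')
    return buffer.getvalue()
```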
On just the ImageNet validation dataset — the smallest of the three splits — here are the results for our baseline compressors.
Method | Size (B) | Size (GB) | Rate ↓ | Runtime (m) |
---|---|---|---|---|
Raw | 34,347,852,248 | 34.35 | 100% | 5 |
Zstd | 25,882,205,889 | 25.88 | 75.35% | 55 |
Zstd (lvl=22) | 22,526,825,310 | 22.53 | 65.58% | 201 |
PNG | 19,101,425,345 | 19.10 | 55.61% | 55 |
This answers our first question: Was PNG a strawman? ❌ No. It appears that the image-specific PNG outperforms the generalized Zstandard. Recall also that the JPEG-encoded originals take up just 17.5% of their raw size, so we're nowhere near that mark either.
Do patches adversely affect results? Yes.
Since these compressors take a significant amount of time, let's cut down our sample size to speed up experimentation. As a start, let's rerun the above on just 1 GB of data, and make sure that the rank ordering of the different techniques stays the same.
Method | Size (B) | Size (GB) | Rate ↓ |
---|---|---|---|
Raw | 1,028,238,681 | 1.03 | 100% |
Zstd | 768,325,046 | 0.77 | 74.72% |
Zstd (lvl=22) | 670,382,806 | 0.67 | 65.20% |
PNG | 570,829,789 | 0.57 | 55.52% |
Luckily, the above took only about 6 minutes 30 seconds to run. What's more, the compression rates are very similar to the results above, and this subset contains roughly 1 GB of image data, just like the paper's ImageNet subset. This suggests we can run the remainder of our experiments on just this subset of 1,500 samples.
Next, we want to understand whether patches change the rank ordering of the different methods. Let's see if our classical algorithms perform appreciably better or worse on patches of 32x64. If significantly worse, then the baselines may be unfairly disadvantaged in the comparison.
```python
import numpy as np

def chunk(image, height=-1, width=-1, grayscale=True):
    """Yield height x width patches of a PIL image (or the whole image if no patch size is given)."""
    if height == -1 and width == -1:
        yield image.tobytes(encoder_name='raw')
        return
    image = np.array(image)
    # Drop any trailing rows/columns that don't fill a complete patch.
    for i in range(0, image.shape[0] // height * height, height):
        for j in range(0, image.shape[1] // width * width, width):
            patch = image[i:i+height, j:j+width]
            yield patch
```
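Plugging this into the earlier loop might look like the following sketch, compressing each patch independently and summing the sizes as before. PNG needs each patch converted back into a PIL image first.

```python
from PIL import Image

def get_total_patch_bytes(dataset, height=32, width=64, grayscale=False):
    """Like get_total_bytes, but compressing each patch independently."""
    results = defaultdict(int)
    for sample in tqdm(dataset):
        for patch in chunk(sample['image'], height, width, grayscale):
            raw = patch.tobytes()  # numpy arrays expose their raw bytes directly
            results['raw'] += len(raw)
            results['zstd'] += len(compress(raw))
            results['zstd22'] += len(compress(raw, 22))
            results['png'] += len(png(Image.fromarray(patch)))
    return results
```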
Here are the previous results, rerun on patches of 32x64.
Method | Size (B) | Size (GB) | Rate ↓ |
---|---|---|---|
Raw | 910,239,744⁴ | 0.91 | 100% |
Zstd | 654,083,520 | 0.65 | 71.86% |
Zstd (lvl=22) | 615,047,534 | 0.62 | 67.57% |
PNG | 513,497,258 | 0.51 | 56.41% |
Interestingly, this doesn't match the paper's reported PNG compression rate of 61.70%. One significant difference is that my patches are still in RGB, whereas the paper uses grayscale patches. Let's update our script one more time to use grayscale patches.
```python
if grayscale:  # convert the RGB patch to grayscale just before `yield patch`
    patch = np.array(Image.fromarray(patch).convert('L'))
```
Now this almost exactly matches the numbers from the paper. Given that, I would say that it's appropriate to copy over values from the paper's Table 1 — these copy-pasted values are italicized.
Method | Size (B) | Size (GB) | Rate ↓ |
---|---|---|---|
Raw | 307,693,568 | 0.31 | 100% |
Zstd | 224,549,594 | 0.22 | 72.98% |
Zstd (lvl=22) | 220,219,526 | 0.22 | 71.51% |
PNG | 188,444,333 | 0.19 | 61.24% |
*PNG* | — | — | *61.70%* |
*Llama2 7B* | — | — | *53.40%* |
*Chinchilla 70B* | — | — | *48.00%* |
All this to say: Do patches adversely affect compression rates? ❌ No. Does grayscale adversely affect compression rates? ✅ Yes, but not by enough to explain away the LLMs' advantage as compressors. I think it's safe to say that, try as I may, the paper's conclusions still stand. I had thought that Zstandard would come out on top for sure. I stand corrected.
Takeaways
One caveat: the function generating these distributions — Chinchilla 70B, for the paper's best result — takes up 140+ GB, as opposed to the single distribution that arithmetic coding normally stores. Granted, the function used to generate distributions at each step is shared globally, across all types of data. As a result, this overhead could arguably be overlooked.
The more alarming issue is how slow compression and decompression would be: Both would require LLM inference. With that said, the paper isn't promising a practical alternative to PNG. It's proving a salient point, and one that I couldn't challenge successfully: LLMs really do compress better than even domain-specific compressors. How crazy. I wouldn't have believed it.
¹ It's worth noting that you don't have to actually query the LLM iteratively. Autoregressive generation isn't necessary, because we're not generating text; we have a fixed piece of data. As a result, we can instead pass a maximum-context-length chunk of data into the LLM and get a batched set of logits out. This is exactly how perplexity evaluation loops work.
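For example, a sketch of extracting every position's token distribution in a single forward pass with Hugging Face Transformers (the model name here is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("the fixed data to compress", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits   # shape: (1, sequence_length, vocab_size)
probs = torch.softmax(logits, dim=-1)  # one next-token distribution per position
```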
² I suspect this is because the images are being loaded from disk lazily. The dataset doesn't fit in RAM, and it most certainly isn't stored file-by-file on disk.
³ 100,000 samples of the ImageNet training set required 65,876,358,867 B, or 65.88 GB, to store. Extrapolating linearly to 1,281,167 samples gives 843,986,170,606 B, or 843.98 GB.
⁴ For any image whose height wasn't perfectly divisible by 32 and/or whose width wasn't perfectly divisible by 64, I simply chopped off the extra pixels, which is why the total number of raw bytes changes in the table.