from Guide to Machine Learning on Feb 17, 2024

How image compression works

The vast majority of websites use image formats such as PNG, JPEG, and SVG[1]. However, these formats don't store the "original" image per se. This raises the question: What format does store the "original" image? To answer this question, we have to consider two definitions of "original".

  1. Raw camera sensor data. There are a number of proprietary RAW formats that directly save the camera sensor data, such as Canon's CRW or Nikon's NEF. RAW formats generally cover a gamut wider than sRGB, so for transmission, your DSLR (a.k.a., fancy camera) will usually post-process into a more common format like JPEG or PNG.
  2. sRGB. This is the typical 3-channel RGB input with 256 possible values per channel. There are a few formats that can store the original sRGB naively, bit for bit, such as TIFF and BMP. Since these formats are far more space-consuming (the quick size calculation below shows why), the vast majority of images today are again post-processed into more common formats.
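
To get a feel for just how space-consuming, here's a quick back-of-the-envelope calculation. Storing sRGB bit for bit costs 3 bytes per pixel; the 12-megapixel resolution below is just an illustrative assumption.

    # Raw sRGB storage cost: one byte per R, G, B channel.
    width, height = 4000, 3000           # a hypothetical 12-megapixel photo
    raw_bytes = width * height * 3
    print(f"{raw_bytes / 1e6:.0f} MB")   # -> 36 MB, before any compression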

We'll skip over the technical details of both RAW and uncompressed images, leaving most of that discussion to my previous blog post: Are there really only 17 million colors?

Given that neither RAW nor uncompressed images are really used in practice on the web, we'll focus on more popular image formats, specifically PNG and JPEG, for representing general images. These two image formats are examples of lossless and lossy compression, respectively.

  1. PNG compresses images losslessly, meaning that the original image is perfectly preserved. We discussed PNG's underlying algorithm DEFLATE in How lossless compression works. You'll notice that the ideas in DEFLATE aren't particularly specific to images. In the general case for arbitrary data, lossless compression algorithms are fairly close to optimal — there's a lot less low-hanging fruit to improve on.
  2. JPEG compresses lossily, meaning that the original image is not perfectly preserved. Instead, the compression algorithm leverages the fact that not all changes to the image are perceptible, selectively dropping "information" that results in imperceptible changes. This clever bias in the compression algorithm is what we're interested in (the sketch after this list demonstrates the contrast).
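
To make the lossless/lossy distinction concrete, here's a minimal round-trip sketch in Python. It assumes Pillow and NumPy are installed; the random test image and the quality setting are arbitrary choices for illustration, not anything JPEG-specific.

    import io
    import numpy as np
    from PIL import Image

    original = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
    img = Image.fromarray(original)

    def roundtrip(image, fmt, **kwargs):
        # Encode to the given format in memory, then decode back to pixels.
        buf = io.BytesIO()
        image.save(buf, format=fmt, **kwargs)
        buf.seek(0)
        return np.asarray(Image.open(buf))

    print((roundtrip(img, "PNG") == original).all())               # True: lossless
    print((roundtrip(img, "JPEG", quality=90) == original).all())  # False: lossy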

We'll discuss the particulars of lossy image compression algorithms like JPEG in this post, broken up into three main sections: What defines visual quality? What biases in perception can we exploit to reduce size at the same visual quality? How does JPEG leverage this bias?

What defines visual quality?

"Visual quality" is quite difficult to quantify, as you might have guessed. As a rough proxy for quality, many metrics gauge how similar two images are — for example, comparing a compressed image with the original, uncompressed image. Here are a few of the most common ones:

With that said, a large number of decisions in image compression are actually driven by manual, visual inspection. The same goes for most fields related to the visual domain — computer vision, graphics, photogrammetry and more. So in summary, visual quality can be approximated by quantifying similarity with a reference image — but manual inspection with the human eye is still the gold standard.
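
For a sense of how simple the first two of those metrics are, here's a sketch in NumPy; `a` and `b` are assumed to be two same-shape uint8 images, such as an original and its compressed counterpart.

    import numpy as np

    def mse(a, b):
        # Mean squared error: average squared difference per pixel.
        return np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)

    def psnr(a, b, max_val=255.0):
        # Peak signal-to-noise ratio in decibels; higher means more similar.
        err = mse(a, b)
        return float("inf") if err == 0 else 10 * np.log10(max_val**2 / err)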

What biases in human vision can we leverage?

There are a number of visual biases that vision scientists have discovered over the years. Some of the most prominent biases pertain to a color's context; here are a few of the consequences of that:

Color is not absolute. Take a look at these two images, each showing a white circle on a gray background.

In these examples, we can perceive the edge between the circle and the background pretty easily; if this edge were poorly reconstructed, we would notice it immediately. However, what's less obvious to us is the subtle change in color overall. Here are the same two images below.

Placed side by side, you can now see a subtle difference: the right image's "white" circle isn't white at all! In fact, it's just an incredibly light shade of gray. In general, our perception of color is relative to other colors in the same image.
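
You can reproduce this experiment yourself. The sketch below, assuming NumPy and Pillow, draws a truly white circle in one image and an "almost white" circle in the other; the exact intensity values are arbitrary choices for illustration.

    import numpy as np
    from PIL import Image

    def circle_on_gray(circle_value, size=256, radius=60, background=128):
        # A gray background with a filled circle of the given intensity.
        img = np.full((size, size), background, dtype=np.uint8)
        yy, xx = np.ogrid[:size, :size]
        mask = (yy - size // 2) ** 2 + (xx - size // 2) ** 2 <= radius**2
        img[mask] = circle_value
        return Image.fromarray(img)

    circle_on_gray(255).save("white_circle.png")         # actually white
    circle_on_gray(250).save("almost_white_circle.png")  # very light gray

Viewed alone, the two files are hard to tell apart; side by side, the difference shows.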

Context affects color. Let's try another visual experiment. Which of the following two circles is brighter? We have a gray circle situated against a white background and a gray circle situated against a black background.

As you might expect, this is a trick question. These circles are exactly the same shade of gray, but the dark background makes the same gray appear brighter. Here's the same image joined at the middle, to illustrate the point.

There are also a number of other downstream effects, resulting from this idea that color is relative. Let's now see how to use these perceptual biases.

How do we exploit the relative-ness of color perception?

To exploit the fact that color is relative, we have to first understand how an image is represented in frequency space. In particular, an image can be decomposed into two distinct sets of frequencies: low frequencies capture the overall structure, while high frequencies capture the fine details.
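
One way to build intuition is with the 2D Discrete Cosine Transform, which JPEG itself uses, as we'll see shortly. This sketch, assuming SciPy and NumPy, transforms a grayscale block into frequency space, zeroes out everything but the lowest frequencies, and transforms back; the `keep` parameter is an arbitrary illustrative choice.

    import numpy as np
    from scipy.fft import dctn, idctn

    def drop_high_frequencies(block, keep=4):
        # Forward 2D DCT: low frequencies land in the top-left corner.
        coeffs = dctn(block.astype(np.float64), norm="ortho")
        # Keep only the top-left `keep` x `keep` low-frequency coefficients.
        mask = np.zeros_like(coeffs)
        mask[:keep, :keep] = 1
        # Inverse DCT reconstructs the block from the surviving frequencies.
        return idctn(coeffs * mask, norm="ortho")

Running this on any image patch yields a blurrier version of that patch: the coarse structure survives, but the fine details are gone.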

To exploit the fact that color is relative, JPEG applies a clever idea: store high-frequency information at lower precision. Let's talk about how JPEG achieves this.

  1. At a high level, JPEG compresses every 8x8 block in the image independently. These blocks are called Minimum Coded Unit (MCU) blocks.
  2. Each block is first converted into frequency space using the Discrete Cosine Transform (DCT). Computationally, this is just a pre- and post-multiplication by a fixed matrix.

    1. The result is an 8x8 matrix where the top-left values represent coefficients of low frequencies.
    2. The bottom-right values represent coefficients of high frequencies.
  3. Element-wise divide by an 8x8 table of hard-coded values. Then, round to the nearest integer (steps 2 and 3 are sketched in code after this list).

    1. This table has much larger values in the bottom right, meaning the bottom-right coefficients are quantized more heavily. Given the bottom-right values also correspond to high frequencies, we can equivalently say that the high frequencies are quantized heavily.
    2. This table has much smaller values in the top left, meaning the top-left coefficients are quantized less aggressively. Given the top-left values also correspond to low frequencies, we can equivalently say that the low frequencies are quantized lightly.
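
Here are steps 2 and 3 in code. This is a sketch, not bit-exact JPEG: the orthonormal DCT scaling below differs slightly from the scaling the spec pairs with its tables. The table itself, though, is the standard JPEG luminance quantization table; note how the divisors grow toward the bottom right.

    import numpy as np
    from scipy.fft import dctn

    # Standard JPEG luminance quantization table: small divisors in the
    # top left (low frequencies), large divisors toward the bottom right
    # (high frequencies).
    Q = np.array([
        [16, 11, 10, 16,  24,  40,  51,  61],
        [12, 12, 14, 19,  26,  58,  60,  55],
        [14, 13, 16, 24,  40,  57,  69,  56],
        [14, 17, 22, 29,  51,  87,  80,  62],
        [18, 22, 37, 56,  68, 109, 103,  77],
        [24, 35, 55, 64,  81, 104, 113,  92],
        [49, 64, 78, 87, 103, 121, 120, 101],
        [72, 92, 95, 98, 112, 100, 103,  99],
    ])

    def quantize_block(block):
        # Step 2: convert the 8x8 block to frequency space (JPEG also
        # centers pixel values around zero first).
        coeffs = dctn(block.astype(np.float64) - 128, norm="ortho")
        # Step 3: element-wise divide by the table and round. Most
        # high-frequency coefficients round to zero, which is exactly
        # what makes the block compress well downstream.
        return np.round(coeffs / Q).astype(np.int32)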

All in all, high frequency information is quantized aggressively, to leverage the fact that we can't perceive subtle changes to fine details anyway. To summarize the above three steps, we can say that JPEG quantizes high frequencies.

Takeaways

In summary, our most popular image formats exploit several human vision biases — which can be summarized with the mantra, "color is relative". In particular, JPEG quantizes high frequency information aggressively, resulting in changes that are less perceptible to the human eye.

With that said, this is just one way to leverage one vision bias. This observation raises the question: How else can we leverage the fact that color is relative? More broadly, what other biases in human vision can we leverage?

Alternatively, we can also narrow our focus: Are there more opportunities for specific types of images, such as grainy nighttime photos? What about highly correlated sets of photos, like frames in a video? Along these lines, let's see How video compression works next.





  1. Usage statistics can be confirmed by W3Techs' automated survey of websites, which shows, as of the time of writing, that 82% of websites use PNG, 77% use JPEG, 56% use SVG, and 21% use GIF. Notice the total exceeds 100%, because each website can use multiple image formats.