from Guide to Machine Learning on Feb 24, 2024

How video compression works

Video compression involves a myriad of different ideas that span the posts we've covered so far, from image representation to compression and more. Before diving into the details, we need to distinguish between a few different categories of terms.

What's a codec vs. a format?

You've certainly heard of and used .mp4 videos, but what about HEVC, H.264, or AAC? You've likely seen one of these formats in a video editor, such as DaVinci Resolve or Adobe Premiere. Let's break these down. Broadly, a format specifies how data is laid out in a file, while a codec (short for coder-decoder) is the software that encodes data into and decodes data out of that format.

To connect the two ideas, you would say that your image viewer needs to support the PNG codec in order to look at images stored in PNG format. To differentiate between the two ideas, note that you can have many codecs for a single format. For example, we could define a new PNG codec that uses LZ78 — as long as our codec still conforms to the PNG format, we are guaranteed that any PNG image viewer still works on our image.
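
As a concrete illustration of the format side of this split, here's a small Python sketch that identifies a PNG purely by its on-disk layout (the 8 signature bytes mentioned in footnote 1), regardless of which codec produced the file:

```python
# A format is identified by its on-disk layout, independent of which codec
# wrote it. PNG files always begin with the same 8 signature bytes, no
# matter what encoder produced them.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    """Check the first 8 bytes against the PNG signature."""
    return data[:8] == PNG_SIGNATURE

# Any conforming encoder produces the same signature:
print(looks_like_png(PNG_SIGNATURE + b"...rest of file..."))  # True
print(looks_like_png(b"\xff\xd8\xff\xe0 JPEG data"))          # False
```

This is exactly why swapping in a different codec implementation is invisible to viewers: they check the format, not the software that wrote it.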

Now that we've disambiguated codecs from formats, we also need to introduce the idea of a container format. An .mp4 is one such "container format". Namely, an mp4 holds not just video but also audio, subtitles, and metadata. In this way, the mp4 format — and any container format — is really a collection of other formats. Most software supports mp4s that use the H.264 codec for video and the AAC codec for audio. Given this post's focus on video, we'll focus on how the H.264 codec operates.
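
To make "container" concrete: an mp4 is laid out as a sequence of typed boxes (the ISO base media file format). Below is a minimal Python sketch that walks the top-level boxes, assuming each box starts with a 4-byte big-endian length and a 4-byte ASCII type; it ignores the extended-size cases a real parser must handle.

```python
import struct

def top_level_boxes(data: bytes):
    """Yield (type, payload) for each top-level box in an ISO BMFF stream.

    Simplified sketch: ignores 64-bit extended sizes and size-0
    ("to end of file") boxes that real mp4 parsers must handle.
    """
    offset = 0
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)   # 4-byte big-endian length
        box_type = data[offset + 4 : offset + 8].decode("ascii")
        yield box_type, data[offset + 8 : offset + size]
        offset += size

# A toy "file" with two boxes: a file-type box and a media-data box.
fake_mp4 = (
    struct.pack(">I", 12) + b"ftyp" + b"mp42"      # 12-byte ftyp box
    + struct.pack(">I", 16) + b"mdat" + b"A" * 8   # 16-byte mdat box
)
print([t for t, _ in top_level_boxes(fake_mp4)])  # ['ftyp', 'mdat']
```

In a real mp4, boxes like `moov` and `mdat` in turn contain the per-stream video, audio, and metadata tracks, which is what makes the container "a collection of other formats".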

Define different types of frames

The core insight is to encode frames in terms of other frames. Granted, not all frames can be encoded this way. For example, say a video cuts between two scenes. The first frame of the second scene can't (efficiently) be defined in terms of the previous frame. To disambiguate between frames that are highly correlated with the previous frames and those that aren't, we define several "frame types"5:

  1. i-frames, which are encoded independently, without referencing any other frame.
  2. p-frames ("predicted"), which are encoded relative to an earlier frame.
  3. b-frames ("bidirectional"), which are encoded relative to both earlier and later frames.

To recap, JPEG compresses images by quantizing their high-frequency content, which we describe in more detail in How image compression works. Given our thorough coverage there, we now know exactly how i-frames are compressed: just like JPEG images. Because of their independence, i-frames are called intraframes.

Let's now move on to discuss how p-frames and b-frames work. Since these frames rely on other frames, they are called interframes. To be specific, let's see how these interframes are "defined relative" to other frames.

Leverage spatiotemporal redundancy

To start, video compression uses blocks of pixels called macroblocks. These are analogous to the MCUs that we described for JPEG — they're simply a group of pixels that are encoded together, as one unit. Each macroblock in an interframe is defined in terms of:

  1. A motion vector pointing to a reference macroblock. This wording is key: The motion vector points to where a macroblock comes from — not where it goes to. The process of estimating a motion vector is called motion estimation.
  2. And a residual, which is the difference between the reference and the current macroblock3 after motion estimation. The process of calculating the residual is called motion compensation, or more specifically, block motion compensation (BMC). This collection of residual macroblocks then makes up the residual frame, which is compressed using JPEG4. Since the residuals usually contain very little information for highly correlated frames, the residual frame is highly compressible, making interframes very space-efficient.
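
The two steps above can be sketched in Python. This toy version uses exhaustive search over a small window with sum-of-absolute-differences (SAD) as the matching cost; real encoders use much faster search strategies.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def get_block(frame, y, x, size):
    return [row[x:x + size] for row in frame[y:y + size]]

def motion_estimate(ref, cur, y, x, size, search=2):
    """Find the motion vector (dy, dx) minimizing SAD within a small window.

    The vector points *from* the current block *to* where it came from
    in the reference frame.
    """
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if 0 <= ry <= len(ref) - size and 0 <= rx <= len(ref[0]) - size:
                cost = sad(get_block(ref, ry, rx, size),
                           get_block(cur, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    return best[1]

def residual(ref, cur, y, x, mv, size):
    """Motion compensation: subtract the motion-shifted reference block."""
    ref_block = get_block(ref, y + mv[0], x + mv[1], size)
    cur_block = get_block(cur, y, x, size)
    return [[c - r for c, r in zip(cr, rr)]
            for cr, rr in zip(cur_block, ref_block)]

# Toy frames: a bright 2x2 patch moves down-right by one pixel.
ref = [[0] * 4 for _ in range(4)]
cur = [[0] * 4 for _ in range(4)]
for yy in (0, 1):
    for xx in (0, 1):
        ref[yy][xx] = 9
        cur[yy + 1][xx + 1] = 9

mv = motion_estimate(ref, cur, y=1, x=1, size=2)
print(mv)                                     # (-1, -1): the block came from up-left
print(residual(ref, cur, 1, 1, mv, size=2))   # all zeros: a perfect match
```

Note how the motion vector points backwards to where the block came from, and a perfect match leaves an all-zero residual, which is why residual frames compress so well.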

In this way, interframes can leverage spatiotemporal redundancies between frames very effectively. In turn, this does mean that intraframes (i.e., i-frames) are relatively expensive to encode. Despite that, there are still a number of reasons to include more i-frames in a video: they give players points to seek to, and they prevent decoding errors from propagating indefinitely from frame to frame.
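
One simple placement policy can be sketched in Python. This is an illustration, not H.264's actual rate-control logic: start a new i-frame on a fixed interval, so players have points to seek to, or at a scene cut, where prediction fails anyway.

```python
def assign_frame_types(scene_cuts, n_frames, gop_size=12):
    """Assign 'I' or 'P' to each frame index (a simplified policy sketch).

    A new i-frame starts a fresh group of pictures (GOP) either on a
    fixed interval (so players can seek) or at a scene cut (where
    prediction fails anyway). b-frames are omitted for simplicity.
    """
    types = []
    since_last_i = gop_size  # force frame 0 to be an i-frame
    for i in range(n_frames):
        if since_last_i >= gop_size or i in scene_cuts:
            types.append("I")
            since_last_i = 0
        else:
            types.append("P")
            since_last_i += 1
    return types

# A 10-frame clip with a scene cut at frame 5 and a GOP of 4:
print("".join(assign_frame_types(scene_cuts={5}, n_frames=10, gop_size=4)))
# IPPPPIPPPP
```

A smaller GOP means more expensive i-frames but finer-grained seeking and faster recovery from corruption; real encoders tune this trade-off.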


H.264 defines a large number of other features that can further reduce the bitrate of compressed video. These include general improvements, such as better loss resilience, improved entropy coding, customizable quantization, and more flexible intra- and inter-frame dependencies. They also include specific improvements, such as an integer approximation of the Discrete Cosine Transform (DCT) that is faster to compute, and intraframe prediction that spatially extends neighboring macroblocks to exploit redundancy within a frame.
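
The 4x4 integer core transform is worth a closer look. Its matrix contains only small integers, so it can be computed exactly with integer adds, subtracts, and shifts, with the DCT's scaling factors folded into the quantization step instead. A sketch in Python:

```python
# H.264's 4x4 forward core transform: an integer approximation of the DCT.
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def integer_transform(block):
    """Y = C X C^T, exact in integer arithmetic (no floating point)."""
    return matmul(matmul(C, block), transpose(C))

# A flat block has all its energy in the DC (top-left) coefficient:
flat = [[5] * 4 for _ in range(4)]
print(integer_transform(flat)[0][0])  # 80
```

As with the real DCT, a uniform block concentrates all of its energy in the single DC coefficient, which is what makes the subsequent quantization and entropy coding effective.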

There are too many details to include in this post, so we've covered just the key ideas and major insights above. With that, we've now largely wrapped up the chain of ideas that ultimately led to the video compression format that we know and love today — the mp4.


  1. According to the PNG "File Header" specification, the first 8 bytes of a PNG file identify it as such. 

  2. There's a slight caveat here, which is that p-frames and b-frames could technically reference frames before the last i-frame in decode order. This is distinct from presentation order. Knowing this, there is actually another kind of i-frame, called an "IDR frame", which tells the video decoder that all previously-decoded frames can be discarded before continuing. For a more detailed explanation of how IDR frames differ from i-frames, see this post.

  3. In H.264, this is slightly more complicated. A macroblock can actually be split into smaller, possibly non-square, partitions of pixels. Each partition then contains its own motion vector and residual. 

  4. Note that the difference between two frames could contain negative values. To handle this, H.264 actually takes the xor between the reference and current macroblocks. 

  5. Frame types can also be called "picture types". "Picture" is a broader term, which encompasses both frames and "fields". Each image is composed of two fields: the odd-numbered rows and the even-numbered rows. Splitting an image into two fields like this is what enables interlaced video, where the two fields are transmitted and displayed in alternation.