from Guide to Machine Learning on Feb 24, 2024

How video compression works

Video compression involves a myriad of different ideas that span the posts we've covered so far, from image representation to compression and more. Before diving into the details, we need to distinguish between a few different categories of terms.

What's a codec vs. a format?

You've certainly heard of and used .mp4 videos, but what about HEVC, H.264, or AAC? You've likely seen one of these formats in a video editor, such as DaVinci Resolve or Adobe Premiere. Let's break these down. Broadly, a format specifies how data is laid out in a file, while a codec (short for coder-decoder) is the software that encodes data into and decodes data out of that format.

To connect the two ideas, you would say that your image viewer needs to support the PNG codec in order to look at images stored in PNG format. To differentiate between the two ideas, note that you can have many codecs for a single format. For example, we could define a new PNG codec that uses LZ78 — as long as our codec still conforms to the PNG format, we are guaranteed that any PNG image viewer still works on our image.
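
As a concrete illustration of the format side of this split, here's a small Python sketch that identifies a PNG purely by its on-disk layout (the 8 signature bytes mentioned in footnote 1), regardless of which codec produced the file:

```python
# A format is identified by its on-disk layout, independent of which codec
# wrote it. PNG files always begin with the same 8 signature bytes, no
# matter what encoder produced them.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def looks_like_png(data: bytes) -> bool:
    """Check the first 8 bytes against the PNG signature."""
    return data[:8] == PNG_SIGNATURE

# Any conforming encoder produces the same signature:
print(looks_like_png(PNG_SIGNATURE + b"...rest of file..."))  # True
print(looks_like_png(b"\xff\xd8\xff\xe0 JPEG data"))          # False
```

This is exactly why swapping in a different codec implementation is invisible to viewers: they check the format, not the software that wrote it.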

Now that we've disambiguated codecs from formats, we also need to introduce the idea of a container format. An .mp4 is one such "container format". Namely, an mp4 holds not just video but also audio, subtitles, and metadata. In this way, the mp4 format — and any container format — is really a collection of other formats. Most software supports mp4s that use the H.264 codec for video and the AAC codec for audio. Given this post's focus on video, we'll focus on how the H.264 codec operates.
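
To make "container" concrete: an mp4 is laid out as a sequence of typed boxes (the ISO base media file format). Below is a minimal Python sketch that walks the top-level boxes, assuming each box starts with a 4-byte big-endian length and a 4-byte ASCII type; it ignores the extended-size cases a real parser must handle.

```python
import struct

def top_level_boxes(data: bytes):
    """Yield (type, payload) for each top-level box in an ISO BMFF stream.

    Simplified sketch: ignores 64-bit extended sizes and size-0
    ("to end of file") boxes that real mp4 parsers must handle.
    """
    offset = 0
    while offset + 8 <= len(data):
        size, = struct.unpack_from(">I", data, offset)   # 4-byte big-endian length
        box_type = data[offset + 4 : offset + 8].decode("ascii")
        yield box_type, data[offset + 8 : offset + size]
        offset += size

# A toy "file" with two boxes: a file-type box and a media-data box.
fake_mp4 = (
    struct.pack(">I", 12) + b"ftyp" + b"mp42"      # 12-byte ftyp box
    + struct.pack(">I", 16) + b"mdat" + b"A" * 8   # 16-byte mdat box
)
print([t for t, _ in top_level_boxes(fake_mp4)])  # ['ftyp', 'mdat']
```

In a real mp4, boxes like `moov` and `mdat` in turn contain the per-stream video, audio, and metadata tracks, which is what makes the container "a collection of other formats".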

Define different types of frames

The core insight is to encode frames in terms of other frames. Granted, not all frames can be encoded this way. For example, say a video cuts between two scenes. The first frame of the second scene can't (efficiently) be defined in terms of the previous frame. To disambiguate between frames that are highly correlated with the previous frames and those that aren't, we define several "frame types"5:

  1. i-frames, which are encoded independently, without referencing any other frame.
  2. p-frames ("predicted"), which are encoded relative to an earlier frame.
  3. b-frames ("bidirectional"), which are encoded relative to both earlier and later frames.

To recap, JPEG compresses images by quantizing their high-frequency content, which we describe in more detail in How image compression works. Given our thorough coverage there, we now know exactly how i-frames are compressed: just like JPEG images. Because of their independence, i-frames are called intraframes.

Let's now move on to discuss how p-frames and b-frames work. Since these frames rely on other frames, they are called interframes. To be specific, let's see how these interframes are "defined relative" to other frames.

Leverage spatiotemporal redundancy

To start, video compression uses blocks of pixels called macroblocks. These are analogous to the MCUs that we described for JPEG — they're simply a group of pixels that are encoded together, as one unit. Each macroblock in an interframe is defined in terms of:

  1. A motion vector pointing to a reference macroblock. This wording is key: The motion vector points to where a macroblock comes from — not where it goes to. The process of estimating a motion vector is called motion estimation.
  2. And a residual, which is the difference between the reference and the current macroblock3 after motion estimation. The process of calculating the residual is called motion compensation, or more specifically, block motion compensation (BMC). This collection of residual macroblocks then makes up the residual frame, which is compressed using JPEG4. Since the residuals usually contain very little information for highly correlated frames, the residual frame is highly compressible, making interframes very space-efficient.
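
The two steps above can be sketched in Python. This toy version uses exhaustive search over a small window with sum-of-absolute-differences (SAD) as the matching cost; real encoders use much faster search strategies.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally-sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def get_block(frame, y, x, size):
    return [row[x:x + size] for row in frame[y:y + size]]

def motion_estimate(ref, cur, y, x, size, search=2):
    """Find the motion vector (dy, dx) minimizing SAD within a small window.

    The vector points *from* the current block *to* where it came from
    in the reference frame.
    """
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if 0 <= ry <= len(ref) - size and 0 <= rx <= len(ref[0]) - size:
                cost = sad(get_block(ref, ry, rx, size),
                           get_block(cur, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
    return best[1]

def residual(ref, cur, y, x, mv, size):
    """Motion compensation: subtract the motion-shifted reference block."""
    ref_block = get_block(ref, y + mv[0], x + mv[1], size)
    cur_block = get_block(cur, y, x, size)
    return [[c - r for c, r in zip(cr, rr)]
            for cr, rr in zip(cur_block, ref_block)]

# Toy frames: a bright 2x2 patch moves down-right by one pixel.
ref = [[0] * 4 for _ in range(4)]
cur = [[0] * 4 for _ in range(4)]
for yy in (0, 1):
    for xx in (0, 1):
        ref[yy][xx] = 9
        cur[yy + 1][xx + 1] = 9

mv = motion_estimate(ref, cur, y=1, x=1, size=2)
print(mv)                                     # (-1, -1): the block came from up-left
print(residual(ref, cur, 1, 1, mv, size=2))   # all zeros: a perfect match
```

Note how the motion vector points backwards to where the block came from, and a perfect match leaves an all-zero residual, which is why residual frames compress so well.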

In this way, interframes can leverage spatiotemporal redundancies between frames very effectively. In turn, this does mean that intraframes (i.e., i-frames) are relatively expensive to encode. Despite that, there are still a number of reasons to include more i-frames in a video: they give players points to seek to, and they prevent decoding errors from propagating indefinitely from frame to frame.
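
One simple placement policy can be sketched in Python. This is an illustration, not H.264's actual rate-control logic: start a new i-frame on a fixed interval, so players have points to seek to, or at a scene cut, where prediction fails anyway.

```python
def assign_frame_types(scene_cuts, n_frames, gop_size=12):
    """Assign 'I' or 'P' to each frame index (a simplified policy sketch).

    A new i-frame starts a fresh group of pictures (GOP) either on a
    fixed interval (so players can seek) or at a scene cut (where
    prediction fails anyway). b-frames are omitted for simplicity.
    """
    types = []
    since_last_i = gop_size  # force frame 0 to be an i-frame
    for i in range(n_frames):
        if since_last_i >= gop_size or i in scene_cuts:
            types.append("I")
            since_last_i = 0
        else:
            types.append("P")
            since_last_i += 1
    return types

# A 10-frame clip with a scene cut at frame 5 and a GOP of 4:
print("".join(assign_frame_types(scene_cuts={5}, n_frames=10, gop_size=4)))
# IPPPPIPPPP
```

A smaller GOP means more expensive i-frames but finer-grained seeking and faster recovery from corruption; real encoders tune this trade-off.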


H.264 defines a large number of other features that can further reduce the bitrate of compressed video. These include general improvements, such as better loss resilience, improved entropy coding, customizable quantization, and more flexible intra- and inter-frame dependencies. They also include specific improvements, such as an integer approximation of the Discrete Cosine Transform (DCT) that is faster to compute, and intraframe prediction that spatially extends neighboring macroblocks to exploit redundancy within a frame.
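
The 4x4 integer core transform is worth a closer look. Its matrix contains only small integers, so it can be computed exactly with integer adds, subtracts, and shifts, with the DCT's scaling factors folded into the quantization step instead. A sketch in Python:

```python
# H.264's 4x4 forward core transform: an integer approximation of the DCT.
C = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def integer_transform(block):
    """Y = C X C^T, exact in integer arithmetic (no floating point)."""
    return matmul(matmul(C, block), transpose(C))

# A flat block has all its energy in the DC (top-left) coefficient:
flat = [[5] * 4 for _ in range(4)]
print(integer_transform(flat)[0][0])  # 80
```

As with the real DCT, a uniform block concentrates all of its energy in the single DC coefficient, which is what makes the subsequent quantization and entropy coding effective.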

There are too many details to include in this post, so we've covered just the key ideas and major insights above. With that, we've now largely wrapped up the chain of ideas that ultimately led to the video compression format that we know and love today — the mp4.


  1. According to the PNG "File Header" specification, the first 8 bytes of a PNG file identify it as such. 

  2. There's a slight caveat here, which is that p-frames and b-frames could technically reference frames before the last i-frame in decode order. This is distinct from presentation order. Knowing this, there is actually another kind of i-frame, called an "IDR frame", which tells the video decoder that all previously-decoded frames can be discarded before continuing. For a more detailed explanation of how IDR frames differ from i-frames, see this post.

  3. In H.264, this is slightly more complicated. A macroblock can actually be split into smaller, possibly non-square, partitions of pixels. Each partition then contains its own motion vector and residual. 

  4. Note that the difference between two frames could contain negative values. To handle this, H.264 actually takes the xor between the reference and current macroblocks. 

  5. Frame types can also be called "picture types". "Picture" is a broader term, which encompasses both frames and "fields". Each image is composed of two fields: the odd-numbered rows and the even-numbered rows. Splitting an image into two fields like this is what enables interlaced video, where the two fields are transmitted and displayed in alternation.