posted on Apr 7, 2024
Why Large Language Model inference is memory bound
To understand the notion of "memory" bound, we need to look at the bottlenecks of any workload on a computer. All workload throughputs are bound by either compute or memory.
Which bound applies depends on which side is slower: Does it take more time to load the data needed for the computation onto the chip? Or does it take more time to actually run the computation itself?
The answer to this pair of questions is largely hardware dependent, but in the case of Large Language Models, the workload is so skewed that most of inference is memory bound, regardless of the hardware.
How to determine a workload's bound
There are two aspects to determining which bound applies to a workload:
- Determine the arithmetic intensity $w$ of the workload.
- Determine the ideal arithmetic intensity $h$ of the hardware.
Arithmetic intensity is defined as the compute load over the memory load. This is the number of FLOPs used over the number of bytes moved — in other words, the "operations per byte".
If the workload's arithmetic intensity is less than the ideal hardware arithmetic intensity ($w < h$), then our workload is running fewer operations per byte than the hardware supports. Thus, the workload is "memory bound".
If the workload's arithmetic intensity is higher than the ideal hardware arithmetic intensity ($w > h$), then our workload is running more operations per byte than the hardware supports. Thus, the workload is "compute bound".
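As a minimal sketch of this comparison, here's what the decision looks like in Python. The workload numbers below are hypothetical, and the ideal intensity of 153 is the A100 figure derived in the next section:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Operations per byte: compute load divided by memory load."""
    return flops / bytes_moved

def classify(w: float, h: float) -> str:
    """Compare the workload's intensity w against the hardware's ideal intensity h."""
    return "memory bound" if w < h else "compute bound"

# Hypothetical workload: 1e12 FLOPs over 5e9 bytes moved, i.e. 200 ops/byte.
w = arithmetic_intensity(flops=1e12, bytes_moved=5e9)
print(classify(w, h=153))  # compute bound against an ideal intensity of 153
```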
Ideal arithmetic intensity
For most publicly available GPUs, we can read public technical specifications to determine the ideal arithmetic intensity. In this post, we'll list just the ideal arithmetic intensities for the most popular GPUs — Nvidia's V100, A100, H100 and Apple's M1, M2.
- Nvidia: For these calculations, I assume a Large Language Model workload, where all computation is performed in half precision and all tensor dimensions are multiples of 8. Given these two workload attributes, we can leverage Nvidia's half precision tensor cores. We assume the weights are dense, so out of the box, we can't leverage post-Volta's sparsity optimizations. When there are different form factor options, I pick the option that supports the highest memory bandwidth.
- Apple: Only FP32 compute bandwidth is reported, so we use those numbers for our calculations. In general, there are very few publicly available statistics for the Apple M* GPUs, so we omit the later generations from our table below.
Brand | GPU | Compute (TFLOPS) | Memory bandwidth (TB/s) | Ideal arithmetic intensity (FLOPs/byte) |
---|---|---|---|---|
Nvidia | V100 | 125 | 0.9 | 139 |
Nvidia | A100 | 312 | 2.039 | 153 |
Nvidia | H100 | 1671 | 3.9 | 428 |
Apple | M1 Ultra | 21 | 0.8192 | 25.6 |
Apple | M2 Ultra | 27.2 | 0.8 | 34 |
For the Nvidia GPUs, notice that all of the above ideal arithmetic intensities are in the hundreds — and increasing rapidly. By contrast, the ideal arithmetic intensities for Apple GPUs are no more than a few dozen — with barely any change between generations. This will greatly affect which workloads run best on which hardware.
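As a sanity check, here's a short sketch that recomputes the table's last column from the quoted specs, dividing peak FLOPs per second by peak bytes per second:

```python
# (TFLOPS, TB/sec) as quoted in the table above.
specs = {
    "V100": (125, 0.9),
    "A100": (312, 2.039),
    "H100": (1671, 3.9),
    "M1 Ultra": (21, 0.8192),
    "M2 Ultra": (27.2, 0.8),
}
for gpu, (tflops, tb_per_sec) in specs.items():
    # TFLOPS / (TB/sec) = FLOPs per byte, since the factors of 10^12 cancel.
    print(f"{gpu}: {tflops / tb_per_sec:.1f} FLOPs/byte")
```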
Arithmetic intensity for matrix multiplications
Before we dive into the arithmetic intensity of Large Language Models, let's first determine arithmetic intensity for a simpler workload — just a matrix multiplication between $m \times n$ and $n \times k$ matrices. For demonstration purposes, let's assume a naive matrix multiplication without tiling.
- Every output requires $2n$ reads, $1$ write, and $2n$ FLOPs.
- For an entire row of outputs, a row of the first matrix is reused, and every column in the second matrix is loaded. This means we need $n + kn$ reads, $k$ writes and $2kn$ FLOPs.
- Repeat this for $m$ rows of output, which nets $mn + kmn$ reads, $mk$ writes, and $2mkn$ FLOPs.
The arithmetic intensity of matrix multiplication is then the following
$$\frac{2mkn}{mn + kmn + mk} = \frac{2kn}{n + kn + k} = \frac{1}{\frac{1}{2} + \frac{1}{2k} + \frac{1}{2n}}$$
Just from the final expression, we can see that arithmetic intensity is capped at 2: it approaches that maximum only as $n$ and $k$ grow large, and it drops to just $\frac{2}{3}$ when $k = n = 1$. Either way, this is a far cry from the ideal arithmetic intensities for all of the GPUs we considered. Knowing this, matrix multiplication is clearly memory bound using the naive implementation.
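Here's the same accounting as a short sketch, counting memory in values accessed rather than bytes, just like the derivation above. The dimensions are illustrative:

```python
def naive_matmul_intensity(m: int, n: int, k: int) -> float:
    """Arithmetic intensity of a naive (m x n) @ (n x k) matmul, without tiling."""
    flops = 2 * m * k * n                  # a multiply and an add per inner element
    accesses = m * n + k * m * n + m * k   # reads of A, repeated reads of B, writes of C
    return flops / accesses

print(naive_matmul_intensity(m=1, n=1, k=1))        # ~0.67, the minimum
print(naive_matmul_intensity(m=1, n=4096, k=4096))  # ~2.0, approaching the cap
```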
Per When to tile two matrix multiplies, we know that tiling can greatly improve the arithmetic intensity of matrix multiplies. For each $b \times b$ output tile, we would need to load $bn$ values from the first matrix and $nb$ values from the second, then write $b^2$ outputs, resulting in
$$\underbrace{(2bn + b^2)}_{\textrm{per tile}} \underbrace{\lceil\frac{m}{b}\rceil\lceil\frac{k}{b}\rceil}_{\textrm{\# tiles}}$$
total memory accesses. For each tile, we would also need $2n$ FLOPs for each of the $b^2$ outputs in the tile, leading to
$$\underbrace{2nb^2}_{\textrm{per tile}} \underbrace{\lceil\frac{m}{b}\rceil\lceil\frac{k}{b}\rceil}_{\textrm{\# tiles}}$$
total FLOPs. Our expression for arithmetic intensity luckily cancels out both ceiling-divide terms, leaving us with an arithmetic intensity of
$$\frac{2nb^2}{2nb + b^2} = \frac{2nb}{2n + b} = \frac{1}{\frac{1}{b} + \frac{1}{2n}}$$
From this expression, we observe that both larger tile sizes and larger inner dimensions lead to higher arithmetic intensity. To calculate an approximate value: Most tile sizes range from 8 to 128 or even 256, depending on hardware, and for Large Language Models, the model dimension $n$ is in the thousands, from 2048 to 8192. Given that $2n \gg b$, we can expect $\frac{1}{b}$ to dominate the denominator, and the entire expression can be approximated with
$$b$$
In most cases then, we can expect that tiled matrix multiplication is still memory bound on Nvidia GPUs, but we're significantly closer to the hardware's ideal arithmetic intensity. On Apple M* GPUs, smaller tile sizes such as 8 or 16 are memory bound, whereas larger tile sizes such as 128 are compute bound.
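As a quick sanity check on those claims, here's a sketch of the closed-form expression above, using $n = 4096$ as a representative (assumed) model dimension:

```python
def tiled_matmul_intensity(n: int, b: int) -> float:
    """Arithmetic intensity of a matmul tiled into b x b output tiles, with inner dimension n."""
    return (2 * n * b**2) / (2 * n * b + b**2)  # equivalently 1 / (1/b + 1/(2n))

for b in (8, 16, 128, 256):
    print(b, round(tiled_matmul_intensity(n=4096, b=b), 1))
# 8 -> 8.0, 16 -> 16.0, 128 -> 126.0, 256 -> 248.2
```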
With all that said, this is only for a generic matrix multiplication. Let's now look at the transformer more specifically.
Arithmetic Intensity for the MLP
In the original transformer, the MLP was simply two matrix multiplies with a non-linearity in between. Since this non-linearity was applied element-wise, it could be fused with the first matrix multiply, so we ignore it for our purposes. The MLP's hidden dimension was generally four times the model's hidden dimension — but as we discussed above, these dimensions didn't matter anyhow. The arithmetic intensity would always grow roughly proportionally to the tile size $b$.
With more modern transformers, the MLP is slightly more complicated, involving three weight matrices instead of two. As we discussed in Practical Introduction to Large Language Models, the modern MLP looks like the following.
$$\text{MLP}(X) = (\text{silu}(XW_G) \otimes XW_U)W_D$$
There's only one extra element-wise operation, which would have an arithmetic intensity of $\frac{1}{3}$, as there's one multiply for every 2 inputs and 1 output. Overall then, the arithmetic intensity is still approximately $b$, dominated by the three matrix multiplies.
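To see why the element-wise multiply barely dents the overall figure, here's a rough sketch that totals FLOPs and memory accesses for this gated MLP using the same tiled-matmul counting as before. The sequence length, model dimension, and hidden dimension are illustrative, Llama-7B-like assumptions:

```python
import math

def tiled_accesses(m: int, n: int, k: int, b: int) -> int:
    """Memory accesses for an (m x n) @ (n x k) matmul with b x b output tiles."""
    return (2 * b * n + b * b) * math.ceil(m / b) * math.ceil(k / b)

def gated_mlp_intensity(s: int, d: int, h: int, b: int) -> float:
    """Rough intensity of (silu(X W_G) * X W_U) W_D for an (s x d) input X."""
    flops = 2 * s * d * h                        # X @ W_G
    flops += 2 * s * d * h                       # X @ W_U
    flops += s * h                               # element-wise silu-and-multiply, ~1 FLOP/element
    flops += 2 * s * h * d                       # (...) @ W_D
    accesses = 2 * tiled_accesses(s, d, h, b)    # the two up-projections
    accesses += 3 * s * h                        # 2 reads + 1 write per element-wise output
    accesses += tiled_accesses(s, h, d, b)       # the down-projection
    return flops / accesses

print(round(gated_mlp_intensity(s=2048, d=4096, h=11008, b=128)))  # ~125, roughly the tile size b
```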
Arithmetic Intensity for Self-Attention
The bulk of self-attention is a series of matrix multiplies. The only exceptions are an element-wise division, where the attention scores $QK^T$ are divided by $\sqrt{d_k}$, the square root of the per-head dimension, and a softmax over the innermost dimension. We can ignore the fact that matrix multiplies within each head are independent of matrix multiplies in other heads, as these are all batched and feature the same workload profile.
We deduced previously that matrix multiplies have an arithmetic intensity of $b$ and that element-wise operations have an arithmetic intensity of $\frac{1}{3}$. Softmax in turn has an arithmetic intensity of approximately 1, since each element needs only a few FLOPs (a subtraction, an exponential, and a division) against a few memory accesses (a read, a write, and a share of the reduction traffic). All in all then, self-attention also features an arithmetic intensity of approximately $b$, dominated by the matrix multiplies.
Takeaways
We now understand that Large Language Models, and matrix multiplies more generally, are memory bound on most modern GPUs. However, with optimizations such as tiling, which is ubiquitous, arithmetic intensity improves greatly, and these workloads have a chance of becoming compute bound. There are two caveats:
- Our analysis ignores the effects of batching, which allows us to reuse data if we order computation correctly. We'll leverage this in How speculative decoding works. For this reason as well, our analysis above really only applies to the autoregressive decoding part of inference; prompt processing is much more likely to be compute bound.
- We naively assumed that every output tile will need to completely re-load all data — and that this data could not be reused between tiles. This is an acceptable worst case for most matrix multiplications, but this can be optimized, as we'll see in How Flash Attention works.
Moving forward, we'll see that latency optimizations for Large Language Models, and for neural networks broadly, center on improving arithmetic intensity. This idea will become the cornerstone of our discussions.