from Guide to Machine Learning on Apr 9, 2023

Illustrated Intuition for Transformers

Large Language Models (LLMs) are large neural networks that take text as input and generate text as output. We discussed the intuition for these models in the last post; in this post, we repeat the same process in more detail, visually constructing these models from first principles and motivating the intuition behind each building block.

Transformers are specifically designed to process text of any length, meaning your input could be a word, a sentence, a paragraph, or even an essay. These models can also generate output text of any length, anything from a sentence to a whole book.

In this post, we'll focus on how the transformer achieves the above. To better explain ideas, we'll use translation as an example. Here's a starter diagram that shows an LLM translating the French Je t'aime to the English I love you.

Sequence to sequence pipeline: Diagram showing inputs and outputs of a large language model, for the example translation task of translating the French Je t'aime to the English I love you. Pink means French; Purple means English.

Our goal is to figure out what goes on in the box with the question mark.

1. How to transform a word to a word

Let's first focus on transforming just one word. We'll call the function that transforms this one word our "algorithm". As a first step, this algorithm needs to perform computation on an inputted word.

Word to word pipeline: Diagram focusing on just translating the French Je to the English I. (1) Start with the French word Je. (2) Perform some computation on this word. (3) End up with the English word I. Pink means French; Purple means English.

Earlier, we constructed a single-word language model that does translation:

(input) - French + English = (output)

However, what does it mean to "subtract" or "add" a word? In order to "perform computation on a word", we need to translate a word into a vector first. One such way is to use a one-hot encoding, which is illustrated below.

How to one-hot encode: Every dimension in your vector corresponds to a unique word, so every word maps to a vector of 0s, with a 1 in the corresponding dimension. In this example, our dictionary only contains 3 words: Je, You, and I. As a result, each vector is 3-dimensional. Filled-in solid color means 1. Empty squares mean 0. As an example, on the left, we convert Je to the vector [1, 0, 0].
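To make this concrete, here's a minimal numpy sketch of one-hot encoding. The three-word dictionary and its ordering are assumptions chosen to match the vectors in the figures (Je as [1, 0, 0], I as [0, 1, 0]).

```python
import numpy as np

# Toy dictionary; the ordering is an assumption matching the figures above.
vocab = ["Je", "I", "You"]

def one_hot(word: str) -> np.ndarray:
    """Return a vector of 0s with a 1 in the word's dimension."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Je"))  # [1. 0. 0.]
print(one_hot("I"))   # [0. 1. 0.]
```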

We now augment the process from above. Instead of processing and generating words, our algorithm processes and generates one-hot encoded vectors.

Our process now (1) starts with the French word Je. (2) This word corresponds to the one-hot encoded vector [1, 0, 0], shown in pink. (3) Perform some computation on this vector. (4) We now have another one-hot encoded vector [0, 1, 0], shown in purple. (5) This vector corresponds to I.

Let's now fill in the "algorithm", denoted with the question mark, to understand what sort of computation goes on.

2. How to make arithmetic meaningful

At a high level, in order to transform a vector from French to English, we remove the French component from our vector and add an English component. The resulting vector should be the translated word.

However, this method doesn't work in the current one-hot encoded space: there, every word lies on its own axis, so all English words lie in one subspace and all French words lie in a separate subspace. This makes it impossible to represent meanings shared across languages [1]. In other words, there's no vector that could represent "I" independent of any language [2].
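To see this concretely, here's a quick numpy check of the arithmetic worked out in footnote 1, using its four-word toy vocabulary.

```python
import numpy as np

# One-hot vectors for the toy vocabulary in footnote 1.
Je, Tu = np.array([1., 0., 0., 0.]), np.array([0., 1., 0., 0.])
I, You = np.array([0., 0., 1., 0.]), np.array([0., 0., 0., 1.])

french = (Je + Tu) / 2    # "French" direction: average of the French words
english = (I + You) / 2   # "English" direction: average of the English words

y = Je - french + english
print(y)  # [ 0.5 -0.5  0.5  0.5]

# The result is equally distant from every English word.
print(np.linalg.norm(y - I), np.linalg.norm(y - You))  # 1.0 1.0
```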

To fix this, we'll adjust the vectors corresponding to each word. Start with the one-hot encoded vectors. Then,

  1. Take a large amount of text. If two words occur in the same sentence, move the corresponding vectors for those words closer. If two words don't occur together often, move them farther apart.
  2. It turns out that after doing this for a long time, arithmetic becomes meaningful. For example, you can compute king - man + woman, which lands near queen.

This vector space is called word2vec.

Illustration of word2vec: We iteratively move vectors closer, when those words co-occur often in text. In this case, French words Je and Tu co-occur much more often with each other than with the English word I.
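As a rough demonstration (not part of the pipeline we're building), gensim's pretrained word2vec vectors reproduce this arithmetic. The sketch below assumes the downloadable word2vec-google-news-300 model, which requires a large one-time download.

```python
import gensim.downloader as api

# Load pretrained word2vec vectors (assumes network access; large download).
vectors = api.load("word2vec-google-news-300")

# king - man + woman: find the word closest to the resulting vector.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically: [('queen', ...)]
```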

We can now update our pipeline to use these new word2vec vectors. Our algorithm no longer processes and generates one-hot encoded vectors. Instead, our algorithm processes and generates word2vec vectors.

Pipeline using word2vec: Start with the word Je. Lookup the corresponding word2vec vector (pink). Run the algorithm on this vector to produce a new word2vec vector (purple). This then corresponds to the word I.

Now, let's try removing the French component from our input vector and adding the English component.

  1. Arithmetic: In this word2vec space, we can compute $y = \textrm{Je} - \textrm{French} + \textrm{English}$ [5].
  2. Nearest neighbor: We then take the resulting vector $y$ and find the closest word2vec vector that represents a word. In our example, the closest word2vec vector corresponds to "I".

This computation now forms the first version of our algorithm. Here's the new pipeline.

Full word2vec-based pipeline: Start with the word Je. Lookup the corresponding word2vec vector (pink). Subtract the French component and add the English component (pink to purple). Then, find the closest word2vec vector (purple). Finally, this word2vec vector corresponds to the word I.

Our pipeline now consists of three steps:

  1. Encode our word into a vector, using word2vec.
  2. Compute with this vector, by subtracting the French vector and adding the English vector.
  3. Decode our vector back into a word. To do this, find the closest vector that corresponds to a word. Then, output that word.

This is now a complete, end-to-end pipeline for translating a French word into the corresponding English word.
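Here's a minimal numpy sketch of this three-step pipeline. The vectors are hand-made stand-ins for word2vec vectors, built so that each word has a "meaning" component plus a language direction; real word2vec vectors are learned, so treat the numbers and vocabulary as purely illustrative.

```python
import numpy as np

# Hypothetical "word2vec" vectors: the first two dimensions carry meaning,
# the last dimension is a French direction (English words have 0 there).
emb = {
    "Je":  np.array([1.0, 0.0, 1.0]),
    "Tu":  np.array([0.0, 1.0, 1.0]),
    "I":   np.array([1.0, 0.0, 0.0]),
    "You": np.array([0.0, 1.0, 0.0]),
}
FRENCH = np.array([0.0, 0.0, 1.0])
ENGLISH = np.array([0.0, 0.0, 0.0])

def translate(word: str) -> str:
    y = emb[word] - FRENCH + ENGLISH   # 1. encode (lookup), 2. compute
    # 3. decode: nearest neighbor over the vocabulary
    return min(emb, key=lambda w: np.linalg.norm(emb[w] - y))

print(translate("Je"))  # I
print(translate("Tu"))  # You
```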

3. How to generalize

So far, we've manually constructed our pipeline so that it specifically translates French to English. However, we want to learn a generic text-to-text model, not just a specific French-to-English translator. To do this, we'll generalize our pipeline by replacing all operations with matrix-vector products. This allows us to later customize these matrices, for different tasks.

General word-to-word pipeline: Start with a word. Encode that word as a vector. Transform the vector. Take your resulting vector, and decode it back into a word.

Recall we have 3 main operations, pictured above.

  1. Encode our word into a vector.
  2. Transform this vector.
  3. Decode the output vector back into a word.

First, change how we encode. Previously, for the input word, we looked up the corresponding word2vec vector. In our new version, we perform a functionally equivalent operation: we one-hot encode the input word, then multiply a matrix of word2vec column vectors by this one-hot vector.

Encode as a matrix-vector product. The trapezoid represents a matrix multiply, going from 2 dimensions to 3 dimensions. Here, the input 2-dimensional vector is a one-hot vector, so the matrix-vector product simply selects one of the columns in our matrix. This is effectively looking up the word2vec vector for the input word, by the index of that word.
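Here's a small numpy sketch of this step: multiplying an embedding matrix by a one-hot vector just looks up that word's column. The matrix shapes match the figure (2 words, 3 dimensions); the numbers themselves are made up.

```python
import numpy as np

# Each column is one word's word2vec vector: shape (3 dims, 2 words).
E = np.array([[0.2, 0.9],
              [0.7, 0.1],
              [0.5, 0.3]])

x = np.array([1.0, 0.0])  # one-hot vector for the word at index 0

print(E @ x)                        # [0.2 0.7 0.5]
print(np.allclose(E @ x, E[:, 0]))  # True: same as looking up column 0
```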

Second, change how we compute. We'll represent $- \textrm{French} + \textrm{English}$ using a sequence of matrix multiplies. This allows us to perform any vector arithmetic.

Compute as matrix-vector products. The compute step is now a series of matrix multiplies, vector additions, and non-linearities. The above representation of a neural network, a pink-to-purple network labeled "transform", encapsulates all of these components. This allows the transform step to represent any vector arithmetic.

Third, change how we decode. Instead of computing the nearest neighbor, we'll compute the most similar vector [3]; maximizing similarity can then be represented as a matrix-vector product [4]. This leads us to a final, generalized version of our pipeline.

Decode as matrix-vector products. The decoding step now computes a vector-matrix product, then computes the argmax of the output vector to find the index of the output word. In purple on the right, we represent this as a matrix multiply that translates from an input vector to a one-hot output vector. This is, under specific settings, equivalent to finding the nearest neighbor.
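Here's a small numpy sketch of the decoding step: stack the (unit-norm) output-word vectors as rows, multiply by the transformed vector, and take the argmax. As footnote 3 notes, with unit-norm vectors this matches nearest-neighbor search. The vocabulary and numbers are toy values.

```python
import numpy as np

vocab = ["I", "You"]
X = np.array([[0.8, 0.6, 0.0],   # row i is the vector for vocab[i]
              [0.0, 0.6, 0.8]])  # both rows already have unit norm

y = np.array([0.7, 0.5, 0.1])    # output of the transform step

scores = X @ y                   # one similarity score per word
print(vocab[int(np.argmax(scores))])  # I
```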

For our final step, we will learn these matrices instead of hard-coding them. More importantly, we can now train a model for any single-word transformation problem, taking one word as input and producing one word as output.
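To make the generalized pipeline concrete, here's a hedged PyTorch sketch: a learned encoding matrix, a learned transform, and a learned decoding matrix, chained end to end. The class name, sizes, and layer choices are assumptions for illustration, not the exact architecture of any particular model.

```python
import torch
import torch.nn as nn

class WordToWord(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.encode = nn.Embedding(vocab_size, dim)   # one-hot times a matrix
        self.transform = nn.Sequential(               # the "compute" step
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.decode = nn.Linear(dim, vocab_size)      # similarity scores

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        return self.decode(self.transform(self.encode(word_ids)))

model = WordToWord(vocab_size=4)
logits = model(torch.tensor([0]))   # scores over the 4-word vocabulary
print(logits.argmax(dim=-1))        # index of the predicted output word
```

Because every step is now a learnable matrix, the same structure can be trained for translation or any other single-word transformation task.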

4. How to use many inputs

So far, we've transformed one French word into one English word. However, multiple French words may influence the English translation.

For example, in French, the word Vous may refer to one person [6] or to many people. We need context to understand which meaning is intended. Take the phrase Avez-vous faim? ("Are you hungry?") for example: the sentence reads the same whether it is addressed to one person or to several.

Knowing this, our model needs to be able to use multiple input words during the generation process. In particular, our goal in this step is to augment our pipeline to accept multiple inputs; the pipeline will go from (1) one input and one output to (2) multiple inputs and one output.

To achieve this, we will augment our encoding process. After encoding each word separately, take a weighted sum of the encodings. The goal is to add context, transforming (1) a single-word encoding into (2) a single-word encoding with context.

Weighted sum of encodings: After the "encode" step, we now have a vector representing each word. We then combine these vectors using a weighted sum, to get one vector. In the example above, we take Je and compute a weighted sum with t' and aime. The result is a vector for Je, but now with context. This vector representing Je with context is then fed to the "compute" step.
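Here's a minimal numpy sketch of this contextualization step. The vectors and weights are hand-picked placeholders; the weights only need to sum to 1, and how they are actually chosen is the "attention" mechanism we unpack later.

```python
import numpy as np

# Toy word2vec-style vectors for the input words (made-up numbers).
encodings = {
    "Je":   np.array([1.0, 0.0, 0.3]),
    "t'":   np.array([0.1, 0.8, 0.2]),
    "aime": np.array([0.0, 0.2, 0.9]),
}

# Hypothetical weights for contextualizing "Je"; they sum to 1.
weights = {"Je": 0.7, "t'": 0.2, "aime": 0.1}

je_with_context = sum(w * encodings[word] for word, w in weights.items())
print(je_with_context)  # the vector for Je, nudged toward its context
```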

We will explain how the weighted sum works in more detail, in a later section. For now, our pipeline consists of 4 steps:

  1. Encode our word into a vector.
  2. Contextualize each vector using other vectors.
  3. Transform this vector.
  4. Decode the output vector back into a word.

At a high level, we now roughly understand how to translate multiple French words into one English word. More broadly, we understand how to transform multiple input words into one output word.

5. How to generate many outputs

In this step, we will finally augment our pipeline to transform multiple input words into multiple output words. We can start by augmenting our pipeline above: Use the same method to translate each French word to English, one by one. Unfortunately, doing this makes Je t'aime into I you love.

Translating one word at a time: From left to right, we encode the word, add context, run computation, then decode back into a word. Note this transformation occurs word by word, one at a time. Unfortunately, single-word translation assumes that each French word and its English translation occupy the same position, meaning Je t'aime is translated into I you love.

To fix this, we need to generate outputs in a way that does not depend on the input word order.

Previously, for each input word separately, we generated its corresponding output word. Instead, we should use all input words together when generating each output word. To do this, we take a weighted sum of all input words, when generating the first word.

Using all words to translate: Instead of translating each word independently, we use all words to translate just the first word. From left to right, we encode the word (pink), add context, run computation (pink to purple), then take a weighted sum of all the outputs (purple) before decoding back into a word.

However, we now need a way to generate subsequent words. To do this, we reformulate the task as next-word prediction. In other words, our model is given the previous words and must predict the next word.

To predict the first word, we provide a special start-of-sequence token as the "previous" word.

Next-word prediction for the first word. Our model now predicts the next word from the input. Our input is a special "start" token (gray), which is passed to the model to generate the first word. From left to right, we encode the word, add context, run computation, then add context from the encoded input words (orange) to the start token (gray), run some more computation to transform this into the next word (orange to purple), then decode into the next word.

To generate the second word, we then feed in the first word as input. Our pipeline now consists of 6 steps:

  1. Encode our input word into an input vector.
  2. Contextualize each input vector using other input vectors.
  3. Transform this input vector.
  4. Contextualize the previous word's vector with the input vectors. If generating the first word, use a start-of-sequence token as the "zero-th word".
  5. Transform this contextualized previous word's vector into the next word's vector.
  6. Decode the output vector back into a word. Repeat this process until the output word is end-of-sequence.

Next-word prediction for the second word. Our model again predicts the next word from the input. This time, we feed in both the start token and the first word. Our model now predicts the first and second words in the output sequence.

We can continue this indefinitely to generate an unlimited number of output tokens. Once the decoder outputs a special end-of-sequence token, we stop.

Next-word prediction termination. Our model continues this process of feeding in the output back into the input, until we reach a special "end token", denoted in gray in the bottom right. This tells the network that the output sequence has ended.

This is how the transformer is able to output sequences of any length. Now, we have a complete pipeline for taking in multiple input words and generating multiple output words.

To summarize this section, we reformulate the task as next-word prediction. This lets us generate multiple output words by repeatedly feeding the model's output back in as input, as in the sketch below.
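Here's a schematic Python sketch of that generation loop. The model object, its predict_next method, and the token strings are hypothetical placeholders standing in for the 6-step pipeline above.

```python
START, END = "<start>", "<end>"

def generate(model, input_words, max_len=50):
    """Greedily generate output words until the end-of-sequence token."""
    output = [START]
    while len(output) < max_len:
        # Hypothetical call: predict the next word from the inputs and
        # the previously generated words (steps 1-6 of the pipeline).
        next_word = model.predict_next(input_words, output)
        if next_word == END:
            break
        output.append(next_word)  # feed the output back in
    return output[1:]             # drop the start token
```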

6. How to scale transformers

To scale up these models in size, we repeat parts of these transformers. Start with the pipeline from before.

Summary of next-word prediction pipeline: From left to right, we encode the input (pink), add context, run some computation (pink to orange), add context (orange) to the previous tokens (purple, middle). We then predict the next token as output (purple, right). We repeat this process of continually asking the model for the next word and the next word and the next… until we hit the end token.

Now, let's abstract away the details for each step in this pipeline. Every weighted sum is called "attention". The compute step is called a "multi-layer perceptron" or MLP for short.

Attention-MLP diagram: We summarize this process as the following. From left to right, we encode and add context to the input (pink, "attention"), transform the word (pink to orange, "MLP"), add context to the previous words (orange, "attention"), transform the word to the next word (orange to purple, "MLP"), then decode the vector back into a word (purple).

We can then group these operations together:

Encoder-decoder diagram: We summarize the process as the following. From the left to right, we encode all the input words. The inputs, along with the previous words 0... n-1, are then decoded into the next output words, words 1... n.

We can now repeat these building blocks:

Encoder-decoder diagram scaled up: We repeat the process from before but now expand the number of encoders and decoders we have.

We can repeat any number of encoders and decoders to make an arbitrarily large model. At its core, the architecture for a Large Language Model is simply an exceptionally large transformer.
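As a point of reference, PyTorch's built-in nn.Transformer stacks exactly these encoder and decoder blocks; the layer counts and sizes below are arbitrary example settings, not those of any particular large language model.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 1, 512)  # 10 encoded input words (seq, batch, dim)
tgt = torch.rand(7, 1, 512)   # 7 previously generated words
out = model(src, tgt)         # one vector per output position
print(out.shape)              # torch.Size([7, 1, 512])
```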

Conclusion

We've now covered what it takes to transform multiple input words into multiple output words. Namely, we unpacked a transformer, a building block for large language models.

However, there were a number of details we glossed over. The most important is how to compute the weighted sum that adds context to each word. We've only vaguely called this "attention"; in the next post, we'll unpack exactly how attention works.





  1. Let's see what happens when we subtract the French component and add the English component in a one-hot encoded space: (1) Say we have two French words $\texttt{Je} = [1, 0, 0, 0]$ and $\texttt{Tu} = [0, 1, 0, 0]$, and two English words $\texttt{I} = [0, 0, 1, 0]$ and $\texttt{You} = [0, 0, 0, 1]$. (2) Start with the vector for the French word Je, which is [1, 0, 0, 0]. (3) To get a vector that represents French, take the average of all the French vectors, which is [.5, .5, 0, 0]. (4) Compute Je - French, which gives us [.5, -.5, 0, 0]. (5) To get a vector that represents English, take the average of all the English vectors, which is [0, 0, .5, .5]. (6) Compute Je - French + English, which gives us [.5, -.5, .5, .5]. (7) Compute the closest English word... but now we run into a problem: all English words are equally distant from this vector!

  2. Technically, it's possible to construct a vector that translates into both the French Je and the English I. Basically, define a vector that contains some component of both: $v_I = [.5, 0, .5, 0]$. Then, adding the French direction gives us [1, .5, .5, 0], which is closest to the vector for Je = [1, 0, 0, 0]. However, this is a bit of a degenerate construction. The best way to illustrate this is to add a new language: say we expand to include Mandarin, so all vectors are now 6-dimensional. We then run into the same problem as in the previous footnote when we try to translate $v_I$ into Mandarin: the new vector $v_I + \textrm{Mandarin}$ would be equidistant from all Mandarin words.

  3. This is only true when all vectors are unit norm. In this case, $\textrm{argmin}_x \texttt{dist}(x, y) = \textrm{argmin}_x \|x-y\|_2^2 = \textrm{argmin}_x \|x\|_2^2 - 2x^Ty + \|y\|_2^2 = \textrm{argmin}_x 2 - 2x^Ty = \textrm{argmin}_x -x^Ty = \textrm{argmax}_x x^Ty$. Here, we assume that all vectors are unit norm, or that $\|x\|_2^2 = \|y\|_2^2 = 1$, which is used in the 3rd step. In the 4th step, we ignore constants, since constants do not affect the index of the minimum value. In the 5th step, we note that the minimum of negative values is the maximum of positive values. 

  4. We can represent the dot products between $y$ and a number of candidate vectors $x_i$ as a matrix-vector product: $\textrm{argmax}_i x_i^Ty = \textrm{argmax} [x_0^Ty, x_1^Ty, \cdots, x_n^Ty] = \textrm{argmax}\, Xy$, where the rows of $X$ are the candidate vectors $x_i$.

  5. An alternative would be to compute $y = \texttt{Je} - \textrm{proj}_\texttt{Fr}\texttt{Je} + \textrm{proj}_\texttt{En}\texttt{Je}$. Notice that since our vectors are no longer one-hot, these two projections are now non-zero. This projection-based version is arguably the more natural first approach, given that we are literally subtracting the French component and adding an English component. However, we chose to write $\textrm{Je} - \textrm{French} + \textrm{English}$ to exploit a surprising benefit of the word2vec space: namely, that arithmetic is semantic.

  6. "Vous" may be used to mean "you" instead of "Tu" in a formal context, such as with strangers or elders.