*from* Guide to Machine Learning *on Apr 9, 2023*

# Illustrated Intuition for Transformers

Large Language Models (LLMs) are large neural networks that take text as input and generate text as output. Critically, they're built from a repeated building block called a *transformer*. In this post, we'll visually construct this building block from first principles: we'll motivate and discuss the intuition behind each piece and make connections to other concepts in machine learning.

Transformers are specifically designed to process text of any length^{1}, meaning your input could be a word, a sentence, a paragraph, or even an essay. These modules can also generate *output text* of any length — anything from a sentence to a whole book.

In this post, we'll focus on how the transformer achieves the above. To better explain ideas, we'll use translation as an example. Here's a starter diagram that shows an LLM translating from the French `Je t'aime` to the English `I love you`.

**Sequence to sequence pipeline:** Diagram showing inputs and outputs of a large language model, for the example task of translating the French `Je t'aime` to the English `I love you`. Pink means French; purple means English.

Our goal is to figure out what goes on in the box with the question mark.

## 1. How to transform a word to a word

Let's first focus on transforming just one word. The function that transforms this one word will be called our "algorithm". As a first step, this algorithm needs to perform computation on an input word.

**Word to word pipeline:** Diagram focusing on translating just the French `Je` to the English `I`. (1) Start with the French word `Je`. (2) Perform some computation on this word. (3) End up with the English word `I`. Pink means French; purple means English.

However, in order to "perform computation on a word", we first need to convert the word into a vector. One such way is to use a one-hot encoding, illustrated below.

**How to one-hot encode**: Every dimension in your vector corresponds to a unique word, so every word maps to a vector of `0`s with a `1` in the corresponding dimension. In this example, our dictionary contains only 3 words: `Je`, `You`, and `I`. As a result, each vector is 3-dimensional. A filled-in solid square means 1; an empty square means 0. As an example, on the left, we convert `Je` to the vector `[1, 0, 0]`.
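The one-hot scheme can be sketched in a few lines of Python. This uses the toy 3-word dictionary from the figure; real models use dictionaries with tens of thousands of entries.

```python
import numpy as np

# Toy 3-word dictionary from the figure.
vocab = ["Je", "You", "I"]

def one_hot(word: str) -> np.ndarray:
    """Return a vector of 0s with a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("Je"))  # [1. 0. 0.]
```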

We now augment the process from above. Instead of processing and generating words, our algorithm processes and generates one-hot encoded vectors.

*Our process now (1) starts with the French word Je. (2) This word corresponds to the one-hot encoded vector [1, 0, 0], shown in pink. (3) Perform some computation on this vector. (4) We now have another one-hot encoded vector [0, 1, 0], shown in purple. (5) This vector corresponds to I.*

Let's now fill in the "algorithm", denoted with the question mark, to understand what sort of computation goes on.

## 2. How to make arithmetic meaningful

At a high level, in order to transform a vector from French to English, we remove the French component from our vector and add an English component. The resulting vector should be the translated word.

However, this method doesn't work in the current one-hot encoded space: In the one-hot encoded space, every word lies in its own axis. As a result, all English words lie in one subspace, and all French words lie in a separate subspace. This means it's impossible to represent common meanings across languages^{2}. In other words, there's no vector that could represent "I", independent of any language^{3}.

To fix this, we'll adjust the vectors corresponding to each word. Start with the one-hot encoded vectors. Then,

- Take a large amount of text. If two words occur in the same sentence, move the corresponding vectors for those words closer. If two words don't occur together often, move them farther apart.
- It turns out that after doing this for a long time, arithmetic becomes meaningful. For example, you can compute `king - man + woman`, which becomes `queen`.

This vector space is called word2vec.
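Here's a minimal sketch of that arithmetic. The 2-dimensional vectors below are hand-picked stand-ins, not trained word2vec embeddings (which are learned from text and have hundreds of dimensions), but they show why the offset `king - man + woman` lands nearest `queen`.

```python
import numpy as np

# Hand-picked toy vectors standing in for trained word2vec embeddings.
vectors = {
    "king":  np.array([1.0, 1.0]),   # royal + male
    "queen": np.array([1.0, 0.0]),   # royal + female
    "man":   np.array([0.0, 1.0]),   # male
    "woman": np.array([0.0, 0.0]),   # female
}

y = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the word whose vector is closest to the result.
nearest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - y))
print(nearest)  # queen
```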

**Illustration of word2vec**: We iteratively move vectors closer when those words co-occur often in text. In this case, the French words `Je` and `Tu` co-occur much more often with each other than with the English word `I`.

We can now update our pipeline to use these new word2vec vectors. Our algorithm no longer processes and generates one-hot encoded vectors. Instead, our algorithm processes and generates word2vec vectors.

**Pipeline using word2vec**: Start with the word `Je`. Look up the corresponding word2vec vector (pink). Run the algorithm on this vector to produce a new word2vec vector (purple). This then corresponds to the word `I`.

*Now*, let's try removing the French component from our input vector and adding the English component.

1. **Arithmetic:** In this word2vec space, we can compute $y = \textrm{Je} - \textrm{French} + \textrm{English}$^{6}.
2. **Nearest neighbor:** We then take the resulting vector $y$ and find the closest word2vec vector that represents a word. In our example, the closest word2vec vector corresponds to "I".

This computation now forms the first version of our algorithm. Here's the new pipeline.

**Full word2vec-based pipeline:** Start with the word `Je`. Look up the corresponding word2vec vector (pink). Subtract the French component and add the English component (pink to purple). Then, find the closest word2vec vector (purple). Finally, this word2vec vector corresponds to the word `I`.

Our pipeline now consists of three steps:

1. **Encode** our word into a vector, using word2vec.
2. **Compute** with this vector, by subtracting the French vector and adding the English vector.
3. **Decode** our vector back into a word. To do this, find the closest vector that corresponds to a word. Then, output that word.

This is now a complete, end-to-end pipeline for translating a French word into the corresponding English word.
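The whole encode-compute-decode pipeline can be sketched with toy vectors. Everything here is an assumption for illustration: the 2-D embeddings and the "French" and "English" language vectors are hand-picked so that the language offset is exact, which trained word2vec spaces only approximate.

```python
import numpy as np

# Toy 2-D embeddings: hypothetical stand-ins for a trained word2vec space.
words = {
    "Je":  np.array([1.0, 0.0]),  # "I"-meaning + French direction
    "I":   np.array([1.0, 1.0]),  # "I"-meaning + English direction
    "Tu":  np.array([2.0, 0.0]),
    "You": np.array([2.0, 1.0]),
}
french  = np.array([0.0, 0.0])    # assumed French language vector
english = np.array([0.0, 1.0])    # assumed English language vector

def translate(word: str) -> str:
    y = words[word] - french + english  # compute: remove French, add English
    # Decode: nearest neighbor among known word vectors.
    return min(words, key=lambda w: np.linalg.norm(words[w] - y))

print(translate("Je"))  # I
```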

## 3. How to generalize

So far, we've manually constructed our pipeline so that it specifically translates French to English. However, we want to learn a generic text-to-text model, not just a specific French-to-English translator. To do this, we'll generalize our pipeline by replacing all operations with matrix-vector products. This allows us to later customize these matrices, for different tasks.

**General word-to-word pipeline**: Start with a word. Encode that word as a vector. Transform the vector. Take your resulting vector, and decode it back into a word.

Recall we have 3 main operations, pictured above.

1. **Encode** our word into a vector.
2. **Transform** this vector.
3. **Decode** the output vector back into a word.

First, change how we *encode*. Previously, for the input word, we looked up the corresponding word2vec vector. In our *new* version, we perform a functionally-equivalent operation: we one-hot encode the input word, then post-multiply by a matrix of word2vec column vectors.

**Encode as a matrix-vector product**. The trapezoid represents a matrix multiply, going from 2 dimensions to 3 dimensions. Here, the input 2-dimensional vector is simply a one-hot vector, so the vector-matrix product simply selects one of the columns in our matrix. This is effectively looking up the word2vec vector for a corresponding input word, by the index of the input word.
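This equivalence is easy to check numerically. Below, the columns of a (made-up) embedding matrix are toy word2vec vectors for a hypothetical 2-word dictionary, going from 2 dimensions to 3 as in the trapezoid diagram; multiplying by a one-hot vector selects a column, i.e. performs the lookup.

```python
import numpy as np

# Embedding matrix whose columns are (toy) word2vec vectors.
E = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])    # column 0 = word 0, column 1 = word 1

x = np.array([1.0, 0.0])      # one-hot vector for word 0

# The matrix-vector product selects column 0 of E:
# exactly a table lookup of word 0's embedding.
print(E @ x)                  # [1. 2. 3.]
```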

Second, change how we *compute*. We'll represent $- \textrm{French} + \textrm{English}$ using a sequence of matrix multiplies. This allows us to perform any vector arithmetic.

**Compute** **as matrix-vector product**s. The compute step is now a series of matrix multiplies, vector additions, and non-linearities. The above representation of a neural network, a pink-to-purple network labeled "transform", encapsulates all of these components. This allows the transform step to represent any vector arithmetic.
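A minimal sketch of such a transform: two matrix multiplies with a non-linearity between them. The weights below are random placeholders; in the actual model they are learned, and could in principle implement the `- French + English` arithmetic.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Random placeholder weights; in practice these are learned.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 3)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

def transform(x: np.ndarray) -> np.ndarray:
    # Matrix multiply, vector addition, non-linearity, repeat.
    return W2 @ relu(W1 @ x + b1) + b2

y = transform(np.array([1.0, 2.0, 3.0]))
print(y.shape)  # (3,)
```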

Third, change how we *decode*. Instead of computing the nearest neighbor, we'll compute the most similar vector^{4}; maximizing similarity can then be represented as a matrix-vector product^{5}. This leads us to a final, generalized version of our pipeline.

**Decode** **as matrix-vector products**. The decoding step now computes a vector-matrix product, then computes the argmax of the output vector to find the index of the output word. In purple on the right, we represent this as a matrix multiply that translates from an input vector to a one-hot output vector. This is, under specific settings, equivalent to finding the nearest neighbor.
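A sketch of that decoding step, with toy numbers: the rows of a (made-up) output matrix are the candidate word embeddings, so one matrix-vector product scores every word's dot-product similarity at once, and `argmax` picks the output word's index.

```python
import numpy as np

# Rows of U are toy output-word embeddings for a hypothetical dictionary.
U = np.array([[1.0, 0.0, 0.0],    # "Je"
              [0.0, 1.0, 0.0],    # "I"
              [0.7, 0.7, 0.0]])   # "You"
vocab = ["Je", "I", "You"]

y = np.array([0.1, 0.9, 0.0])     # vector produced by the transform step
scores = U @ y                    # one dot product (similarity) per word
print(vocab[np.argmax(scores)])   # I
```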

For our final step, we will now *learn* these matrices instead of hard-coding them. More importantly, we can now train a model for any single-word transformation problem, taking one word as input and producing one word as output.

## 4. How to use many inputs

So far, we've transformed *one* French word into *one* English word. However, *multiple* French words may influence the English translation.

For example, in French, the word `Vous` may refer to one person^{7} or many people. We would need context to understand which meaning is intended. Take the phrase `Avez-vous faim?` for example:

- This phrase could translate into "Are *you folks* hungry?" if subsequent context implies there are multiple people. For example, `Avez-vous faim? Nous sommes cinq` translates into "Are you folks hungry? There are 5 of us".
- Alternatively, this phrase could translate into "Are *you* hungry?" if subsequent context implies just one person. For example, `Avez-vous faim? Il y a un sandwich supplémentaire` translates into "Are you hungry? There is an extra sandwich."

Knowing this, our model needs to be able to use multiple input words during the generation process. In particular, our goal in this step is to augment our pipeline to accept multiple inputs; the pipeline will go from (1) *one* input and one output to (2) *multiple* inputs and one output.

To achieve this, we will augment our encoding process. After encoding each word separately, take a weighted sum of the encodings. The goal is to add context, transforming (1) a single-word encoding into (2) a single-word encoding *with context*.

**Weighted sum of encodings**: After the "encode" step, we now have a vector representing each word. We then combine these vectors using a weighted sum, to get *one* vector. In the example above, we take `Je` and compute a weighted sum with `t'` and `aime`. The result is a vector for `Je`, but *now* with context. The vector representing `Je` with context is then fed to the "compute" step.

We will explain how the weighted sum works in more detail, in a later section. For now, our pipeline consists of 4 steps:

1. **Encode** our word into a vector.
2. **Contextualize** each vector using other vectors.
3. **Transform** this vector.
4. **Decode** the output vector back into a word.

At a high level, we now roughly understand how to translate *multiple* French words into *one* English word. More broadly, we understand how to transform *multiple* input words into *one* output word.

## 5. How to generate many outputs

In this step, we will finally augment our pipeline to transform *multiple input* words into *multiple output* words. We can start by augmenting our pipeline above: use the same method to translate each French word to English, one by one. Unfortunately, doing this turns `Je t'aime` into `I you love`.

**Translating one word at a time**: From left to right, we encode the word, add context, run computation, then decode back into a word. Note this transformation occurs word by word, one at a time. Unfortunately, single-word translation assumes that the positions of French words and their translated English words are the same, meaning `Je t'aime` is translated into `I you love`.

To fix this, we need to generate outputs in a way that does not depend on the input ordering.

Previously, *for each input word separately*, we generated its corresponding output word. Instead, we should *use all input words together* when generating each output word. To do this, we take a weighted sum of all input words, when generating the first word.

**Using all words to translate**: Instead of translating each word independently, we *use* all words to translate just the first word. From left to right, we encode the word (pink), add context, run computation (pink to purple), then take a weighted sum of all the outputs (purple) before decoding back into a word.

However, we now need a way to generate subsequent words. To do this, we reformulate the task as *next-word prediction*. In other words, our model is given the *previous* words and must predict the *next* word.

To predict the first word, we provide a special *start-of-sequence* token as the "previous" word.

**Next-word prediction** for the first word. Our model now predicts the *next* word from the input. Our input is a special "start" token (gray), which is passed to the model to generate the first word. From left to right, we encode the word, add context, run computation, then add context from the encoded input words (orange) to the start token (gray), run some more computation to transform this into the next word (orange to purple), then decode into the next word.

To generate the second word, we then feed in the first word as input. Our pipeline now consists of 6 steps:

1. **Encode** our input word into an input vector.
2. **Contextualize** each input vector using other input vectors.
3. **Transform** this input vector.
4. **Contextualize** the previous word's vector with the input vectors. If generating the first word, use a *start-of-sequence* token as the "zero-th word".
5. **Transform** this contextualized previous word's vector into the next word's vector.
6. **Decode** the output vector back into a word. Repeat this process until the output word is *end-of-sequence*.

**Next-word prediction** for the second word. Our model again predicts the next word from the input. This time, we feed in both the start token *and* the first word. Our model now predicts the first and second words in the output sequence.

We can continue this indefinitely to generate an unlimited number of output tokens. Once the decoder outputs a special *end-of-sequence* token, we stop.

**Next-word prediction termination**. Our model continues this process of feeding in the output back into the input, until we reach a special "end token", denoted in gray in the bottom right. This tells the network that the output sequence has ended.

This is how the transformer is able to output sequences of any length. Now, we have a complete pipeline for taking in multiple input words *and* generating multiple output words.

To summarize this section, we change the task to predict the *next* word. This allows us to predict multiple output words, by feeding our output as input to the model, repeatedly.
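The feed-the-output-back-in loop can be sketched as follows. The `next_word` function below is a placeholder for the real transformer: a hypothetical hard-coded table that "translates" `Je t'aime`, used only so the loop structure is visible.

```python
START, END = "<s>", "</s>"

def next_word(inputs, previous):
    # Placeholder for the trained model: given the input words and the
    # words generated so far, return the next word.
    table = {0: "I", 1: "love", 2: "you", 3: END}
    return table[len(previous) - 1]

def generate(inputs):
    previous = [START]                 # start-of-sequence token
    while previous[-1] != END:         # stop at end-of-sequence token
        previous.append(next_word(inputs, previous))
    return previous[1:-1]              # drop the special tokens

print(generate(["Je", "t'", "aime"]))  # ['I', 'love', 'you']
```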

## 6. How to scale transformers

To scale up these models in size, we repeat parts of these transformers. Start with the pipeline from before.

**Summary of next-word prediction pipeline**: From left to right, we encode the input (pink), add context, run some computation (pink to orange), and add context (orange) to the previous tokens (purple, middle). We then predict the next token as output (purple, right). We repeat this process of continually asking the model for the next word and the next word and the next… until we hit the end token.

Now, let's abstract away the details for each step in this pipeline. Every weighted sum is called "attention". The compute step is called a "multi-layer perceptron" or MLP for short.

**Attention-MLP diagram**: We summarize this process as the following. From left to right, we encode and add context to the input (pink, "attention"), transform the word (pink to orange, "MLP"), add context to the previous words (orange, "attention"), transform the word to the next word (orange to purple, "MLP"), then decode the vector back into a word (purple).

We can then group these operations together:

- Consider the (1) attention over input words and (2) MLP blocks as part of an encoder. The purpose of the encoder is to add context to each word and make any needed transformations.
- Consider the (1) attention over inputs and previous words and (2) MLP blocks as part of a decoder. The purpose of the decoder is to transform the inputs into an encoding for the next word.

**Encoder-decoder diagram**: We summarize the process as the following. From left to right, we encode all the input words. The inputs, along with the previous words `0... n-1`, are then decoded into the next output words, words `1... n`.

We can now repeat these building blocks:

- We insert multiple encoders back-to-back; every encoder passes its outputs to the next encoder as inputs.
- We also insert multiple decoders back-to-back; every decoder takes in two inputs: (1) the final encoder's outputs and (2) the previous decoder's outputs.
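The wiring of those repeated blocks can be sketched as follows. The `encoder` and `decoder` functions are trivial placeholders for the attention-plus-MLP blocks above, and the depth of 6 is an arbitrary choice for illustration.

```python
import numpy as np

def encoder(x):
    # Placeholder for attention + MLP over the input words.
    return x + 1.0

def decoder(enc_out, prev):
    # Placeholder for a block that takes (1) the final encoder's
    # outputs and (2) the previous decoder's outputs.
    return enc_out + prev

x = np.zeros(3)
for _ in range(6):       # encoders back-to-back: each feeds the next
    x = encoder(x)

h = np.zeros(3)
for _ in range(6):       # decoders back-to-back: each sees the FINAL
    h = decoder(x, h)    # encoder output plus the previous decoder output

print(x, h)  # [6. 6. 6.] [36. 36. 36.]
```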