from Guide to Machine Learning on Apr 2, 2023

Language Intuition for Transformers

Large Language Models (LLMs) have gained widespread popularity and adoption but are fairly complex to understand. In this post, we explain how an LLM works; critically, we explain how it works using just natural language.

Large Language Models (LLMs) are effectively scaled-up versions of an operation called a transformer. A transformer is specifically designed to process input text of any length1, meaning your input could be a word, a sentence, a paragraph, or even an essay. These modules can also generate output text of any length — anything from a sentence to a whole book.

This transformer performs the heavy lifting in a language model. However, transformers are fairly complicated, resulting in complex architecture diagrams and unintuitive explanations.

In this post, we'll take an intuition-first approach to avoid this information overload, introducing the intuition for how transformers function using only natural language and a tiny smidgen of math.

Here's a starter diagram that shows an LLM translating from French Je t'aime to English I love you. Our goal is to figure out what goes on in the box with the question mark.

Sequence to sequence pipeline: Diagram showing inputs and outputs of a large language model, for the example translation task of translating the French Je t'aime to the English I love you. Pink means French; Purple means English.

1. How to transform a word to a word

Let's focus on just transforming one word, first. In particular, let's say we can apply math to words, and that addition or subtraction can change a word's meaning. For example, we can transform the gender of a word:

king - man + woman = queen

We can also change the language for a word:

Je - French + English = I

Multiplication can represent quantities:

people * many = crowd

Or represent quantities we can't count:

happy * a lot = elated

For now, entertain2 the idea that we can add, subtract, multiply, and negate words, as though they were numbers.
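This word arithmetic can be sketched in code, assuming word2vec-style vectors. The four-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions learned from large text corpora.

```python
# A toy sketch of "word arithmetic". The vectors are made up for
# illustration; real word2vec embeddings are learned from text.
words = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest(vec):
    """Return the vocabulary word whose vector is closest to `vec`."""
    def dist(w):
        return sum((x - y) ** 2 for x, y in zip(words[w], vec))
    return min(words, key=dist)

# king - man + woman lands nearest to queen
result = add(sub(words["king"], words["man"]), words["woman"])
print(nearest(result))  # queen
```

Note that the result of the arithmetic is a vector, not a word, so we map it back to the closest word in the vocabulary.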

This means that we can now build simplistic language models that transform a word into another word. For example, we can build a language model that transforms genders:

(input) - man + woman = (output)

We can then use this single-word language model to transform king into queen, him into her, or actor into actress.

Say we have another single-word language model that does translation:

(input) - French + English = (output)

We can now translate individual words:

These single-word translations are all correct on their own, so we're now able to translate the first word in our example correctly.

Word to word pipeline: Diagram focusing on just translating the French Je to the English I. (1) Start with the French word Je. (2) Perform some computation on this word. (3) End up with English word I. Pink means French; Purple means English.

However, what about the rest of our example? Since this model performs single-word translation, we can only translate one word at a time. This single-word translator thus translates the prompt word by word.

Unfortunately, despite a perfect French sentence, we get broken English out.
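This failure mode can be sketched with a toy lookup table standing in for the subtract-and-add arithmetic. The table and the tokenization below are invented for illustration; the point is only that each word is translated in isolation.

```python
# Each French word maps to an English word independently, ignoring
# its neighbors — a stand-in for "(input) - French + English".
french_to_english = {"Je": "I", "t'": "you", "aime": "love"}

tokens = ["Je", "t'", "aime"]
translation = [french_to_english[t] for t in tokens]
print(" ".join(translation))  # I you love  (broken English word order)
```

Each individual word is translated correctly, but the output as a whole is ungrammatical, because no word knows anything about the words around it.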

To fix this, we need to evolve our single-word model into a multi-word model — in other words, a model that can take in many input words and generate many output words. For brevity, we will refer to the "multi-word input" by its proper name, a prompt.

2. How to use many inputs

To accept a multi-word prompt, we need to incorporate context across words. To do this, we'll take a prompt as input, then generate a new version of that prompt where every word is contextualized.

To understand why context is important, consider the word "cooler" on its own.

We don't know if this word refers to:

  1. a portable container to keep food cool
  2. a lower temperature
  3. one person being more attractive than another

In other words, we need more information to understand what "cooler" means. The neighboring words that provide this information are what we call the context for "cooler".

Say our full prompt is

Grab cheese from the cooler.

In this case, the context is "cheese from the cooler", which tells us "cooler" is a container to keep food cool.

This context is critical, as it affects how we transform and use a word from the prompt. For example, context would influence how we translate "cooler" from English to French. As a result, our model incorporates context as a first step.

  1. To do this, we start with a target word to add context to. We call this the query. In the above example, say our query is "cooler".
  2. We then iterate over all of the other words. These are the words that will provide context. We call these words the keys. In the above example, our keys are "Grab", "cheese", "from", "the".
  3. For every query-key pair, we ask, "How important is the key for understanding the query?" In other words, "How much context does one word provide another?"
  4. We then incorporate important keys into the query. Conceptually, this means contextualizing "cooler" to become "container to keep food cool". We will discuss what "contextualizing" means more formally in the next post Illustrated Intuition for Transformers.
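The four steps above can be sketched as code. This is a minimal sketch, assuming made-up three-dimensional word vectors; a real transformer learns these vectors and computes importances with a scaled dot product between learned query and key projections, rather than the raw dot product used here.

```python
import math

# Invented word vectors, for illustration only.
vecs = {
    "Grab":   [0.2, 0.9, 0.1],
    "cheese": [0.8, 0.1, 0.3],
    "from":   [0.1, 0.1, 0.1],
    "the":    [0.1, 0.2, 0.1],
    "cooler": [0.9, 0.2, 0.4],
}

def importance(query, key):
    # "How much context does this key provide the query?" — here,
    # just the dot product of the two word vectors.
    return sum(q * k for q, k in zip(vecs[query], vecs[key]))

def contextualize(query, keys):
    # Turn importances into weights that sum to 1 (a softmax), then
    # blend the key vectors into the query vector accordingly.
    scores = [importance(query, k) for k in keys]
    exps = [math.exp(s) for s in scores]
    weights = [e / sum(exps) for e in exps]
    blended = [
        sum(w * vecs[k][i] for w, k in zip(weights, keys))
        for i in range(len(vecs[query]))
    ]
    return [q + b for q, b in zip(vecs[query], blended)]

# "cheese" scores higher than "from" for the query "cooler", so it
# contributes more context to the contextualized [cooler].
ctx_cooler = contextualize("cooler", ["Grab", "cheese", "from", "the"])
```

In this sketch, the contextualized [cooler] is the original "cooler" vector plus a weighted blend of its neighbors, with the most important neighbors weighted most heavily.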

Our prompt, with a contextualized "cooler", looks like the following

Grab cheese from the (container to keep food cool).

This is just an example of how to contextualize a single word. During inference, we contextualize all words in the prompt in this way. Let's start by contextualizing our first word, Grab.

  1. Grab is our first query.
  2. For this query, we then consider all other words in the prompt to be keys. In this case, our keys are "cheese", "from", "the", "cooler".
  3. We then ask, "How important is the key 'cheese' for understanding the query 'Grab'?" We repeat this for all keys. This gives us a list of importances, one for each key.
  4. We then incorporate important keys into the query. In this case, the transformer might say that "cheese" is important for understanding "Grab". So, we incorporate "cheese" into "Grab" to get "Grab cheese".

We repeat this for all words in the prompt, treating each word as query and all other words as keys. This would then transform our original sentence:

Grab cheese from the cooler.

After contextualizing every word, we then get the following:

(Grab cheese) (cheese in the cooler) (from) (the) (container that keeps food cool).

This is a bit wordy, so to abbreviate this, I'll use square brackets to denote a contextualized word, replacing (container that keeps food cool) with [cooler]. After contextualizing every word, we then get the following:

[Grab] [cheese] [from] [the] [cooler].

We are now done with this contextualizing step. This step, which adds context to each word in our prompt, is formally called self-attention.

Now, we can pass the contextualized words from our prompt into our single-word language model. Say we have a single-word language model like the one from before, but one that translates from English to French.

input - English + French = output

Then, say the prompt is the above "Grab cheese from the cooler". Previously, we may have translated

cooler - English + French = réfrigérateur

In other words, we translated "cooler" into "refrigerator". However, now that we have contextualized versions of each word, we can use those as input instead:

[cooler] - English + French = glacière

This now successfully translates "cooler" into "a portable, insulated container for keeping food cool".

Remember that the contextualized "cooler", denoted [cooler], incorporates information across multiple words. As a result, the language model that takes [cooler] as input effectively takes in a multi-word input.

However, we are still producing single-word outputs. In the next step, we'll generate multiple output words.

3. How to generate many outputs

To generate many output words, we'll need to change our objective. We'll continue to predict just one word, but now, we'll predict the next word given the previous word. Our language model is now

generate(previous) = next

Above, generate is an unspecified equation that takes in the previous word as input and produces the next word. Let's now use a specific example.

Consider the example from before, where our language model translates the French prompt "Je t'aime" to the English "I love you". Say we have already generated the first output "I" and are now generating the second word.

  1. Contextualize the previous word I using the prompt "Je t'aime". This gives us [I].
  2. Pass the contextualized previous word into the model. generate([I]).
  3. This produces the next word, love.

We can then repeat this process again to generate the next word.

  1. Contextualize the previous word love using the prompt "Je t'aime" and the previous word "I", to produce [love].
  2. Pass all contextualized previous words into the model generate([I], [love]).
  3. This produces the next word you.

We can then continue this indefinitely, always generating the next word one at a time. This gives us multi-word outputs. Here is how our model would be executed:

generate([I]) = love
generate([I], [love]) = you

However, how do we generate the first word? And, how do we know when to stop generating? We need two special "words" that allow us to start and stop, called the "start of sequence" and "end of sequence" words, respectively.

generate([<sos>]) = I
generate([<sos>], [I]) = love
generate([<sos>], [I], [love]) = you
generate([<sos>], [I], [love], [you]) = <eos>

This gives us the multi-word output "I love you", at last.

This process — where we generate the output sequence one word at a time, using the previously generated words — is called autoregressive decoding.
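Autoregressive decoding can be sketched as a loop. The `generate` function below is a hypothetical stand-in that hard-codes the model's answers for this one example; a real model would run self-attention over the prompt and the previous words to predict each next word.

```python
def generate(prompt, previous_words):
    # Hypothetical stand-in for the model: a hard-coded next-word
    # table for the "Je t'aime" -> "I love you" example.
    next_word = {
        ("<sos>",): "I",
        ("<sos>", "I"): "love",
        ("<sos>", "I", "love"): "you",
        ("<sos>", "I", "love", "you"): "<eos>",
    }
    return next_word[tuple(previous_words)]

def decode(prompt):
    # Start from the start-of-sequence token and keep appending the
    # model's next word until it emits the end-of-sequence token.
    words = ["<sos>"]
    while words[-1] != "<eos>":
        words.append(generate(prompt, words))
    return " ".join(words[1:-1])  # drop the start/stop tokens

print(decode("Je t'aime"))  # I love you
```

The loop structure is the real takeaway: generation begins at `<sos>`, each step feeds all previously generated words back in, and generation halts when the model emits `<eos>`.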

This now concludes our pipeline for a language model. In short, we're able to take in multi-word prompts and generate multi-word outputs.


Here is a brief summary of the different components we covered, in order to build up a transformer using only natural language.

  1. The multi-word input the user provides is called the prompt.
  2. To accept multi-word prompts, we built a self-attention module, which adds context to every word in our prompt, using all other words.
  3. Context is taken from words (keys) and added to the target word (query).
  4. To produce multi-word outputs, we built an autoregressive decoding algorithm, which generates the next word one-by-one from all the previous words.
  5. To generate the first word, use the start-of-sequence token. Continue generating until we output the end-of-sequence token.

This concludes our introduction to transformers using natural language intuition. With that said, we made a few oversimplifications in this post that you should be aware of.

What we oversimplified

We made several simplifications above in order to keep the explanation in natural language.

To address these oversimplifications and also to add more rigor in our introduction, see the next post Illustrated Intuition for Transformers.

back to Guide to Machine Learning

  1. In reality, transformers support inputs up to a certain length. 

  2. The particulars of this math aren't important. For example, it's not clear what exponentiation means. The king-queen addition and subtraction example is a famous one from a paper and utility called word2vec.