from Guide to Machine Learning on Apr 2, 2023

Language Intuition for Transformers

Large Language Models (LLMs) have gained widespread popularity and adoption but are fairly complex to understand. In this post, we explain how an LLM works; critically, we explain how it works using just natural language.

LLMs are effectively scaled-up versions of an architecture called a transformer. The transformer performs the heavy lifting in a language model, allowing the model to accept multiple words as input and generate multiple words as output. However, transformers are fairly complicated, which leads to complex architecture diagrams and unintuitive explanations.

In this post, we'll take an intuition-first approach to avoid this information overload, introducing the intuition for how transformers function using only natural language and a tiny smidgen of math.

Applying math to words

Let's say we can apply math to words, and that addition or subtraction can change a word's meaning. For example, we can transform the gender of a word:

king - man + woman = queen

We can also change the language for a word:

Je - French + English = I

Multiplication can represent quantities:

people * many = crowd

Or represent quantities we can't count:

happy * a lot = elated

For now, entertain the idea that we can add, subtract, multiply, negate, and so on with words, as though they were numbers (see note 1 at the end of this post).

This means that we can now build simplistic language models that transform a word into another word. For example, we can build a language model that transforms genders:

input - man + woman = output

We can then use this single-word language model to transform king into queen, him into her, or actor into actress.
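To make this a little more concrete, here is a minimal sketch of the gender-swapping model. It assumes each word is stored as a small list of numbers; the two-number vectors below (roughly "royalty" and "femaleness") are invented for illustration, and the helper names `nearest` and `swap_gender` are hypothetical.

```python
import math

# Hypothetical 2-D word vectors: (royalty, femaleness).
# The values are made up purely to illustrate the arithmetic.
VECTORS = {
    "king": (1.0, 0.0),
    "queen": (1.0, 1.0),
    "man": (0.0, 0.0),
    "woman": (0.0, 1.0),
    "actor": (0.5, 0.0),
    "actress": (0.5, 1.0),
}


def nearest(vector):
    """Return the known word whose vector is closest to `vector`."""
    return min(VECTORS, key=lambda word: math.dist(VECTORS[word], vector))


def swap_gender(word):
    """Compute input - man + woman, then look up the closest known word."""
    result = tuple(
        x - m + w
        for x, m, w in zip(VECTORS[word], VECTORS["man"], VECTORS["woman"])
    )
    return nearest(result)


print(swap_gender("king"))   # queen
print(swap_gender("actor"))  # actress
```

The lookup in `nearest` is needed because arithmetic on vectors produces a new vector, not a word; we then pick whichever known word sits closest to the result.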

Say we have another single-word language model that does translation:

input - French + English = output

We can translate Je into I, tu into you, and aime into love. Unfortunately, since this model performs single-word translation, we can only perform word-by-word translation to transform "Je t'aime" into "I you love". Despite a perfect French sentence, we get broken English out.
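The word-by-word limitation can be sketched in a few lines. Here a plain dictionary stands in for the "input - French + English" arithmetic; the dictionary, the function name `translate_word_by_word`, and the way "Je t'aime" is split into three pieces are all assumptions made for this illustration.

```python
# Hypothetical single-word translation table standing in for
# "input - French + English = output".
FRENCH_TO_ENGLISH = {
    "je": "I",
    "tu": "you",
    "t'": "you",  # assumed: the elided pronoun in "t'aime" maps to "you"
    "aime": "love",
}


def translate_word_by_word(french_words):
    """Translate each word independently, ignoring its neighbors."""
    return [FRENCH_TO_ENGLISH[word] for word in french_words]


# Assumed split of "Je t'aime" into single words.
print(" ".join(translate_word_by_word(["je", "t'", "aime"])))  # I you love
```

Each word is translated correctly, yet the output keeps French word order, because no word ever sees its neighbors.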

To fix this, we need to evolve our single-word model into a multi-word model — in other words, a model that can take in many input words and generate many output words. For brevity, we will refer to the "multi-word input" by its proper name, a prompt.

Accepting many input words

To accept a multi-word prompt, we need to incorporate context across words. To do this, we'll take a prompt as input, then generate a new version of that prompt where every word is contextualized.

To understand why context is important, consider the following word on its own.

cooler

We don't know if this word refers to:

  1. a portable container to keep food cool
  2. a lower temperature
  3. one person being more attractive than another

In other words, we need more information to understand what "cooler" means. The neighboring words that provide this information are what we call the context for "cooler".

Say our full prompt is

Grab cheese from the cooler.

In this case, the context is "Grab cheese from the", which tells us "cooler" is a container to keep food cool.

This context is critical, as it affects how we transform and use a word from the prompt. For example, context would influence how we translate "cooler" from English to French. As a result, our model incorporates context as a first step.

  1. To do this, we start with a target word to add context to. We call this the query. In the above example, say our query is "cooler".
  2. We then iterate over all of the other words. These are the words that will provide context. We call these words the keys. In the above example, our keys are "Grab", "cheese", "from", "the".
  3. For every query-key pair, we ask, "How important is the key for understanding the query?"
  4. We then incorporate important keys into the query. Conceptually, this means contextualizing "cooler" to become "container to keep food cool".

Our prompt, with a contextualized "cooler", looks like the following:

Grab cheese from the [container to keep food cool].

This is a bit wordy, so we'll abbreviate it by using square brackets to denote a contextualized word, like this:

Grab cheese from the [cooler].

We repeat this for all words in the prompt, treating each word as the query and all other words as keys. After contextualizing every word, we get the following:

[Grab] [cheese] [from] [the] [cooler].

We are now done with this contextualizing step. This step, which adds context to each word in our prompt, is formally called self-attention.
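The four steps above can be sketched in code. This is only a rough illustration under several assumptions that the post does not specify: words are small invented vectors, "importance" is measured with a dot product, importance scores are normalized with a softmax, and the weighted keys are simply added to the query. The function name `contextualize` is hypothetical.

```python
import math

PROMPT = ["Grab", "cheese", "from", "the", "cooler"]

# Hypothetical 2-D word vectors: (food-related, temperature-related).
VECTORS = {
    "Grab": (0.1, 0.0),
    "cheese": (1.0, 0.2),
    "from": (0.0, 0.0),
    "the": (0.0, 0.0),
    "cooler": (0.5, 1.0),
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(score - largest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def contextualize(index, words):
    """Add context from every other word (the keys) to words[index] (the query)."""
    query = VECTORS[words[index]]
    keys = [word for i, word in enumerate(words) if i != index]
    # Step 3: how important is each key for understanding the query?
    weights = softmax([dot(query, VECTORS[key]) for key in keys])
    # Step 4: mix the keys by importance and fold the mix into the query.
    context = [
        sum(weight * VECTORS[key][d] for weight, key in zip(weights, keys))
        for d in range(len(query))
    ]
    contextualized = tuple(q + c for q, c in zip(query, context))
    return contextualized, dict(zip(keys, weights))


vector, importance = contextualize(PROMPT.index("cooler"), PROMPT)
print(max(importance, key=importance.get))  # cheese
```

With these made-up vectors, "cheese" is the most important key for "cooler", which matches the intuition that "cheese" is what tells us the cooler holds food.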

Now we can pass contextualized words from our prompt into a single-word language model. Say we have a single-word language model like the previous one, except that it translates from English to French.

input - English + French = output

Then, say the prompt is the above "Grab cheese from the cooler". Previously, we might have translated:

cooler - English + French = réfrigérateur

In other words, we translated "cooler" into "refrigerator". However, now that we have contextualized versions of each word, we can use the contextualized word as input instead:

[cooler] - English + French = glacière

This now successfully translates "cooler" into "glacière", a portable, insulated container for keeping food cool. Remember that the contextualized "cooler", denoted [cooler] above, incorporates information across multiple words; as a result, this language model actually takes in multi-word inputs.

However, we are still producing single-word outputs. In the next step, we'll generate multiple output words.

Generating many output words

To generate many output words, we'll need to change our objective. We'll continue to predict just one word, but now we'll predict the next word given the previous word. Our language model is now:

generate(previous) = next

Above, generate is an unspecified function that takes in the previous word as input and produces the next word. Let's now use a specific example.

Consider the example from before, where our language model translates the French prompt "Je t'aime" to the English "I love you". Say we have already generated the first output "I" and are now generating the second word.

  1. Contextualize the previous word I using the prompt "Je t'aime". This gives us [I].
  2. Pass the contextualized previous word into the model. generate([I]).
  3. This produces the next word, love.

We can then repeat this process again to generate the next word.

  1. Contextualize the previous word love using the prompt "Je t'aime" and the previous word "I", to produce [love].
  2. Pass all contextualized previous words into the model: generate([I], [love]).
  3. This produces the next word you.

We can then continue this indefinitely, always generating the next word one at a time. This gives us multi-word outputs. Here is how our model would be executed:

generate([I]) = love
generate([I], [love]) = you

However, how do we generate the first word? And how do we know when to stop generating? We need two special "words" that allow us to start and stop, called the "start of sequence" (<sos>) and "end of sequence" (<eos>) words, respectively.

generate([<sos>]) = I
generate([<sos>], [I]) = love
generate([<sos>], [I], [love]) = you
generate([<sos>], [I], [love], [you]) = <eos>

This gives us the multi-word output "I love you", at last.

This process — where we generate the output sequence one word at a time, using the previously-generated words — is called autoregressive decoding.
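The decoding loop itself is short. In the sketch below, `generate` is a hypothetical stand-in that just looks up the four steps listed above in a table; a real model would compute the next word from the contextualized prompt and previous words. The `max_words` guard is also an assumption, added so the loop cannot run forever if <eos> never appears.

```python
SOS, EOS = "<sos>", "<eos>"

# Hypothetical stand-in for the model: maps all previous words to the next word.
NEXT_WORD = {
    (SOS,): "I",
    (SOS, "I"): "love",
    (SOS, "I", "love"): "you",
    (SOS, "I", "love", "you"): EOS,
}


def generate(previous_words):
    """Predict the next word from all previously generated words."""
    return NEXT_WORD[tuple(previous_words)]


def decode(max_words=10):
    """Generate one word at a time until the end-of-sequence word appears."""
    words = [SOS]
    for _ in range(max_words):
        next_word = generate(words)
        if next_word == EOS:
            break
        words.append(next_word)
    return words[1:]  # drop the start-of-sequence word


print(" ".join(decode()))  # I love you
```

Note that every call to `generate` receives the full list of words produced so far, which is what makes the process autoregressive.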

This now concludes our pipeline for a language model. In short, we're able to take in multi-word prompts and generate multi-word outputs.


Here is a brief summary of the different components we covered, in order to build up a transformer using only natural language.

  1. The multi-word input the user provides is called the prompt.
  2. To accept multi-word prompts, we built a self-attention module, which adds context to every word in our prompt, using all other words.
  3. Context is taken from words (keys) and added to the target word (query).
  4. To produce multi-word outputs, we built an autoregressive decoding algorithm, which generates the next word one-by-one from all the previous words.
  5. To generate the first word, use the start-of-sequence token. Continue generating until we output the end-of-sequence token.

This concludes our introduction to transformers using natural language intuition. With that said, we made a few oversimplifications in this post that you should be aware of.

What we oversimplified

There are several simplifications above that were necessary to convey a natural-language explanation.

To address these oversimplifications and also to add more rigor in our introduction, see the next post Illustrated Intuition for Transformers.

  1. The particulars of this math aren't important. For example, it's not clear what exponentiation of words would mean. The king-queen addition and subtraction example is a famous one from word2vec, which is both a paper and a software tool.

Got a question? Ask me on Twitter, at @lvinwan.
