Guide to Machine Learning
This guide covers fundamental topics across machine learning, with a special focus on my own research area: how to speed up inference.
Large Language Models
Here are several posts that describe the architecture of Large Language Models. They are designed to be read back to back, building up in complexity from high-level intuition to a fully precise description; a minimal self-attention sketch follows the list.
- Language Intuition for Transformers
- Illustrated Intuition for Transformers
- Illustrated Intuition for Self-Attention
- Practical Introduction to Large Language Models
- How Large Language Model training works
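To make the intuition concrete before diving into the posts, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The shapes, variable names, and random weights are purely illustrative assumptions, not taken from any of the posts above.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X:          (seq_len, d_model) token embeddings
    Wq, Wk, Wv: (d_model, d_head) projection matrices
    """
    Q = X @ Wq                                   # queries
    K = X @ Wk                                   # keys
    V = X @ Wv                                   # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # (seq_len, seq_len) similarities
    # softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # each position mixes values by attention weight

# toy usage with random weights
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # shape (5, 8)
```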
Large Language Models, by virtue of their sheer size, have a unique inference bottleneck: latency is limited by memory bandwidth, not by compute. Here's what that means in more detail, with a back-of-the-envelope sketch after the list.
- Why Large Language Model inference is memory bound
- How speculative decoding works
- How speculative sampling works
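To see why memory bandwidth is the binding constraint, here is a back-of-the-envelope calculation. The numbers (a 7B-parameter model in fp16, roughly 1 TB/s of memory bandwidth) are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-the-envelope: why autoregressive decoding is memory-bound.
# Generating one token at batch size 1 requires streaming every weight
# from memory at least once, so memory traffic sets a hard latency floor.
# All numbers below are illustrative assumptions, not measurements.

params = 7e9                 # e.g. a 7B-parameter model
bytes_per_param = 2          # fp16/bf16 weights
memory_bandwidth = 1e12      # ~1 TB/s, ballpark for a modern accelerator

bytes_per_token = params * bytes_per_param          # ~14 GB moved per token
min_latency = bytes_per_token / memory_bandwidth    # ~14 ms per token
max_tokens_per_sec = 1 / min_latency                # ~70 tokens/s ceiling

print(f"lower bound on latency:   {min_latency * 1e3:.0f} ms/token")
print(f"upper bound on throughput: {max_tokens_per_sec:.0f} tokens/s")

# The compute for one token (~2 * params FLOPs, about 14 GFLOPs) would take
# well under a millisecond on the same hardware, so bandwidth rather than
# compute is the binding constraint; this is why batching, quantization, and
# speculative decoding help, since they amortize or reduce the bytes moved
# per generated token.
```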
Because of this unique combination of challenges, LLM inference can be accelerated even by changing how matrix multiplies are performed, all with the ultimate goal of reducing memory transfers; a tiled-multiply sketch follows the list.
- How to tile matrix multiplication
- When to tile two matrix multiplies
- When to fuse multiple matrix multiplies
- How Flash Attention works
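As a rough illustration of tiling, here is a minimal NumPy sketch of a blocked matrix multiply. Real GPU kernels choose the tile size so that the working tiles fit in fast on-chip memory; in this sketch the tiling only restructures the loops, so it shows the access pattern rather than the speedup. The function name and tile size are illustrative.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: C = A @ B computed tile by tile.

    Real kernels pick `tile` so the A-tile, B-tile, and C-tile all fit in
    on-chip memory (shared memory / registers), so each element of A and B
    is read from slow memory far fewer times.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # accumulate one tile of C from one tile of A and one tile of B
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

# sanity check against NumPy's own matmul
A = np.random.rand(200, 300).astype(np.float32)
B = np.random.rand(300, 150).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```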
Quantization
To understand quantization, we first need to understand how compression works more broadly. These posts culminate in a discussion of what it even means to use a Large Language Model as a compressor: an LLM spits out natural language, after all, so what is it compressing? Along the way, we look at prior applications of quantization in image compression, texture compression, video compression, and more.
- Are there really only 17 million colors?
- How image compression works
- How video compression works
- How NormalFloat4 works
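For a concrete feel for what quantizing neural-network weights means, here is a minimal sketch of blockwise absmax quantization to 4-bit integers. This is not the NormalFloat4 algorithm itself (NF4 replaces the uniform integer grid with a non-uniform grid matched to normally distributed weights); the function names, block size, and toy weights are illustrative assumptions.

```python
import numpy as np

def quantize_blockwise(w, block=64, bits=4):
    """Blockwise absmax quantization: store one low-bit integer per weight
    plus one float scale per block of `block` weights."""
    levels = 2 ** (bits - 1) - 1                        # 7 for signed 4-bit
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / levels
    q = np.clip(np.round(w / scales), -levels, levels).astype(np.int8)
    return q, scales

def dequantize_blockwise(q, scales):
    """Reconstruct approximate weights from integers and per-block scales."""
    return (q * scales).astype(np.float32)

weights = np.random.randn(1024).astype(np.float32)     # stand-in for model weights
q, scales = quantize_blockwise(weights)
approx = dequantize_blockwise(q, scales).reshape(-1)
print("max abs reconstruction error:", np.abs(weights - approx).max())
```

Storing 4-bit integers plus one scale per 64-weight block is roughly a 4x reduction versus fp16, at the cost of the reconstruction error printed above; schemes like NF4 shrink that error further by choosing better grid points.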
Frameworks
Want more tips? Drop your email, and I'll keep you in the loop.