Guide to Machine Learning

This guide covers fundamental topics across machine learning, with particular attention to my own research area: how to speed up inference.

Large Language Models

Here are several posts that describe the architecture of Large Language Models. They are designed to be read back-to-back, building up from high-level intuition to a fully precise description.

Large Language Models, by virtue of their sheer size, face a distinctive inference problem: latency is dominated by memory bandwidth, not by compute throughput. Here's what that means in more detail.
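
For a rough sense of the gap, here is a back-of-the-envelope sketch. The bandwidth and FLOP figures below are assumed ballpark numbers for a modern accelerator, not measurements from any specific chip.

```python
# Back-of-the-envelope: is decoding one token memory-bound or compute-bound?
# All hardware numbers below are assumptions, roughly in the ballpark of a
# modern accelerator; swap in your own.

params = 7e9             # parameters in the model
bytes_per_param = 2      # fp16 weights
mem_bandwidth = 1.0e12   # assumed: 1 TB/s memory bandwidth
peak_flops = 100e12      # assumed: 100 TFLOP/s of fp16 compute

# At batch size 1, decoding one token reads every weight once and performs
# roughly 2 FLOPs per weight (a multiply and an add).
time_memory = params * bytes_per_param / mem_bandwidth  # time to stream weights
time_compute = 2 * params / peak_flops                  # time to do the math

print(f"memory-bound:  {time_memory * 1e3:.1f} ms/token")   # ~14 ms
print(f"compute-bound: {time_compute * 1e3:.2f} ms/token")  # ~0.14 ms
```

The arithmetic finishes two orders of magnitude before the weights finish streaming in; that gap is the sense in which decoding is memory-bound.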

As a result of this unusual combination of constraints, LLM inference can be accelerated by changing even how matrix multiplies are performed, all with the ultimate goal of reducing memory transfers.
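
As one concrete illustration of the idea (a minimal sketch, not any particular kernel): store the weights in int8 so they cost a quarter of the memory traffic of fp32, and fold the dequantization into the matmul itself. The helper names here are my own.

```python
import numpy as np

# Minimal sketch of a weight-only int8 matmul. Weights live in memory as int8
# (4x less traffic than fp32); per-row scales bring them back to the right
# range inside the product. A real kernel fuses the dequantization into the
# compute so fp32 weights never touch memory; NumPy just illustrates the math.

def quantize_weights(w_fp32):
    """Per-output-row symmetric quantization: w ~= scale * w_int8."""
    scale = np.abs(w_fp32).max(axis=1, keepdims=True) / 127.0  # (rows, 1)
    w_int8 = np.round(w_fp32 / scale).astype(np.int8)
    return w_int8, scale

def int8_matmul(x, w_int8, scale):
    """Compute y = x @ w.T, applying the scales after the product."""
    # NumPy path: cast to fp32 for the product. A real kernel would multiply
    # int8 values and accumulate in int32, scaling once per output element.
    acc = x @ w_int8.astype(np.float32).T  # (batch, rows)
    return acc * scale.T                   # dequantization folded in here

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 512)).astype(np.float32)
x = rng.normal(size=(4, 512)).astype(np.float32)

w_q, s = quantize_weights(w)
y_ref = x @ w.T
y_q = int8_matmul(x, w_q, s)
print("max abs error:", np.abs(y_ref - y_q).max())  # small quantization noise
```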

Quantization

To understand quantization, we need to understand how compression works more broadly. These posts culminate in a discussion of what it even means to use a Large Language Model as a compressor. After all, an LLM spits out natural language: what is it compressing?
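
To make the question concrete, here's a tiny sketch of the standard answer: pair the model's next-token distribution with an entropy coder, and the compressed size of a text is just its negative log-likelihood in bits. The `model_logprob` function below is a hypothetical stand-in for whatever LM API you have.

```python
import math

# Sketch of the LLM-as-compressor idea. An arithmetic coder driven by the
# model's next-token distribution spends about -log2 p(token) bits on each
# token, so the compressed size of a text is its negative log-likelihood.
# `model_logprob` is a hypothetical stand-in for any LM API returning
# log p(token | context).

def compressed_size_bits(tokens, model_logprob):
    bits = 0.0
    for i, token in enumerate(tokens):
        # An ideal entropy coder needs -log2 p bits to encode this token.
        bits += -model_logprob(tokens[:i], token) / math.log(2)
    return bits

# Toy baseline: a uniform distribution over a 50k-token vocabulary costs
# log2(50000), about 15.6 bits per token. A real LLM concentrates probability
# on likely continuations, so it spends far fewer bits on natural text.
uniform_logprob = lambda context, token: math.log(1 / 50_000)
print(compressed_size_bits(list(range(10)), uniform_logprob))  # ~156 bits
```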

To really understand quantization for neural networks, we should look at where quantization has long been applied: image compression, texture compression, video compression, and more.
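
The common core across all of those domains is uniform scalar quantization: snap real values onto a coarse grid of integers and accept some rounding error in exchange for fewer bits. A generic sketch, not any specific codec's scheme:

```python
import numpy as np

# Uniform scalar quantization, the building block shared by image, texture,
# and video codecs alike: map real values onto a grid of integers spaced
# `step` apart.

def quantize(x, step):
    """Snap each value to the nearest multiple of `step`."""
    return np.round(x / step).astype(np.int32)

def dequantize(q, step):
    return q.astype(np.float32) * step

x = np.random.default_rng(0).normal(size=1000).astype(np.float32)
step = 0.1
q = quantize(x, step)        # small integers: cheap to store and entropy-code
x_hat = dequantize(q, step)  # lossy reconstruction

# A coarser step means fewer distinct levels (better compression) at the
# cost of more error; the error is bounded by step / 2.
print("levels used:", np.unique(q).size)
print("max error:  ", np.abs(x - x_hat).max())
```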