Speculative Decoding

understanding how AIs learned to write faster

May 04, 2026

Ever noticed how modern AI models like ChatGPT or Claude type out their answers word... by... word? It’s mesmerizing, like watching an invisible ghost press the keys. But why does a supercomputer worth millions of dollars type like my grandpa?

To understand this, we have to look at how Large Language Models (LLMs) actually think. They are what computer scientists call “autoregressive”→ meaning they predict just one word (or token) at a time. And to predict that single next token, the AI has to look at the entire prompt plus every single word it has generated so far.

But here is where we run into a brick wall, and it connects perfectly to something we talked about in our last article.

The Master Chef is Waiting Again

Remember our exploration of the Roofline Model? We talked about the Master Chef (the processor) and the slow Waiter (the memory bandwidth).

When an LLM predicts a token, its massive “brain”, the billions of model weights lives in the computer’s memory. For every single word it generates, the GPU has to load those massive weights from memory into its compute cores.

The actual math to predict the word? Lightning fast. But fetching the data? Incredibly slow. Once again, our world-class Master Chef does a microsecond of chopping and then just stands around waiting for the waiter to bring the next potato. This is the exact definition of being Memory Bound.

With LLMs serving millions of users, predicting one token at a time at this pace simply doesn’t scale. Something had to change. We couldn’t make the memory physically faster, so software engineers did something brilliant: they changed the architecture of how the Chef works.

Enter the Sous-Chef (Speculative Decoding)

The solution is a clever technique called Speculative Decoding. Instead of just relying on the massive, slow-to-load target model, engineers pair it with a tiny “draft model”.

Think of the draft model as a scrappy, fast, but slightly less accurate Sous-Chef. Because this draft model is tiny, it doesn’t get bottlenecked by memory. It can run ahead and quickly guess (or draft) the next 3 to 5 words in a fraction of a second.

Then, it hands those drafted words to the massive target model (the Master Chef). Because of how GPU architecture is designed, the massive target model can load its heavy weights just once and use a technique called causal masking to verify all 5 of the Sous-Chef’s guesses at the exact same time.

Instead of doing one word per pass, the big model is confirming a whole batch of words at once, drastically speeding up the output while maintaining the exact same high-quality answers.

But What if the Sous-Chef is Wrong?

Now, I know what you’re thinking what happens if the draft model makes a mistake? What if the target model looks at the drafted tokens and completely disagrees?

Let’s say the prompt is, “The capital of France is...” The eager Sous-Chef rapidly drafts: “Paris, which is a nice...” The Master Chef evaluates all five words at once. It agrees with “Paris,” “which,” and “is.” But it rejects “a” and “nice,” because the Master Chef wants to say “beautiful.”

Here is the magic: the system simply accepts the first three correct words, throws away the rejected token and everything after it, and immediately inserts its own correct word (“beautiful”). The Sous-Chef is then told to start drafting the next batch of words from “beautiful”. Even with the rejection, we still got four words (three drafted + one corrected) for the “price” of loading the big model’s weights once. It’s a massive win.

The Acceptance Rate

The entire success of this trick hinges on a metric called the Acceptance Rate → how often the Master Chef actually agrees with the Sous-Chef.

High Acceptance Rate: If you ask the AI to write structured code or a JSON file, the next words are highly predictable. The two models will agree almost constantly, and the text will fly onto your screen.
Low Acceptance Rate: If you ask the AI to write a creative, freeform poem, the models will likely diverge very early on. The Master Chef will keep rejecting the draft, and you won’t gain much speed.

Do We Always Use It?

You might wonder why we don’t just leave this turned on all the time. It all comes down to server traffic.

If it’s the middle of the night and server traffic is low, the GPUs have spare capacity. Running the draft model is essentially free real estate, so Speculative Decoding is a no-brainer. But during a high-traffic rush hour, the server is already pushed to its limits. Adding a second draft model into the mix would just increase memory pressure, slowing down the overall system just to speed up an individual request.

To manage this, engineers use launch flags to toggle the feature, route traffic to separate endpoints based on the task, or rely on managed platforms that balance continuous batching and speculative decoding automatically behind the scenes.

The next time you watch an AI type out an answer with sudden, rapid bursts of words, you’ll know exactly what’s happening. A tiny Sous-Chef is furiously guessing ahead, and the Master Chef is nodding along.

That’s a wrap for today. I hope you enjoyed reading the article, understood it, and before I say goodbye for today, here’s a quote I’ve been pondering:

“Relationships are probably the most important part of life. Take care of the great ones."

If you’ve made it this far, please don’t forget to share it with your friends, family, and strangers.

Have a Great Day 💖

Discussion about this post

Ready for more?