KV Cache
Everything comes at a cost
This post assumes some familiarity with the Transformer architecture. If you need a refresher, head over here: Visualizer
Imagine you have a friend who knows the most fascinating stories, but they have a very strange quirk. Every time they want to add a new sentence to their story, they insist on starting entirely from the beginning.
If they wanted to tell you about their morning, it would sound like this:
I woke up
I woke up and made coffee
I woke up and made coffee, and read the news
If you had to listen to someone speak like this, you’d probably lose your mind before they finished a single paragraph. When chatbots were first introduced to us, they worked just like this under the hood, and in some sense they still do.
To understand why AI used to be so painfully inefficient and the brilliant trick engineers invented to fix it, we need to look under the hood at how these models actually think.
Repeat Repeat 🐸
As we’ve explored before, Large Language Models (LLMs) are autoregressive. They predict just one token (roughly a word or a piece of a word) at a time.
But to predict that next token accurately, the model can’t just guess blindly. It needs context. It has to look back at the entire prompt you gave it, plus every single word it has generated so far, to figure out what comes next.
In a naive inference setup (one with no caching), this creates a massive bottleneck. For every single new word the model generates, it recalculates the mathematical relationships (the “attention”) of all the previous words from scratch. It’s exactly like our annoying storyteller. As the text gets longer, the model has to repeat more and more of the exact same calculations, making the entire process slower and slower with every word it types. Yuck, so inefficient.
When you are serving millions of users, repeating work like this is a recipe for crashing your servers.
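To put a number on that repeated work, here is a minimal sketch that counts how many tokens a cache-free model has to re-process at each step. The function name and the prompt and output lengths are made up for illustration, and counting tokens is only a rough stand-in for real compute.

```python
def tokens_processed_per_step(prompt_len, new_tokens):
    """Tokens a cache-free model re-processes at each generation step."""
    work = []
    sequence_len = prompt_len
    for _ in range(new_tokens):
        work.append(sequence_len)  # the whole sequence so far is recomputed
        sequence_len += 1          # the newly generated token joins the sequence
    return work


per_step = tokens_processed_per_step(prompt_len=10, new_tokens=5)
print(per_step)       # [10, 11, 12, 13, 14]
print(sum(per_step))  # 60
```

Five new tokens cost 60 token passes here, and every extra token costs more than the one before it.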
Sum it up
What is 1 + 2 + 3 + 4 + 5?
Depending on how good you are with numbers, you take a moment, do the math in your head, and confidently answer: 15.
Now, what if I immediately ask you to take that sequence and add 6? You aren’t going to start over and calculate 1 + 2 + 3 + 4 + 5 + 6 (are you?). Your brain already did the heavy lifting. You simply remember the intermediate result (15), add 6 to it, and give me the new answer: 21.
Your brain automatically cached the previous information to save time.
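The same habit fits in a few lines of code. This is just a toy sketch of the running-total idea; the RunningSum class is a made-up name for illustration.

```python
class RunningSum:
    """Remembers the total so far instead of re-adding every number."""

    def __init__(self):
        self.total = 0      # the cached intermediate result
        self.additions = 0  # how many additions we actually performed

    def add(self, n):
        self.total += n     # one addition, no matter how long the history is
        self.additions += 1
        return self.total


rs = RunningSum()
for n in [1, 2, 3, 4, 5]:
    rs.add(n)
print(rs.total)   # 15
print(rs.add(6))  # 21
```

Getting from 15 to 21 took a single addition, because the earlier work was remembered rather than redone.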
KV Caching
In AI, we are often just trying to mimic how our brain works, probably as a slightly cheaper version. So computer scientists applied the same concept to LLMs and introduced a technique called Key-Value (KV) Caching.
Instead of recomputing the entire history of words from scratch, the AI now saves its “intermediate math”. When the model processes a word, it calculates mathematical representations called Keys (K) and Values (V), and it stores them in a cache.
When it’s time to predict the next word, the model doesn’t start over. It simply reaches into the cache, retrieves the stored Keys and Values of the previous words, and only does the new math required for the single new word. Once that new word is generated, its Keys and Values get added to the cache, and the cycle continues.
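Here is a minimal sketch of that cycle for a single attention head, in plain Python. Everything in it is an assumption for illustration: the toy size D, the random weights W_q, W_k, W_v, and the KVCache class. A real model has many layers and heads and runs on tensors, not lists. The sketch only checks that attending over cached Keys and Values gives the same answer as recomputing them all.

```python
import math
import random

D = 4  # toy embedding size (hypothetical)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(D)]


random.seed(0)


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(D)] for _ in range(D)]


W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(6)]


def last_output_no_cache(xs):
    """Recompute K and V for every token, then attend from the last one."""
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    return attend(matvec(W_q, xs[-1]), keys, values)


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, store its K and V, attend over the cache."""
        self.keys.append(matvec(W_k, x))
        self.values.append(matvec(W_v, x))
        return attend(matvec(W_q, x), self.keys, self.values)


cache = KVCache()
max_diff = 0.0
for t, x in enumerate(tokens):
    cached = cache.step(x)
    full = last_output_no_cache(tokens[: t + 1])
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, full)))

print("largest difference between cached and recomputed outputs:", max_diff)
```

The cached path computes one Key and one Value per step, while the recompute path rebuilds all of them every time, yet both produce the same output for the newest token.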
Speed Versus Memory
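The results of this simple trick are staggering. By avoiding repeated work, KV caching can make text generation over 5 times faster in some setups, though the exact gain depends on the model, the hardware, and how long the text is. More importantly, the time per new word stays far more stable as the text grows. Each step still has to look at every cached Key and Value, but it no longer recomputes them, which is what makes extended conversations and large documents practical.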
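For a rough feel of the savings, the sketch below compares how many tokens get processed in total with and without a cache. The function names and the lengths are hypothetical, and this counts token passes, not wall-clock time, so treat the ratio as an illustration and not a benchmark.

```python
def total_without_cache(prompt_len, new_tokens):
    """Every step re-processes the full sequence: p + (p+1) + ... + (p+n-1)."""
    return new_tokens * prompt_len + new_tokens * (new_tokens - 1) // 2


def total_with_cache(prompt_len, new_tokens):
    """The prompt is processed once, then one token per later step."""
    return prompt_len + new_tokens - 1


for n in (5, 50, 400):
    naive = total_without_cache(100, n)
    cached = total_with_cache(100, n)
    print(f"{n} new tokens: {naive} vs {cached} token passes "
          f"({naive / cached:.1f}x fewer with a cache)")
```

The gap widens as the output gets longer, since the uncached total grows with the square of the number of new tokens.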
But again, it’s a trade-off.
You are trading computation for memory. By not recalculating the math, the processor does a lot less work. However, the computer has to store all of those intermediate Keys and Values somewhere. As the conversation gets longer, the cache grows, consuming more and more memory.
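To see how quickly that memory adds up, here is a back-of-the-envelope sketch. The cache stores one Key and one Value vector per token, per layer, per attention head. The model dimensions below are hypothetical round numbers, not the specs of any particular model, and real systems often shrink the cache with tricks this sketch ignores.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size; the leading 2 covers Keys plus Values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model: 32 layers, 32 heads of size 128, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

With these assumed numbers, a single 4096-token conversation needs about 2 GiB for the cache alone, and the figure scales linearly with both conversation length and the number of users served at once.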
That’s a wrap for today. I hope you enjoyed the article and got something out of it. Before I say goodbye, here’s a quote I’ve been pondering:
“The direction is enough to make the next choice.”
If you’ve made it this far, please don’t forget to share it with your friends, family, and strangers.
Have a Great Day 💖




