- KV caching significantly reduces time to first token (TTFT) latency and improves throughput for LLM calls by caching previous computations and reusing them across requests.
- Experiments show that each cached input token saves approximately 0.15ms of TTFT, which adds up to a reduction of roughly 150ms when 1,000 input tokens are cached across calls.
- Implementing KV caching in Glean Assistant enhances user experience by delivering faster and more efficient responses, crucial for maintaining satisfactory user sessions.
Welcome back to our series on LLM latency! In our previous blog, we took a close look at how input token count impacts the latency of LLM chat tools. In this blog, we’ll explore how KV (key-value) caching impacts time to first token (TTFT) latency and throughput for LLM calls.
The impact of KV caches on TTFT
LLMs are autoregressive models: the generation of the ith token depends on all the tokens generated before it. This means that computing the attention scores for the ith token repeats all of the operations already done for the (i-1)th token, plus the additional computations for the latest token. That repeated work is a great opportunity to cache.
Caching these values has the following implications:
- The initiation phase, which, as we learned in the previous blog, refers to generating the first token, is unaffected by KV caching since there are no previous steps to reuse. Instead, this phase populates the KV cache for the subsequent stages.
- In the decoding phase, we no longer feed the whole sequence as input, only the last generated token together with the KV cache.
So, how does attention computation scale now? As we discussed in the last blog, computing attention scores within a Transformer involves matrix multiplications, and multiplying a matrix of shape (n, p) with another of shape (p, m) takes approximately 2*m*n*p operations. In the attention layer, m and n are both equal to the size of the context window, w, so the cost of this operation becomes 2*p*w^2, a quadratic relation. With the KV cache, however, the query for each subsequent generation is a single token: the multiplication is (1, p) by (p, w), and the cost drops to roughly 2*p*w, which is linear in the context window.
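To make the shapes concrete, here is a minimal single-head decode step in NumPy against a KV cache. Everything here (names, dimensions, the toy projections) is illustrative rather than a description of any particular model's implementation:

```python
import numpy as np

# Toy single-head attention decode step against a KV cache.
# All names, shapes, and values are illustrative.
p, w = 64, 1000                      # head dimension, tokens already in the window
rng = np.random.default_rng(0)

# Cache populated during the initiation (prefill) phase: one K and one V row per token.
k_cache = rng.standard_normal((w, p))
v_cache = rng.standard_normal((w, p))

# Projection matrices for the new token's query, key, and value.
w_q, w_k, w_v = (rng.standard_normal((p, p)) for _ in range(3))

def decode_step(new_hidden, k_cache, v_cache):
    """Attention for ONE new token. The query is a single row, so the score
    computation is a (1, p) x (p, w+1) multiply, costing about 2*p*(w+1)
    operations (linear in the window) instead of 2*p*w^2 for a full recompute."""
    q = new_hidden @ w_q                                   # (1, p)
    k_cache = np.concatenate([k_cache, new_hidden @ w_k])  # (w+1, p): append, don't recompute
    v_cache = np.concatenate([v_cache, new_hidden @ w_v])  # (w+1, p)

    scores = (q @ k_cache.T) / np.sqrt(p)                  # (1, w+1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache, k_cache, v_cache             # output is (1, p)

new_token_hidden = rng.standard_normal((1, p))
out, k_cache, v_cache = decode_step(new_token_hidden, k_cache, v_cache)
print(out.shape, k_cache.shape)                            # (1, 64) (1001, 64)
```

Real implementations preallocate the cache and run this per head and per layer, but the scaling argument is the same: each decode step only appends one K/V row and multiplies a single-row query against the cache.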
This math explains how completions within a single LLM call benefit from caching, but if the cache is persisted, it can also be reused across LLM calls. Two ways to take advantage of this are:
1. Caching in multi-turn use cases within a single conversation
2. Caching across workflows that have similar prefixes
Being able to reuse the KV cache across requests should deliver TTFT latency wins as well as throughput improvements, since concurrent requests that share a prefix can also share that prefix's memory footprint.
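As a rough illustration (not Glean's actual prompt-construction code), here is how one might order prompt segments so that the stable parts form the shared prefix across turns and across requests:

```python
# Hypothetical prompt assembly for a multi-turn assistant.
# The stable pieces go first so consecutive requests share the longest possible prefix.
SYSTEM_PROMPT = "You are a helpful enterprise assistant."    # identical across turns

def build_messages(conversation_history, new_user_message):
    """Stable prefix (system prompt + prior turns) first, volatile content last.
    Anything that changes per request, such as timestamps or request IDs, should
    come after the shared prefix, otherwise it invalidates the prefix cache."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + conversation_history                               # unchanged since the last call
        + [{"role": "user", "content": new_user_message}]    # the only new part
    )
```

The design choice is simply about ordering: the cache matches on a common prefix, so per-request content placed at the front of the prompt forfeits the hit.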
Experiment results
Luckily for us, Azure’s KV cache can persist these computations across LLM calls. Their documentation states that “the amount of throughput that you can achieve on the endpoint is a factor of the input size, output size, call rate, and cache match rate.” In fact, they suggest “mixing the calls can reduce your cache hit rate as they're both competing for the same space. When possible, it's recommended to have separate deployments for each workload”.
The data from reusing this cache across requests is shown below. Each data point was obtained from 25 GPT-4 Turbo calls using Azure's provisioned throughput units (PTUs). The cached calls used the exact same prompt for all 25 calls, while the uncached runs prepended the current request time to each prompt, invalidating the prefix cache.
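For reference, a benchmarking loop along these lines could look like the sketch below. It assumes the openai Python SDK pointed at an Azure deployment; the endpoint, key, deployment name, and prompt are placeholders rather than our exact setup:

```python
import time
from openai import AzureOpenAI  # openai Python SDK, assuming an Azure PTU deployment

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                        # placeholder
    api_version="2024-02-01",
)

LONG_PREFIX = "..."  # the shared prompt used for every call

def measure_ttft(prompt, deployment="gpt-4-turbo"):  # hypothetical deployment name
    """Time from sending the request to receiving the first streamed token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=32,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start

# Cached runs: the exact same prompt every call, so the prefix cache can hit.
cached = [measure_ttft(LONG_PREFIX) for _ in range(25)]

# Uncached runs: a per-request timestamp at the front of the prompt changes
# the prefix and invalidates the cache.
uncached = [measure_ttft(f"{time.time()} {LONG_PREFIX}") for _ in range(25)]
```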
To note here again, in case you missed our first latency blog: we used Azure's PTU model deployments to run these tests. PTUs are a way to obtain reserved processing capacity just for you, unlocking predictable performance and mitigating the latency swings seen with their pay-as-you-go services. This ensures that our data isn't muddled by data center load.
The plot below zooms into the bottom left corner for better readability.
This shows us that each cached input token saves ~0.15ms, the difference between the two slopes. 0.15ms might seem like a negligible amount, but the data illustrates that caching 1,000 tokens across calls reduces TTFT by roughly 150 ms!
Making Glean a better experience
When it comes to user experiences, we know that even a few milliseconds can make or break a satisfactory session. We’ll keep these results in mind as we continue to build and improve Glean Assistant to deliver a better, speedier service for our users.
Looking to learn more about Glean? Get a free demo today—or check out our careers page if you’re interested in helping us build the future of enterprise AI yourself!