4 posts tagged with "kv-cache"

The KV Cache Eviction Your Provider Called Cache Pressure and Your Bill Called a Doubled Prefix Charge

June 3, 2026 · 11 min read

Software Engineer

Your application opens a long conversation with a forty-thousand-token system prompt and a full tool inventory. Turn 1 pays the prefix at write rates and the provider's KV cache warms up. Turn 2 arrives ninety seconds later. You assume it's a cache hit. Sometimes it is. Sometimes the same forty thousand tokens land on your invoice again at uncached prices, and nothing in your code changed between turn 1 and turn 2.

The thing that changed was somebody else's traffic. The KV cache is shared infrastructure. Your tenant was colocated on a serving node that, in the ninety seconds between your two turns, took on enough other tenants to evict your prefix from memory. The provider's dashboard will describe this as "cache pressure." Your finance team will describe it as a line item that doubled. Both descriptions are accurate. Neither is in your code.
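The gap between the two outcomes can be sketched with simple arithmetic. The per-million-token rates below are hypothetical placeholders rather than any provider's published prices, and the helper names are invented for illustration; only the forty-thousand-token prefix comes from the scenario above.

```python
PREFIX_TOKENS = 40_000  # prefix size from the scenario above

# Hypothetical prices in USD per million input tokens (assumptions, not real rates).
UNCACHED_RATE = 3.00     # normal input price, paid on a cache miss
CACHE_WRITE_RATE = 3.75  # price for writing the prefix into the cache
CACHE_READ_RATE = 0.30   # price for reading a warm prefix


def prefix_cost(tokens: int, rate_per_million: float) -> float:
    """Cost in USD of processing `tokens` prefix tokens at the given rate."""
    return tokens * rate_per_million / 1_000_000


def two_turn_prefix_cost(turn2_hit: bool) -> float:
    """Prefix cost across turn 1 (cache write) and turn 2 (hit or miss)."""
    turn1 = prefix_cost(PREFIX_TOKENS, CACHE_WRITE_RATE)
    turn2_rate = CACHE_READ_RATE if turn2_hit else UNCACHED_RATE
    return turn1 + prefix_cost(PREFIX_TOKENS, turn2_rate)


hit_cost = two_turn_prefix_cost(turn2_hit=True)
miss_cost = two_turn_prefix_cost(turn2_hit=False)
print(f"turn 2 hit:  ${hit_cost:.3f}")
print(f"turn 2 miss: ${miss_cost:.3f}")
```

Under these assumed rates, the miss on turn 2 costs ten times what the hit would have, and the code that produced both outcomes is identical.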

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

May 31, 2026 · 10 min read

Tian Pan

Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800 ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. Turn twelve's prefill is billed in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.
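Even though the eviction itself cannot be proven, the symptom can at least be flagged from your own per-turn usage records. The following is a minimal sketch, assuming you log a cached-input-token count for each turn; `TurnUsage`, its field names, and the 0.9 threshold are all hypothetical and would need mapping onto whatever usage metadata your provider returns. A flagged turn is consistent with an eviction but is not evidence of one, since a cache entry can also go cold for other reasons such as expiry.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    turn: int
    cached_input_tokens: int  # hypothetical field name for the cached-token count
    ttft_ms: float            # time-to-first-token, kept for context in reports


def suspected_evictions(turns, prefix_tokens, warm_fraction=0.9):
    """Return turn numbers where a warm prefix went cold.

    A turn counts as warm when at least `warm_fraction` of the prefix was
    served from cache. Only warm -> cold transitions are flagged, so the
    first (cache-writing) turn of a conversation is never reported.
    """
    flagged = []
    was_warm = False
    for t in turns:
        is_warm = t.cached_input_tokens >= warm_fraction * prefix_tokens
        if was_warm and not is_warm:
            flagged.append(t.turn)
        was_warm = is_warm
    return flagged


# Numbers taken from the scenario above: 28,000-token prefix, turn twelve goes cold.
history = [
    TurnUsage(turn=10, cached_input_tokens=27_800, ttft_ms=800),
    TurnUsage(turn=11, cached_input_tokens=27_800, ttft_ms=800),
    TurnUsage(turn=12, cached_input_tokens=0, ttft_ms=3_400),
    TurnUsage(turn=13, cached_input_tokens=27_800, ttft_ms=800),
]
flagged_turns = suspected_evictions(history, prefix_tokens=28_000)
print(flagged_turns)
```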

What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.

Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, the chunk size used for chunked prefill: none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.

This post explains what's happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can do about it.

GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint, and that miscalculation typically means they're paying 3x what is necessary.

The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.
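As a rough illustration of the kind of calculation involved, here is a sketch of the commonly used estimate for weight and KV cache memory. It is not necessarily the article's own formula, and the model shape (80 layers, 8 KV heads, head dimension 128, 70 billion parameters) is an assumed configuration chosen only because it matches the "140 GB at FP16" figure. Real deployments also spend memory on activations, fragmentation, and runtime overhead, which this ignores.

```python
def weight_bytes(num_params: int, bytes_per_param: float) -> float:
    """Memory for model weights: parameter count times bytes per parameter."""
    return num_params * bytes_per_param


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """Estimated KV cache size in bytes.

    The leading 2 accounts for storing one key and one value vector
    per layer, per KV head, per token.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * seq_len * batch_size)


# Assumed model shape (hypothetical, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
PARAMS = 70_000_000_000

fp16_weights = weight_bytes(PARAMS, 2)    # 16-bit weights
int4_weights = weight_bytes(PARAMS, 0.5)  # 4-bit quantized weights
per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch_size=1)
kv_total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch_size=8)

print(f"FP16 weights: {fp16_weights / 1e9:.0f} GB")
print(f"INT4 weights: {int4_weights / 1e9:.0f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache, 8 sequences x 4096 tokens: {kv_total / 2**30:.0f} GiB")
```

With these assumed numbers, the KV cache for eight concurrent 4,096-token sequences is small next to the FP16 weights, while quantizing the weights changes the total far more than either concurrency or context length does at this scale.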
