3 posts tagged with "quantization"

GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

· 8 min read
Tian Pan
Software Engineer

Most teams serving LLMs in production are burning money on GPU capacity they don't need. The root cause isn't carelessness — it's that GPU memory sizing for LLM inference involves four interacting variables (model weights, KV cache, activation memory, and framework overhead), and getting any one wrong means you over-provision the entire stack. When you multiply that error across multiple models on shared infrastructure, the waste compounds fast.

The math itself isn't hard. But most teams never do it, because "just give it an 80GB A100" is easier than calculating whether a 48GB L40S would suffice. This article walks through the arithmetic that determines how many models you can pack onto a single GPU — and the quantization tradeoffs that make it possible.
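The four-variable budget described above can be sketched in a few lines. This is a minimal illustration with assumed example values (the function name and the numbers are hypothetical, not measurements from the article):

```python
def total_gpu_memory_gib(weights, kv_cache, activations, overhead=1.5):
    """Sum the four components that determine the real GPU footprint:
    model weights, KV cache, activation workspace, and framework
    overhead (all in GiB). The 1.5 GiB overhead default is an
    assumed placeholder, not a universal constant."""
    return weights + kv_cache + activations + overhead

# Example: a 7B model at FP16 (~14 GiB of weights) plus a modest
# KV cache budget and activation workspace already approaches the
# limit of a 24 GiB card.
print(total_gpu_memory_gib(weights=14, kv_cache=6, activations=2))
```

The point of writing it down is that under-counting any one term forces you to over-provision the whole stack.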

GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

· 9 min read
Tian Pan
Software Engineer

Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint — and that miscalculation typically means they're paying 3x more than necessary.

The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.
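As a taste of the KV-cache arithmetic involved, here is a sketch of the standard per-token sizing formula for a decoder-only transformer (the function name is mine; the example configuration is Llama-2-7B-like, assumed for illustration):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes: one K and one V tensor per layer,
    per token, at the given element width (2 bytes for FP16)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_elem * seq_len * batch_size)

# A Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128.
# At FP16 that is 0.5 MiB per token -- 2 GiB per 4096-token sequence,
# so a batch of 8 such sequences costs 16 GiB of cache alone.
print(kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8) / 2**30)
```

This is why concurrency, not weights, is usually the hidden memory sink: the cache scales linearly with both sequence length and batch size.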

Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

· 10 min read
Tian Pan
Software Engineer

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs fit 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they've already run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.
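The weights-only arithmetic from the opening paragraph is one line, which is exactly why it's seductive. A sketch (the helper name is hypothetical), showing how quantization moves the number that teams anchor on:

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate weight footprint in GB: billions of parameters
    times bytes per parameter (decimal GB)."""
    return params_billion * bytes_per_param

# 70B parameters at FP16 (2 bytes), INT8 (1 byte), INT4 (0.5 bytes):
print(weight_gb(70, 2))    # 140 GB -- the number that triggers the panic
print(weight_gb(70, 1))    # 70 GB
print(weight_gb(70, 0.5))  # 35 GB
```

But as the paragraph above notes, this figure is only the starting point; the KV cache and runtime overhead sit on top of it.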