Silent Quantization: Why the Model You Pay For Today Isn't the Model You Paid For Last Quarter
The model name on your invoice is the same as it was last quarter. The version string in the API response hasn't changed. The model card and pricing page read identically. And yet your eval scores have drifted half a point downward, your refusal patterns shifted in ways your prompts didn't ask for, and a handful of customer complaints came in last Tuesday about output that "feels different." You debug your code. You don't find anything. The code didn't change. The weights did.
Silent quantization is the gap between the model you contracted for and the model the provider is actually serving. It happens because inference economics keep tightening — every dollar of GPU capacity has to feed more requests this quarter than last — and the cheapest way to absorb that pressure is to re-host the same model name on cheaper precision tiers. FP16 becomes FP8. FP8 becomes FP4 in some routes. Mixed-precision shards get swapped in. The version string doesn't move because the version string was never a precision contract; it was a marketing contract.
