The Model Card Benchmark Whose Methodology Shifted While Your Contract Cited the Number
Your procurement team renewed the inference contract last quarter and noted, with quiet satisfaction, that the quality clause referencing "HumanEval pass@1 of 84%" had been comfortably exceeded by the provider's latest model card, which now reports 87%. Three points to the good. The clause is satisfied. The relationship is healthy. Meanwhile, your inference team's own regression suite — the one that actually exercises the tasks your product depends on — shows a 2% decline on held-out evaluation cases since the model update shipped. Both numbers are real. Only one of them is in the contract.
This is what it looks like when a marketing artifact is load-bearing in a legal document. The benchmark number on the model card is the headline of a measurement; the methodology that produced it is a footnote in an appendix nobody on the contract review chain reads. When the provider changes the methodology — switches from greedy decode to best-of-three sampling, adds a structured-output system message, swaps the prompt template to match the model's new chat tuning — the number moves in a way that has nothing to do with your traffic and everything to do with how the number is computed. Your contract clause cites the number. The counterparty controls the protocol that produces it. You've signed a clause whose meaning the other side can revise without violating it.
