Who Pays for the Compute? Model Units and the Infra Side of Inference Cost Governance
A CFO can demand an Inference Efficiency Ratio. We argued for exactly that in the IER essay: treat cost-per-inference as a launch gate, the same way unit economics gate any other line of business. But a ratio is only governable if the denominator is real. Someone has to be able to point at a slice of GPU and say “this tenant consumed that, and here is the bill.”
Databricks’ Mosaic AI inference team published an account of how they serve large language models at scale, and buried inside the reliability story is an accounting primitive that matters more than the uptime numbers. They serve roughly 120 trillion tokens per month. They claim more than 80% GPU cost savings versus provisioning statically for peak. Those two numbers do not coexist by accident. They coexist because the system was designed to make GPU spend allocable.
This is the supply side of inference cost governance. Cost-per-inference is not only a number a finance team computes after the fact. It is an infrastructure design choice made long before the first request lands.
The Allocation Problem Nobody Names
Multi-tenant LLM serving has a quiet accounting failure baked into the naive design. You buy a fleet of GPUs. Many tenants share it. Requests arrive in bursts that do not line up across tenants. So you face the same dilemma every shared-resource system faces: provision for peak and waste most of the capacity most of the time, or provision for average and drop requests when several tenants spike together.
Static peak provisioning is how you get a GPU bill that no one can defend. The CFO asks “what did tenant A actually cost,” and the honest answer is “we bought enough hardware for everyone’s worst hour, so the question does not have a clean answer.” That is not a reporting problem. It is an architecture that refuses to be measured.
The 80% savings figure is the tell. You only save 80% against static peak if you have stopped provisioning for peak. And you can only safely stop provisioning for peak if you can move capacity between tenants fast enough that a spike in one does not starve another. That capability requires an abstraction that the raw hardware does not give you.
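The arithmetic behind that kind of claim can be sketched in a few lines of Python. The tenant names, the demand figures, and the unit-hour measure below are invented for illustration; they are not Databricks' data. The sketch also assumes capacity can follow demand hour by hour, ignoring the scale-up latency and safety headroom a real system would need.

```python
# Hypothetical hourly demand per tenant, in arbitrary capacity units.
# Each tenant spikes in a different hour, as in the bursty case described above.
demand = {
    "tenant_a": [10, 80, 10, 10],
    "tenant_b": [10, 10, 80, 10],
    "tenant_c": [10, 10, 10, 80],
}
hours = 4


def static_peak_unit_hours(demand, hours):
    """Every tenant holds its own worst-hour capacity for the whole period."""
    return sum(max(series) for series in demand.values()) * hours


def elastic_unit_hours(demand):
    """A shared pool holds only what all tenants need in each hour."""
    return sum(sum(hour) for hour in zip(*demand.values()))


static = static_peak_unit_hours(demand, hours)  # (80 + 80 + 80) * 4 = 960
elastic = elastic_unit_hours(demand)            # 30 + 100 + 100 + 100 = 330
savings = 1 - elastic / static

print(f"static={static} elastic={elastic} savings={savings:.1%}")
```

The size of the saving depends entirely on how peaky and how uncorrelated the tenants are. That is why the figure only becomes reachable once capacity can actually move between them.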
Model Units: Making GPU Spend Countable
The abstraction Databricks reaches for is the “model unit.” Instead of exposing tenants to raw GPUs, the platform exposes a quantum of serving capacity. A tenant is allocated model units. The system knows how many units a tenant holds, how many it is using, and what those units cost. GPU spend becomes a function of units consumed, not of hardware purchased.
This is the move that makes the IER’s denominator honest. When the unit of account is the model unit rather than the physical GPU, every layer of the stack can report against it:
- Capacity planning happens in units, so you can size a tenant’s allocation to its demand curve instead of to the fleet’s peak.
- Auto-scaling happens in units, so the system adds and removes serving capacity in increments it can price.
- Billing happens in units, so the CFO gets a per-tenant cost that traces back to a real consumption record, not a fleet-wide average smeared across everyone.
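One way to picture a single unit of account shared by planning, scaling, and billing is a small ledger. This is a hedged sketch, not Databricks' implementation: the `UnitLedger` class, its method names, and the flat price per unit-hour are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class UnitLedger:
    """Hypothetical ledger: allocation, scaling, and billing share one unit."""

    price_per_unit_hour: float  # assumed flat price, for illustration only
    allocated: dict = field(default_factory=dict)
    unit_hours: dict = field(default_factory=lambda: defaultdict(float))

    def scale(self, tenant: str, units: int) -> None:
        """Set a tenant's allocation; scaling happens in whole units."""
        if units < 0:
            raise ValueError("units must be non-negative")
        self.allocated[tenant] = units

    def tick(self, hours: float = 1.0) -> None:
        """Accrue unit-hours for whatever each tenant currently holds."""
        for tenant, units in self.allocated.items():
            self.unit_hours[tenant] += units * hours

    def bill(self, tenant: str) -> float:
        """Per-tenant cost derived from the same record scaling wrote to."""
        return self.unit_hours[tenant] * self.price_per_unit_hour


ledger = UnitLedger(price_per_unit_hour=2.0)
ledger.scale("tenant_a", 4)
ledger.scale("tenant_b", 1)
ledger.tick()                # tenant_a accrues 4 unit-hours, tenant_b accrues 1
ledger.scale("tenant_a", 2)  # scale down after the spike
ledger.tick()                # tenant_a accrues 2 more, tenant_b accrues 1 more

print(ledger.bill("tenant_a"), ledger.bill("tenant_b"))  # 12.0 4.0
```

The point of the sketch is that `bill` reads the same record that `scale` and `tick` write, so there is no separate cost model to drift out of sync with operations.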
Two named systems carry this in their design. Axon is the router, the layer that decides which request goes to which serving capacity. Dicer is described as a model-unit-aware auto-sharder, meaning the component that splits and places model serving knows about units as a first-class concept. The naming matters less than the architectural commitment: units are not a billing afterthought bolted onto a GPU pool. They are the substrate the routing and sharding logic is built on.
That is the difference between a system you can govern and a system you can only invoice. In a unit-native system, the same number the auto-sharder reasons about is the number the finance team bills against. There is no translation layer where the cost story quietly diverges from the operational story.
The Reliability Half Is Also an Accounting Half
The Databricks account spends most of its length on reliability, and at first that reads as a separate concern from cost. It is not. A dropped request and a wasted GPU-second are the same failure viewed from two sides. Both mean capacity you paid for did not produce a governable outcome.
Two of their reliability lessons make the point concretely.
First, the image-processing fix. They found that PIL was roughly 10 times slower than Torchvision processors for image preprocessing. Replacing it yielded more than 3x requests per second on the affected path. Read that as a cost number, not just a throughput number. Same hardware, same model units allocated, three times the work served, so each request carries about one-third of the cost it did before. The cost-per-inference on that path fell by roughly two-thirds because of one library swap in the preprocessing step. No CFO dashboard would have surfaced that. It lived in a profiler, on the supply side, exactly where the IER denominator is actually determined.
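The conversion from throughput to cost is simple enough to write down. The price per unit-hour and the baseline request rate below are placeholder values, not figures from the Databricks account; only the 3x multiplier comes from it.

```python
def cost_per_request(cost_per_unit_hour: float, requests_per_second: float) -> float:
    """Cost of one request when a unit serves a steady request rate."""
    return cost_per_unit_hour / (requests_per_second * 3600)


unit_cost = 10.0     # assumed price of one model unit for one hour
baseline_rps = 5.0   # assumed throughput before the preprocessing fix

before = cost_per_request(unit_cost, baseline_rps)
after = cost_per_request(unit_cost, baseline_rps * 3)  # the reported 3x

reduction = 1 - after / before
print(f"cost per request fell by {reduction:.1%}")  # 66.7%
```

The placeholder values cancel out: any fixed allocation serving three times the requests yields the same two-thirds reduction.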
Second, the silent-hang lesson, which is the one most teams will recognize and most will have gotten wrong. They were seeing liveness-probe false failures several times a week. A liveness probe answers “is the process alive,” and a hung inference worker can pass that check while serving nothing. The process is up. The port responds. The work is not moving. Liveness probes alone cannot see this, because a silent hang is not a death; it is a stall that looks like health.
Their fix was active silent-hang detection: instead of waiting for the process to die, the system actively checks whether work is progressing and treats a stalled worker as failed. False failures went from several a week to zero. The accounting implication is direct. A hung worker holding a model unit is a unit you are paying for that serves nothing, and a probe that says it is healthy is a measurement that lies to your cost model. Active detection reclaims the unit and corrects the books.
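The summary above does not spell out the detection mechanism, so the following is only one plausible shape for it: a watchdog that tracks the last time work advanced and flags a worker that has requests in flight but has made no progress within a timeout. The `ProgressWatchdog` class, its method names, and the 30-second timeout are assumptions for illustration.

```python
import time


class ProgressWatchdog:
    """Hypothetical progress check: alive is not enough, work must advance."""

    def __init__(self, stall_timeout_s: float, clock=time.monotonic):
        self.stall_timeout_s = stall_timeout_s
        self._clock = clock
        self._last_progress = clock()
        self._in_flight = 0

    def request_started(self) -> None:
        if self._in_flight == 0:
            # Idle time before this request must not count as a stall.
            self._last_progress = self._clock()
        self._in_flight += 1

    def made_progress(self) -> None:
        """Call whenever work advances, for example when a token is emitted."""
        self._last_progress = self._clock()

    def request_finished(self) -> None:
        self._in_flight -= 1
        self._last_progress = self._clock()

    def is_stalled(self) -> bool:
        """True only when there is work in flight and none of it is moving."""
        if self._in_flight == 0:
            return False
        return self._clock() - self._last_progress > self.stall_timeout_s


# Demo with a fake clock so the behaviour is deterministic.
now = [0.0]
watchdog = ProgressWatchdog(stall_timeout_s=30, clock=lambda: now[0])

now[0] = 100.0
print(watchdog.is_stalled())  # False: idle workers are not stalled

watchdog.request_started()
now[0] = 120.0
watchdog.made_progress()
now[0] = 160.0
print(watchdog.is_stalled())  # True: 40s with work in flight and no progress
```

A health endpoint that returns the result of `is_stalled()` would let an orchestrator restart the worker and free the unit. A plain process-alive check cannot make that decision.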
The Two Sides Have to Match
Put the two essays together and the shape is clear. The Inference Efficiency Ratio is the demand-side instrument: the CFO’s gate, the number that decides whether a feature ships. Model units are the supply-side instrument: the infrastructure that makes that number traceable to a real consumption record per tenant.
A ratio without an allocable denominator is theater. You can publish an IER computed from a fleet-wide GPU bill divided by total inferences, but it tells you nothing actionable, because you cannot attribute it, cannot bill it, and cannot tune it tenant by tenant. The same trap appears one level up from the hyperscale saturation question: throughput at the substrate is only useful if you can price it where it is consumed.
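A small worked example shows how much a fleet-wide average can hide. The tenants, unit-hours, and inference counts are made up, and the flat price per unit-hour is an assumption.

```python
PRICE_PER_UNIT_HOUR = 1.0  # assumed flat price, for illustration only

# Hypothetical per-tenant consumption records.
usage = {
    "tenant_a": {"unit_hours": 90.0, "inferences": 1_000},
    "tenant_b": {"unit_hours": 10.0, "inferences": 9_000},
}

total_cost = sum(u["unit_hours"] for u in usage.values()) * PRICE_PER_UNIT_HOUR
total_inferences = sum(u["inferences"] for u in usage.values())

# What a fleet-wide bill divided by total inferences reports.
fleet_average = total_cost / total_inferences

# What per-tenant consumption records make visible.
per_tenant = {
    tenant: u["unit_hours"] * PRICE_PER_UNIT_HOUR / u["inferences"]
    for tenant, u in usage.items()
}

print(f"fleet average: {fleet_average:.4f}")  # 0.0100
for tenant, cost in per_tenant.items():
    print(f"{tenant}: {cost:.4f}")            # tenant_a: 0.0900, tenant_b: 0.0011
```

In this invented case the fleet average of 0.01 describes neither tenant: one costs nine times the average per inference and the other roughly a ninth of it.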
The reliability work closes the loop. Active silent-hang detection and a 3x preprocessing win are not separate from cost governance. They are the mechanism that keeps the model unit honest, ensuring a unit you allocated is a unit doing work, and a unit doing work is a unit you can bill with a straight face.
Do This Now
Pull your largest inference workload and ask three questions of the platform team, not the finance team.
First: what is the unit of account below the GPU? If the answer is “the GPU,” you cannot allocate cost per tenant, and your IER is a fleet average wearing a per-tenant costume. Define a serving unit that routing, scaling, and billing all reference.
Second: how do you detect a hung worker? If the answer is “the liveness probe,” assume you are paying for stalled units right now. Add active progress detection that treats a non-advancing worker as failed, and measure how many false-healthy workers it reclaims in the first week.
Third: profile the preprocessing path. The Databricks 3x came from one library on the input side, not from the model. Cost-per-inference is set by the whole path, and the cheapest savings are usually upstream of the GPU, in the code no dashboard watches.
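A first pass at that profiling does not need special tooling. The sketch below times each stage of a request with the standard library and reports each stage's share of the total. The stage names and the stand-in workloads are placeholders; in a real service the `preprocess` block would wrap the actual decoding and resizing code and the `model` block the actual forward pass.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent inside a named stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start


def handle_request(payload: bytes) -> float:
    with timed("preprocess"):
        features = [b / 255 for b in payload]  # stand-in for image preprocessing
    with timed("model"):
        result = sum(features)                 # stand-in for the model call
    return result


for _ in range(100):
    handle_request(bytes(range(256)))

total = sum(stage_seconds.values())
shares = {
    stage: (seconds / total if total else 0.0)
    for stage, seconds in stage_seconds.items()
}

for stage, share in shares.items():
    print(f"{stage}: {share:.0%} of request time")
```

If the preprocessing share turns out to be large, that is where the cheap savings are likely to be, before anyone touches the model or the hardware.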
The CFO owns the ratio. The platform team owns whether the ratio means anything. Governable inference cost is not a report you generate. It is an architecture you choose.
This analysis synthesizes “Reliable LLM Inference at Scale” (Databricks, May 2026).
All articles on The Thinking Wire are written with the assistance of Anthropic's Opus LLM. Each piece goes through multi-agent research to verify facts and surface contradictions, followed by human review and approval before publication. If you find any inaccurate information or wish to contact our editorial team, please reach out at editorial@victorinollc.com.