The Stealth Tax of Model Upgrades: What GPT-5.5 Actually Costs
With GPT-5.5, OpenAI doubled the nominal price. Input went from $2.50 to $5.00 per million tokens; output went from $15 to $30 per million tokens. A 2x sticker increase, announced in the release notes, easy to find on the pricing page.
The actual cost increase your next invoice will report is somewhere between 49 and 92 percent. The variance is the whole story.
Justin Summerville at OpenRouter ran the cohort analysis on May 4. The methodology matters: he isolated users who used GPT-5.4 as a primary model in the pre-launch window of April 21 to 23, then used GPT-5.5 as a primary model in the post-launch window of April 25 to 28. Same tokenizer family, normalized per million tokens, with media, cancelled, and zero-token requests excluded. This is not a benchmark of two test prompts. This is what happened to real bills, on real workloads, in the four days after the upgrade.
What the cohort data shows is that the 2x sticker price does not translate cleanly to a 2x bill. Shorter prompts saw closer to a 92 percent cost increase, with no offsetting efficiency gain. Longer prompts (10,000+ tokens) saw cost increases closer to 49 percent, because GPT-5.5 generated 19 to 34 percent fewer completion tokens for the same task. The new model is more efficient on long contexts. The new model is also more expensive on every context. The two effects partially cancel for some workloads and stack together for others.
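The arithmetic is easy to verify with illustrative numbers (these are not from the cohort data, and they use the top of the efficiency range). Take a long-context call: 10,000 input tokens, 5,000 output tokens. At the old prices it cost $0.025 + $0.075 = $0.10. At the new prices, with 34 percent fewer completion tokens, it costs $0.05 + $0.099 (3,300 tokens at $30 per million) = $0.149, a 49 percent increase. A short call with little output and no completion-token savings pays close to the full 2x. The observed range falls out of nothing but prompt shape and output behavior.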
This is the stealth part of the tax. Two teams in the same company will see two different bill increases. Neither team will be able to explain why without the data.
Why Finance Cannot Tell
Finance cannot distinguish a pricing change from usage drift without per-cohort monitoring. The bill went up 73 percent month over month. Was that the price change? Was that more users? Was that prompts getting longer because someone shipped a new feature? Was that the model generating different output lengths? In a normal SaaS line item, these questions never come up because the unit price is fixed. Token-priced AI breaks that assumption.
The default tooling does not help. Most teams send everything to one OpenAI key, get a single invoice, and let the bill aggregate. The bill tells you what you spent. It does not tell you why the per-call cost moved.
To answer “why,” you need the same thing OpenRouter built to answer it for the public: a cohort. Same workload, same prompt distribution, same user segment, before and after the change. Without that, every cost conversation collapses into “the model got more expensive” and stops there. Procurement renegotiates a contract that was not the actual problem. Engineering trims prompt length that was not the actual driver. Nothing improves.
What a Cost Cohort Actually Looks Like
A cost cohort is not exotic. It is three columns added to whatever request log you already write; a schema sketch follows the list:
- Model version at call time. Not “gpt-5”, but the exact published version. Every model upgrade is a new column value, never a silent overwrite.
- Workload tag. Sales summarization, customer support draft, internal coding agent. Whatever your top five use cases are. One tag per call.
- Token counts on both sides. Input tokens, output tokens, both stored, never derived later from the model name.
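Concretely, a minimal sketch of that log in Postgres-flavored SQL. Every name here is illustrative, not a standard, and the timestamp and caller columns are assumed to exist already in whatever log you write today:

```sql
-- The three added cohort fields live alongside whatever the log
-- already captures. All table and column names are illustrative.
CREATE TABLE ai_request_log (
    called_at      timestamptz NOT NULL,  -- assumed to exist already
    caller_id      text        NOT NULL,  -- service or user; assumed to exist already
    model_version  text        NOT NULL,  -- exact published version, never a silent overwrite
    workload_tag   text        NOT NULL,  -- one tag per call, from your top use cases
    input_tokens   bigint      NOT NULL,  -- stored at call time
    output_tokens  bigint      NOT NULL   -- never derived later from the model name
);
```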
With those three fields, you can answer any cost question in SQL. Group by model version, hold the workload tag constant, and the price effect falls out cleanly. Group by workload tag, hold the model version constant, and usage drift falls out cleanly. The two questions stop blurring into each other.
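Each question is one query against the table above. A Postgres-flavored sketch, with the workload tag and version string as placeholder values:

```sql
-- Price effect: hold the workload constant, group by model version.
SELECT model_version,
       SUM(input_tokens)  / 1e6                AS input_mtok,
       SUM(output_tokens) / 1e6                AS output_mtok,
       AVG(output_tokens)                      AS avg_completion_tokens
FROM ai_request_log
WHERE workload_tag = 'support_draft'           -- placeholder tag
GROUP BY model_version;

-- Usage drift: hold the model version constant, group by workload.
SELECT workload_tag,
       COUNT(*)                                AS calls,
       SUM(input_tokens + output_tokens) / 1e6 AS total_mtok
FROM ai_request_log
WHERE model_version = 'gpt-5.5-2026-04-24'     -- placeholder version string
GROUP BY workload_tag;
```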
Most teams do not have those three fields. They have a single token counter that aggregates everything and a model name that updates silently when the provider points “gpt-5” at a new version. When the bill jumps, no one can run the analysis OpenRouter ran. No one can even run a private version of it on a single workload.
The Procurement Footnote Is Now an Operating Discipline
For most of cloud’s history, cost governance has been a procurement footnote. You negotiated commitments, you tagged resources, finance ran a quarterly review, and the unit price did not move between reviews. AI inference breaks that model. Unit prices move on the provider’s schedule, often without an SLA on backwards-compatible cost. The provider doubles the headline price; whether your bill doubles, triples, or stays flat depends on prompt distribution, output length, and which workload uses the model most. Procurement cannot answer those questions. Engineering does not naturally generate the data to answer them.
Two changes follow from this:
First, the FinOps role for AI cannot be a quarterly audit. It has to be a weekly cohort review the week of every model upgrade announcement, with the data already in hand from before the upgrade. The 4-day post-launch window OpenRouter used is roughly the window an enterprise has to discover whether the new model raised costs for them specifically, or for someone else’s workload mix.
Second, the engineering responsibility for cost telemetry has to move from “we tag our spend” to “we can reproduce the provider’s analysis on our own data.” If OpenRouter can isolate a switching cohort across millions of users, your platform team can isolate a switching cohort across your hundred services. The data shape is the same. The query is shorter.
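Shorter, and roughly this shape. A Postgres-flavored sketch against the same table, using a simplified cohort rule (any use in the window, rather than OpenRouter's primary-model criterion); the dates and version strings are placeholders for whatever your upgrade window was:

```sql
-- Callers that used the old model in the pre window AND the new model
-- in the post window form the switching cohort; then compare per-million-
-- token cost drivers across versions for that cohort only.
WITH cohort AS (
    SELECT caller_id
    FROM ai_request_log
    WHERE (model_version = 'gpt-5.4'   -- placeholder versions and dates
           AND called_at >= '2026-04-21' AND called_at < '2026-04-24')
       OR (model_version = 'gpt-5.5'
           AND called_at >= '2026-04-25' AND called_at < '2026-04-29')
    GROUP BY caller_id
    HAVING COUNT(DISTINCT model_version) = 2
)
SELECT r.model_version,
       COUNT(*)                  AS calls,
       SUM(r.input_tokens)  / 1e6 AS input_mtok,
       SUM(r.output_tokens) / 1e6 AS output_mtok,
       AVG(r.output_tokens)      AS avg_completion_tokens
FROM ai_request_log r
JOIN cohort USING (caller_id)
WHERE r.called_at >= '2026-04-21' AND r.called_at < '2026-04-29'
GROUP BY r.model_version;
```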
As we argued in The February 5 Convergence, the boundaries between vendor capability changes and operating expense changes are now coupled. A model upgrade is not a feature release. It is a contract change with cost implications you cannot read off the contract. As we argued in Measuring AI in Software Development, measurement discipline is what separates teams that improve from teams that hope. Cost is the measurement most teams skip first and discover last.
The Test for Your Team
Three questions, this week:
Can you tell me, for any production workload, what percentage of last month’s AI spend came from price changes versus volume changes versus prompt changes? If the answer is one number with no decomposition, the cohort discipline is missing. (One workable decomposition is sketched after the three questions.)
Can you produce a pre-upgrade and post-upgrade snapshot of any model migration in the last 90 days, normalized per million tokens? If the answer involves explaining that the data is gone, you are paying the stealth tax and you will pay it again on the next upgrade.
Who owns the weekly review? If the answer is “finance” or “engineering” or no one, the discipline does not exist. The review is a 30-minute meeting between a platform engineer who can write the SQL and a finance partner who can read it. The output is one page.
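On the first question, one workable decomposition, under the simplifying assumption of a single blended token price: monthly spend factors as calls × average tokens per call × effective price per token, so the month-over-month ratio factors the same way. With illustrative numbers, the 73 percent jump from earlier might decompose as 1.10 × 1.20 × 1.31 ≈ 1.73: ten percent more calls, prompts twenty percent longer, and a 31 percent effective price move. Each factor is computable from the three cohort fields. None is readable off the invoice.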
The cost of building this is small. The cost of not building it is one major upgrade away from being on the front page of your CFO’s deck.
This analysis synthesizes GPT-5.5 Cost Analysis (OpenRouter / Justin Summerville, May 2026).
Victorino Group helps finance and platform teams build cost-cohort monitoring before the next model upgrade. Let’s talk.