AI inference economics in 2026: GPT, Claude, Gemini, and the pricing war that is rewriting the application stack: Deluair Consultancy

Inference is the largest unsolved cost line in enterprise AI. Output token prices for frontier general capability have fallen from 60 dollars per million in 2023 to between one and three dollars in 2026, a roughly 20 to 60 fold compression. Epoch AI estimates a 10x annual decline at constant capability, sustained for three years, driven by the Hopper to Blackwell hardware transition, FP8 then FP4 quantization, paged attention and speculative decoding, and mixture of experts routing. DeepSeek V3 and R1 reset the floor between December 2024 and January 2025 by offering reasoning quality at roughly one to two dollars per million output. Yet hyperscaler capex keeps climbing, with Meta, Microsoft, Amazon, and Google together guiding to roughly 325 billion dollars for fiscal year 2025, because inference time scaling turns reasoning depth into a new spend axis. This brief unpacks the curve, the technology stack underneath it, the four way pricing war between OpenAI, Anthropic, Google, and the open weight tier, and the procurement and architecture implications for application builders, enterprise CIOs, hyperscalers, and the GPU resale chain.

The price curve from GPT-3 davinci to GPT-5

The single most useful anchor for any 2026 AI cost conversation is the price per million output tokens for general frontier capability. OpenAI listed text-davinci-001 at 60 dollars per million output tokens in 2020. GPT-4 launched in March 2023 at the same 60 dollars per million for the 8K context tier despite a roughly two order of magnitude gain in capability. GPT-4 Turbo in November 2023 cut output to 30 dollars. GPT-4o in May 2024 cut it again to 15 dollars. GPT-4o mini in July 2024 introduced a small model tier at 60 cents per million output. By the time GPT-5 launched in 2025, the frontier general price had moved to single digit dollars per million output, with cached input priced in cents.

Anthropic ran a parallel curve. Claude 3 Opus in March 2024 listed at 75 dollars per million output, the high water mark for closed frontier reasoning. Claude 3.5 Sonnet in June 2024 came in at 15 dollars for a model that beat Opus on most benchmarks. Claude 3.5 Haiku in November 2024 sat at 4 dollars. Google followed with Gemini 1.5 Pro at 5 dollars input and 15 dollars output, then aggressively undercut with Gemini 1.5 Flash and Flash 8B, priced well under one dollar per million tokens. By 2026 frontier general pricing across the three closed labs has converged into a one to five dollar per million output band, with reasoning tiers at 8 to 20 dollars and small models at 10 to 80 cents.

Epoch AI documented the cadence in a March 2024 brief, finding that the price to reach a fixed MMLU level fell roughly 10x per year between 2022 and 2024, with similar slopes on coding and reasoning evals. Three years at that rate compound to a thousand fold compression for any frozen capability target. Any workload priced on 2024 economics is overpaying by at least an order of magnitude in 2026, and any workload sized on 2026 economics will be overpaying again by 2027 if the curve holds.
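The compounding arithmetic can be sketched in a few lines. This is a minimal illustration that assumes a constant annual decline factor, which is a simplification of the Epoch AI estimate. The function names are hypothetical and the 60 dollar starting price is used only as an example input.

```python
def price_after(start_price: float, annual_factor: float, years: float) -> float:
    """Price of a frozen capability target after `years` of compounding decline.

    Assumes the price divides by `annual_factor` every year (a simplification).
    """
    if start_price < 0 or annual_factor <= 0:
        raise ValueError("start_price must be >= 0 and annual_factor must be > 0")
    return start_price / (annual_factor ** years)


def overpayment_multiple(annual_factor: float, years_stale: float) -> float:
    """How many times too much a buyer pays if pricing is `years_stale` years old."""
    if annual_factor <= 0:
        raise ValueError("annual_factor must be > 0")
    return annual_factor ** years_stale


if __name__ == "__main__":
    for years in (1, 2, 3):
        price = price_after(60.0, 10.0, years)
        print(f"after {years} year(s): {price:.2f} USD per million output tokens")
    print(f"one year stale: {overpayment_multiple(10.0, 1):.0f}x overpayment")
```

At a strict 10x per year, pricing that is one year stale already implies a 10x overpayment for the same capability, which is why the text treats an order of magnitude as the lower bound.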

Model	Launch	Input USD per M	Output USD per M
GPT-3 text-davinci-001	Jun 2020	60	60
GPT-4 8K context	Mar 2023	30	60
GPT-4 Turbo	Nov 2023	10	30
GPT-4o	May 2024	5	15
GPT-4o mini	Jul 2024	0.15	0.60
Claude 3 Opus	Mar 2024	15	75
Claude 3.5 Sonnet	Jun 2024	3	15
Claude 3.5 Haiku	Nov 2024	0.80	4
Gemini 1.5 Pro	May 2024	5	15
DeepSeek V3	Dec 2024	0.27	1.10
DeepSeek R1	Jan 2025	0.55	2.19

Headline list prices per million tokens at launch. Sources: OpenAI, Anthropic, Google, and DeepSeek published API pricing pages, captured at launch and verified against Internet Archive snapshots.

What is mechanically driving the 10x per year decline

Hardware contributes roughly half of the cumulative gain. The Nvidia H100 lists near 30 thousand dollars and delivers about 4 petaFLOPS of peak FP8 compute with sparsity. H200 at 40 thousand adds HBM3e capacity for long context. Blackwell B200 at roughly 50 thousand delivers about 20 petaFLOPS of peak FP4 compute with sparsity, a 5x raw step on the precision that matters for inference, with 2.5x to 4x the throughput per dollar of TCO. MLPerf Inference 4.1 in August 2024 measured H100 at roughly 5,500 tokens per second per server on GPT-J 6B and B200 at roughly 14,000 tokens per second on the same workload.

Software has compounded on top. FlashAttention 2 and 3 collapsed attention compute and memory cost. Paged attention from vLLM, now standard in TensorRT-LLM and SGLang, raised effective utilization from about 30 percent to between 70 and 80 percent. Continuous batching, prefix caching, and speculative decoding each contribute multiplicative gains, with prefix caching alone often cutting agent loop costs by 3x to 10x. Anthropic, OpenAI, and Google all expose cached input pricing at roughly 10 to 25 percent of the uncached rate, which rewards prompts built around a stable, reusable prefix.
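The effect of cached input pricing on a workload can be sketched as a blended rate. The example below is a hedged illustration: the function name is hypothetical, the 3 dollar uncached price is borrowed from the Claude 3.5 Sonnet row of the table purely as an input, and the 10 percent cached rate and 90 percent hit rate are assumptions rather than vendor quotes.

```python
def blended_input_price(
    uncached_price: float,
    cached_price_ratio: float,
    cache_hit_rate: float,
) -> float:
    """Effective USD per million input tokens for a given cache hit rate.

    cached_price_ratio: cached price as a fraction of the uncached price (0 to 1).
    cache_hit_rate: share of input tokens served from cache (0 to 1).
    """
    if not 0.0 <= cached_price_ratio <= 1.0:
        raise ValueError("cached_price_ratio must be between 0 and 1")
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    cached_price = uncached_price * cached_price_ratio
    return cache_hit_rate * cached_price + (1.0 - cache_hit_rate) * uncached_price


if __name__ == "__main__":
    uncached = 3.00  # USD per million input tokens, illustrative
    blended = blended_input_price(uncached, cached_price_ratio=0.10, cache_hit_rate=0.90)
    print(f"blended input price: {blended:.2f} USD per million")
    print(f"saving versus no cache: {uncached / blended:.1f}x")
```

Under these assumed inputs the blended rate is 0.57 dollars per million, a saving of a little over 5x, which sits inside the 3x to 10x range quoted for agent loops.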

Quantization is the third leg. FP8 became default production precision in late 2024. FP4 weight and activation quantization is native on Blackwell and now in production for the open weight tier and several closed providers, with quality loss well inside model to model variance on standard evals. Each precision step roughly doubles throughput for memory bound serving. Architecture is the fourth leg. Mixture of experts routing, where DeepSeek V3 activates roughly 37 billion of 671 billion total parameters per token, pushed effective compute per useful answer down by another factor of two to four versus dense models.
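A rough sketch of why precision and sparsity matter for memory bound serving follows. It counts weight bytes only and ignores KV cache, activations, and quantization scale overhead, so the figures are lower bounds. The helper names are hypothetical, and the parameter counts are the DeepSeek V3 numbers quoted above.

```python
# Bytes per parameter at each weight precision (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight footprint in GB (decimal) for a parameter count in billions."""
    return params_billions * BYTES_PER_PARAM[precision]


def active_fraction(active_billions: float, total_billions: float) -> float:
    """Share of parameters a mixture of experts model touches per token."""
    if total_billions <= 0:
        raise ValueError("total_billions must be > 0")
    return active_billions / total_billions


if __name__ == "__main__":
    total, active = 671.0, 37.0  # DeepSeek V3 totals quoted in the text
    for precision in ("fp16", "fp8", "fp4"):
        print(
            f"{precision}: {weight_memory_gb(total, precision):.1f} GB resident, "
            f"{weight_memory_gb(active, precision):.1f} GB of weights read per token"
        )
    print(f"active share per token: {active_fraction(active, total):.1%}")
```

Each precision step halves the bytes that must move per token, which is the mechanism behind the rough doubling of throughput when serving is limited by memory bandwidth rather than compute.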

The fifth leg is inference time scaling, which pushes the opposite way on cost. The o1 family, DeepSeek R1, Claude thinking, and Gemini deep think allocate variable thinking tokens at inference. A hard math or coding problem can consume 10 thousand to 100 thousand thinking tokens before the visible answer. Reasoning tiers list at 8 to 20 dollars per million output tokens in 2026, and the buyer pays for thinking as well as output, so a single hard query can cost dollars. The shape is a barbell: cheap commodity inference for routine work, expensive deep reasoning for the hard slice.
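The per query arithmetic is simple enough to sketch. The example assumes thinking tokens are billed at the output rate, as the text describes, and it ignores input tokens for brevity. The function name and the answer lengths are illustrative assumptions.

```python
def reasoning_query_cost_usd(
    thinking_tokens: int,
    answer_tokens: int,
    output_price_per_million: float,
) -> float:
    """Cost of one reasoning query, billing thinking and answer tokens as output."""
    if thinking_tokens < 0 or answer_tokens < 0 or output_price_per_million < 0:
        raise ValueError("token counts and price must be non-negative")
    billed_tokens = thinking_tokens + answer_tokens
    return billed_tokens * output_price_per_million / 1_000_000


if __name__ == "__main__":
    # Low end: 10 thousand thinking tokens at 8 USD per million output.
    easy = reasoning_query_cost_usd(10_000, 1_000, 8.0)
    # High end: 100 thousand thinking tokens at 20 USD per million output.
    hard = reasoning_query_cost_usd(100_000, 2_000, 20.0)
    print(f"light reasoning query: {easy:.3f} USD")
    print(f"heavy reasoning query: {hard:.2f} USD")
```

Across the ranges quoted above, a query runs from under ten cents to about two dollars, so the thinking token count matters as much to the bill as the list price.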

The DeepSeek reset and the open weight floor

DeepSeek V3 launched in December 2024 at 0.27 dollars per million input and 1.10 dollars per million output, with a reported training cost of 5.6 million dollars on H800 hardware, consistent with the 14.8 trillion training tokens and 2,048 H800 GPU configuration disclosed in the V3 technical report. R1 followed in January 2025 with reasoning capability competitive with o1 on math and coding benchmarks, at 0.55 dollars input and 2.19 dollars output. Nvidia shares fell roughly 17 percent on January 27, 2025, the largest single day market cap loss in US equity history at the time. The marginal frontier capability provider is now a Chinese open weight lab operating under export controls, and the price floor is set by whoever runs those weights cheapest on merchant infrastructure.

The merchant layer absorbed the reset within weeks. Together AI, Fireworks, Deepinfra, Lambda, SambaNova, Cerebras, and Groq each began serving V3 and R1 within a month, with prices clustering around DeepSeek's published rates. Cerebras WSE-3, a wafer scale part with 4 trillion transistors, serves weights from on chip memory, and Groq LPU hits hundreds of tokens per second on smaller open weights. Bedrock and Vertex repriced their open weight tiers within the quarter. The open weight floor for reasoning capability is now firmly under three dollars per million output tokens, and the closed labs justify their premium on capability headroom, safety posture, latency, ecosystem, and contract terms rather than any structural cost advantage.

Hyperscaler capex versus collapsing unit price

The 2026 paradox is that token prices keep falling while hyperscaler capex keeps climbing. Meta guided fiscal year 2025 capex to 60 to 65 billion dollars on January 29, 2025, up from 39 billion in 2024. Microsoft guided 80 billion. Amazon committed approximately 105 billion. Alphabet guided around 75 billion. The combined number sits near 325 billion dollars at the top of Meta's range, almost entirely AI infrastructure. Nvidia data center revenue ran above 30 billion dollars per quarter through fiscal year 2025 per the 10-Q filings, with the order book extended into 2026 well before Blackwell shipped at volume.

The reconciliation is volume and reasoning. Token consumption per user, per agent, per workflow is rising faster than unit price is falling, and inference time scaling adds a spend axis that did not exist in 2023. A single agentic coding session in 2026 can consume tens of millions of tokens. Multimodal inference pushes volume higher still. The hyperscalers are betting that aggregate revenue and platform stickiness justify the spend even as per token margins compress. Custom silicon, including TPU v6 Trillium, Trainium 2 and Inferentia, and Microsoft Maia and Cobalt, is designed to claw back gross margin from Nvidia on inference specifically, which is easier to specialize for than training.
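The reconciliation reduces to spend equals volume times unit price. The sketch below uses hypothetical growth and price figures chosen only to show the mechanism. None of the inputs are measured values, and the function names are illustrative.

```python
def spend_multiple(price_decline: float, volume_growth: float) -> float:
    """New spend divided by old spend.

    price_decline: factor by which unit price fell (10.0 means 10x cheaper).
    volume_growth: factor by which token volume rose (30.0 means 30x more tokens).
    """
    if price_decline <= 0 or volume_growth <= 0:
        raise ValueError("both factors must be > 0")
    return volume_growth / price_decline


def session_cost_usd(tokens: int, blended_price_per_million: float) -> float:
    """Cost of one session at a blended USD per million token rate."""
    if tokens < 0 or blended_price_per_million < 0:
        raise ValueError("tokens and price must be non-negative")
    return tokens * blended_price_per_million / 1_000_000


if __name__ == "__main__":
    # Hypothetical: unit price falls 10x while token volume rises 30x.
    print(f"spend multiple: {spend_multiple(10.0, 30.0):.1f}x")
    # Hypothetical: a 20 million token agentic session at 2 USD per million blended.
    print(f"session cost: {session_cost_usd(20_000_000, 2.0):.2f} USD")
```

Whenever volume growth exceeds the price decline the multiple is above one, so total inference spend can rise even while every individual token gets cheaper.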

The specialized inference chip cohort matters here. Groq LPU, Cerebras WSE-3, and SambaNova SN40L each target the latency sensitive slice. Cerebras filed publicly for IPO in 2024, disclosing G42 as a dominant customer. Groq raised at a 2.8 billion dollar valuation in August 2024. None displace Nvidia at scale in 2026, but they constrain pricing power on inference and validate that a meaningful share of the workload is escaping the general purpose GPU.

Vendor	FY2025 capex guide USD bn	Primary AI silicon	Inference share approx
Microsoft	80	Nvidia plus Maia	Rising, GPT family
Amazon	105	Nvidia plus Trainium 2 and Inferentia	Bedrock catalog
Alphabet	75	Nvidia plus TPU v6 Trillium	Gemini and Vertex
Meta	65	Nvidia plus MTIA	Internal Llama and ranking
Total	325

Fiscal year 2025 capex guidance from earnings calls and 10-K filings, January and February 2025. Meta is shown at the top of its 60 to 65 billion dollar range. Inference share is qualitative and reflects company commentary rather than a disclosed split.

The four way pricing war and where the margin pools are moving

OpenAI holds the deepest reasoning stack with the o series and GPT-5, using Microsoft Azure as a distribution wedge while keeping direct API margins thick. Anthropic has a reputation moat on coding agents, where Claude Sonnet and Opus consistently top SWE-bench Verified and the Aider leaderboard. Google uses Gemini Flash and Flash Lite as a price floor to recruit Vertex consumption, while Gemini 2.5 Pro and deep think compete at the top of the reasoning stack. The fourth player is the open weight tier, fronted by DeepSeek, Llama, Qwen, and Mistral, served by Together, Fireworks, Deepinfra, Lambda, Groq, Cerebras, and SambaNova on the merchant side, and by Bedrock, Vertex, and Foundry on the hyperscaler side.

Headline token margin is moving toward zero for commodity capability. Durable margin pools are shifting to four places. First, frontier reasoning, where willingness to pay scales with the value of the answer rather than the cost of the tokens, and where the o series, Claude Opus class, and Gemini deep think sustain double digit dollar per million pricing. Second, domain and task specialized agents, where the seller bundles tools, evals, memory, and workflow rather than raw tokens. Third, distribution and integration, where Microsoft, Google, and Amazon monetize the rest of the stack around inference. Fourth, the inference serving stack itself, where vLLM, TensorRT-LLM, SGLang, and the specialized chip vendors capture value by squeezing the curve faster than anyone else.

Recommendations by buyer

For application builders, default to a three layer tiered routing architecture. Use a small model tier, Claude 3.5 Haiku or Gemini Flash or a 7 to 13 billion parameter open weight model, for classification, extraction, and routing. Use a general capability tier, Sonnet class or GPT-4o class, for the bulk of user facing reasoning. Reserve a frontier reasoning tier, o series or Claude Opus or Gemini deep think, for requests gated behind explicit user intent or hard fallback. Instrument cache hit rates, token spend per user action, and quality regressions on every release. Keep prompts, tool schemas, and memory format portable across providers.
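The routing rule described above can be sketched as a small pure function. This is a minimal illustration, not a production router: the `Request` fields, the task labels, and the tier names are hypothetical, and a real system would map each tier to provider model identifiers through configuration so the choice stays portable.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    SMALL = "small"        # classification, extraction, routing
    GENERAL = "general"    # bulk of user facing reasoning
    FRONTIER = "frontier"  # gated deep reasoning


# Task labels assumed for illustration; real systems define their own taxonomy.
SMALL_TIER_TASKS = frozenset({"classification", "extraction", "routing"})


@dataclass(frozen=True)
class Request:
    task: str
    user_requested_deep_reasoning: bool = False  # explicit user intent
    general_tier_failed: bool = False            # hard fallback signal


def choose_tier(request: Request) -> Tier:
    """Pick the cheapest tier that the request is allowed to use."""
    # The frontier tier is reachable only through explicit intent or fallback.
    if request.user_requested_deep_reasoning or request.general_tier_failed:
        return Tier.FRONTIER
    if request.task in SMALL_TIER_TASKS:
        return Tier.SMALL
    return Tier.GENERAL


if __name__ == "__main__":
    examples = [
        Request(task="classification"),
        Request(task="chat"),
        Request(task="chat", user_requested_deep_reasoning=True),
        Request(task="code_fix", general_tier_failed=True),
    ]
    for request in examples:
        print(f"{request.task}: {choose_tier(request).value}")
```

Keeping the decision in one function with no provider specific logic makes it easy to log the chosen tier alongside token spend, which supports the instrumentation the paragraph recommends.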

For enterprise CIOs, structure procurement around four principles. Refuse fixed price token commitments longer than one year, because the curve will outrun any prepay. Negotiate cached input rates explicitly, because they drive the real cost of agent workloads. Push for transparent provisioned throughput pricing on Bedrock, Vertex, and Foundry rather than committed dollar volumes. Require token level telemetry from every vendor, including thinking tokens for reasoning models, so finance can attribute spend to features.
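Token level telemetry becomes useful once it is rolled up by feature. The sketch below assumes a hypothetical record shape and price sheet. Field names are illustrative, vendors expose usage in different formats, and thinking tokens are assumed to bill at the output rate.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class UsageRecord:
    feature: str                 # product feature that triggered the call
    uncached_input_tokens: int
    cached_input_tokens: int
    thinking_tokens: int         # reasoning tokens, billed as output (assumption)
    output_tokens: int


@dataclass(frozen=True)
class PriceSheet:
    input_per_million: float
    cached_input_per_million: float
    output_per_million: float


def record_cost_usd(record: UsageRecord, prices: PriceSheet) -> float:
    """Cost of a single call, split across the four token classes."""
    return (
        record.uncached_input_tokens * prices.input_per_million
        + record.cached_input_tokens * prices.cached_input_per_million
        + (record.thinking_tokens + record.output_tokens) * prices.output_per_million
    ) / 1_000_000


def spend_by_feature(records: Iterable[UsageRecord], prices: PriceSheet) -> Dict[str, float]:
    """Total USD spend per feature across all usage records."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.feature] += record_cost_usd(record, prices)
    return dict(totals)


if __name__ == "__main__":
    prices = PriceSheet(input_per_million=3.0, cached_input_per_million=0.3,
                        output_per_million=15.0)  # illustrative rates
    records = [
        UsageRecord("search", 1_000_000, 1_000_000, 0, 100_000),
        UsageRecord("agent", 0, 0, 200_000, 0),
    ]
    for feature, usd in spend_by_feature(records, prices).items():
        print(f"{feature}: {usd:.2f} USD")
```

Separating cached input and thinking tokens in the record is the point of the exercise, since those are the two line items the procurement principles above single out.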

For hyperscalers, custom silicon execution and platform integration are the priority. Trillium, Trainium 2, and Maia each need a clear price per token win on at least one major closed model in 2026 to justify development cost. The platform layer, Bedrock catalog breadth, Foundry enterprise integration, Vertex data and tooling, is where pricing power sits because it is where switching costs accumulate. Token margin alone is not a defensible business at scale.

For GPU resellers and lessors, mark assumptions to the curve. Hopper resale value is compressing faster than the canonical four year depreciation schedule implies. B200 and B300 will follow on a similar slope as Rubin enters production. Lease terms longer than 24 months at fixed rates carry asymmetric risk. The defensible model is short cycle, high utilization, multi tenant inference, with the operator passing through hardware and software gains within weeks rather than holding margin.

Cite this brief

@misc{hossen2026aiinferenceeconomics2026,
  author = {Hossen, Md Deluair},
  title  = {AI inference economics in 2026: GPT, Claude, Gemini, and the pricing war that is rewriting the application stack},
  year   = {2026},
  url    = {https://deluair.com/consultancy/insights/ai-inference-economics-2026},
  note   = {Deluair Consultancy briefs}
}

Hossen, M. D. (2026). AI inference economics in 2026: GPT, Claude, Gemini, and the pricing war that is rewriting the application stack. Deluair Consultancy briefs. https://deluair.com/consultancy/insights/ai-inference-economics-2026

Hossen, Md Deluair. "AI inference economics in 2026: GPT, Claude, Gemini, and the pricing war that is rewriting the application stack." Deluair Consultancy briefs, 2026-04-26. https://deluair.com/consultancy/insights/ai-inference-economics-2026.

On the watchlist

Upcoming dates that bear on this brief.

Q3 2026 Corporate

GPT-5 / Claude Opus 5 / Gemini 3 inference price floor

Whether the per-million output token floor breaks below USD 1 for frontier-grade models and how hyperscaler capex absorbs the deflation.

Q4 2026 Corporate

Samsung HBM4 mass production and NVIDIA qualification

Whether Samsung closes the gap with SK Hynix on HBM share, whether Micron stays at 9 percent, and how BIS HBM controls hold or relax under Trump 2.0.
