AI infrastructure economics

Compute cost per token calculator.

Build a bottoms-up TCO for an inference fleet and translate it into dollars per million output tokens. Inputs cover hardware throughput, utilization, power, PUE, energy price, capex, and useful life. Outputs decompose the cost stack and benchmark against published API list pricing. All math runs in your browser.

Inputs

Hardware Pick a class. Defaults populate throughput, power, and unit cost; override any field after.

Model size (parameters, B) Used to scale tokens per second per accelerator from the H100 70B baseline.

H100 baseline output tokens per second (per GPU, 70B FP8) SemiAnalysis and MLPerf Inference v4.1 cluster around 800 to 950 tok/s per H100 on 70B FP8 with continuous batching.

Hardware utilization (percent): 60 Wall-clock fraction of peak achieved over the year. Hyperscaler inference fleets typically run 50 to 70 percent.

Power Usage Effectiveness (PUE) Ratio of total data-center power to IT power. Hyperscale 2024 average is around 1.18 to 1.22 (Uptime Institute).

Accelerator power draw (watts) Nameplate TDP of one accelerator. H100 SXM 700W, H200 700W, B200 1000W, GB200 NVL72 1200W per GPU, TPU v5p 450W, Trillium 350W, Trainium 2 600W.

Energy cost (USD per kWh) EIA industrial electricity averaged 8.3 cents per kWh in 2024; long-term hyperscaler PPAs price closer to 5 to 7 cents.

Hardware acquisition cost (USD per accelerator) Street prices: H100 around 30k to 35k, H200 around 38k to 42k, B200 around 45k to 50k, GB200 NVL72 fully loaded around 3M (allocate 41.7k per GPU). TPUs and Trainium not retail; use cloud-equivalent capex.

Useful life (years) Hyperscalers extended depreciation to 6 years in 2023, but inference workloads at the frontier still cycle GPUs out faster.

Outputs

Effective tokens per second

510

Per accelerator, after utilization

Tokens per accelerator per year

16.1B

8,760 hours times effective tok/s

Energy cost per million tokens

$0.10

Watts times PUE times energy price

Depreciation cost per million tokens

$0.54

Capex divided by lifetime tokens

Total compute cost per million tokens

$0.64

Energy plus depreciation, output side

Implied gross margin vs Claude Sonnet list

95.7%

At $15 per 1M output tokens

API list price comparison

GPT-4o output: $5.00 per 1M (your stack: ratio computed below)
Claude Sonnet 4.5 output: $15.00 per 1M
Gemini 2.5 Pro output: $5.00 per 1M (under 200k context)
Claude Haiku 4.5 output: $1.00 per 1M
GPT-4o mini output: $0.60 per 1M

Sources: published rate cards for OpenAI, Anthropic, and Google (verified against vendor pricing pages, Q1 2026). Output-token pricing only; input tokens are roughly 3x to 5x cheaper.

Formulas

throughput_scale = (70 / model_params_B) (linear approximation, FP8)
effective_tps = h100_baseline_tps * throughput_scale * hw_throughput_multiplier * utilization
tokens_per_gpu_year = effective_tps * 3600 * 24 * 365
kw_per_gpu = (power_W * PUE) / 1000
energy_cost_per_year = kw_per_gpu * 8760 * energy_price
energy_cost_per_M_tokens = energy_cost_per_year / (tokens_per_gpu_year / 1e6)
depr_per_year = capex / useful_life_years
depr_per_M_tokens = depr_per_year / (tokens_per_gpu_year / 1e6)
total_per_M_tokens = energy_per_M + depr_per_M
gross_margin = 1 - (total_per_M_tokens / api_list_price_per_M)

Methodology

This is a unit-economics screen, not a full TCO model. It collapses the stack to two cost lines, energy and depreciation, because those two drive 70 to 85 percent of inference unit cost on modern accelerators. Network, cooling beyond PUE, racks, optical interconnect, RAM, and staff are excluded; they typically add 15 to 30 percent on top of the number printed here. Throughput scaling across model sizes is treated as inverse-linear in parameters, which holds for dense decoder-only models in the 8B to 405B range under FP8 with continuous batching; MoE architectures and speculative decoding can outperform this curve by 1.5x to 3x.

Hardware throughput multipliers, relative to H100 FP8 70B, are set from MLPerf Inference v4.1 and SemiAnalysis benchmarks: H200 1.35x (memory-bandwidth bound), B200 2.5x, GB200 per-GPU 3.1x (NVLink-72 scaling), TPU v5p 0.95x, Trillium 1.40x, Trainium 2 1.10x. These are order-of-magnitude defaults; production numbers vary with model architecture, batch size, sequence length, and software stack.

For the full Compute Cost Curve framework with six-step decomposition across acquisition, power, cooling, network, depreciation, and idle, plus model-architecture overlays and scenario tables, see the methodology one-pager. Adjacent reading: AI infrastructure insights and the AI and data economics practice.

Sources for default values and benchmarks: MLPerf Inference v4.1 (MLCommons, 2024); SemiAnalysis quarterly compute cost notes; EIA Form EIA-861 industrial electricity rates; Uptime Institute Global Data Center Survey 2024; vendor spec sheets (NVIDIA H100, H200, B200, GB200; Google TPU v5p, Trillium; AWS Trainium 2); Microsoft and Meta extended-depreciation 10-K disclosures (2023, 2024).