AI infrastructure economics

Compute cost per token calculator.

Build a bottoms-up TCO for an inference fleet and translate it into dollars per million output tokens. Inputs cover hardware throughput, utilization, power, PUE, energy price, capex, and useful life. Outputs decompose the cost stack and benchmark against published API list pricing. All math runs in your browser.

Inputs

Pick a class. Defaults populate throughput, power, and unit cost; override any field after.
Used to scale tokens per second per accelerator from the H100 70B baseline.
SemiAnalysis and MLPerf Inference v4.1 cluster around 800 to 950 tok/s per H100 on 70B FP8 with continuous batching.
Wall-clock fraction of peak achieved over the year. Hyperscaler inference fleets typically run 50 to 70 percent.
Ratio of total data-center power to IT power. Hyperscale 2024 average is around 1.18 to 1.22 (Uptime Institute).
Nameplate TDP of one accelerator. H100 SXM 700W, H200 700W, B200 1000W, GB200 NVL72 1200W per GPU, TPU v5p 450W, Trillium 350W, Trainium 2 600W.
EIA industrial electricity averaged 8.3 cents per kWh in 2024; long-term hyperscaler PPAs price closer to 5 to 7 cents.
Street prices: H100 around 30k to 35k, H200 around 38k to 42k, B200 around 45k to 50k, GB200 NVL72 fully loaded around 3M (allocate 41.7k per GPU). TPUs and Trainium not retail; use cloud-equivalent capex.
Hyperscalers extended depreciation to 6 years in 2023, but inference workloads at the frontier still cycle GPUs out faster.

Outputs

Effective tokens per second
510
Per accelerator, after utilization
Tokens per accelerator per year
16.1B
8,760 hours times effective tok/s
Energy cost per million tokens
$0.10
Watts times PUE times energy price
Depreciation cost per million tokens
$0.54
Capex divided by lifetime tokens
Total compute cost per million tokens
$0.64
Energy plus depreciation, output side
Implied gross margin vs Claude Sonnet list
95.7%
At $15 per 1M output tokens

API list price comparison

  • GPT-4o output: $5.00 per 1M (your stack: ratio computed below)
  • Claude Sonnet 4.5 output: $15.00 per 1M
  • Gemini 2.5 Pro output: $5.00 per 1M (under 200k context)
  • Claude Haiku 4.5 output: $1.00 per 1M
  • GPT-4o mini output: $0.60 per 1M

Sources: published rate cards for OpenAI, Anthropic, and Google (verified against vendor pricing pages, Q1 2026). Output-token pricing only; input tokens are roughly 3x to 5x cheaper.

Formulas

  1. throughput_scale = (70 / model_params_B) (linear approximation, FP8)
  2. effective_tps = h100_baseline_tps * throughput_scale * hw_throughput_multiplier * utilization
  3. tokens_per_gpu_year = effective_tps * 3600 * 24 * 365
  4. kw_per_gpu = (power_W * PUE) / 1000
  5. energy_cost_per_year = kw_per_gpu * 8760 * energy_price
  6. energy_cost_per_M_tokens = energy_cost_per_year / (tokens_per_gpu_year / 1e6)
  7. depr_per_year = capex / useful_life_years
  8. depr_per_M_tokens = depr_per_year / (tokens_per_gpu_year / 1e6)
  9. total_per_M_tokens = energy_per_M + depr_per_M
  10. gross_margin = 1 - (total_per_M_tokens / api_list_price_per_M)

Methodology

This is a unit-economics screen, not a full TCO model. It collapses the stack to two cost lines, energy and depreciation, because those two drive 70 to 85 percent of inference unit cost on modern accelerators. Network, cooling beyond PUE, racks, optical interconnect, RAM, and staff are excluded; they typically add 15 to 30 percent on top of the number printed here. Throughput scaling across model sizes is treated as inverse-linear in parameters, which holds for dense decoder-only models in the 8B to 405B range under FP8 with continuous batching; MoE architectures and speculative decoding can outperform this curve by 1.5x to 3x.

Hardware throughput multipliers, relative to H100 FP8 70B, are set from MLPerf Inference v4.1 and SemiAnalysis benchmarks: H200 1.35x (memory-bandwidth bound), B200 2.5x, GB200 per-GPU 3.1x (NVLink-72 scaling), TPU v5p 0.95x, Trillium 1.40x, Trainium 2 1.10x. These are order-of-magnitude defaults; production numbers vary with model architecture, batch size, sequence length, and software stack.

For the full Compute Cost Curve framework with six-step decomposition across acquisition, power, cooling, network, depreciation, and idle, plus model-architecture overlays and scenario tables, see the methodology one-pager. Adjacent reading: AI infrastructure insights and the AI and data economics practice.

Sources for default values and benchmarks: MLPerf Inference v4.1 (MLCommons, 2024); SemiAnalysis quarterly compute cost notes; EIA Form EIA-861 industrial electricity rates; Uptime Institute Global Data Center Survey 2024; vendor spec sheets (NVIDIA H100, H200, B200, GB200; Google TPU v5p, Trillium; AWS Trainium 2); Microsoft and Meta extended-depreciation 10-K disclosures (2023, 2024).