Compute cost per token calculator.
Build a bottoms-up TCO for an inference fleet and translate it into dollars per million output tokens. Inputs cover hardware throughput, utilization, power, PUE, energy price, capex, and useful life. Outputs decompose the cost stack and benchmark against published API list pricing. All math runs in your browser.
Outputs
API list price comparison
- GPT-4o output: $5.00 per 1M (your stack: ratio computed below)
- Claude Sonnet 4.5 output: $15.00 per 1M
- Gemini 2.5 Pro output: $5.00 per 1M (under 200k context)
- Claude Haiku 4.5 output: $1.00 per 1M
- GPT-4o mini output: $0.60 per 1M
Sources: published rate cards for OpenAI, Anthropic, and Google (verified against vendor pricing pages, Q1 2026). Output-token pricing only; input tokens are roughly 3x to 5x cheaper.
Formulas
- throughput_scale = (70 / model_params_B) (linear approximation, FP8)
- effective_tps = h100_baseline_tps * throughput_scale * hw_throughput_multiplier * utilization
- tokens_per_gpu_year = effective_tps * 3600 * 24 * 365
- kw_per_gpu = (power_W * PUE) / 1000
- energy_cost_per_year = kw_per_gpu * 8760 * energy_price
- energy_cost_per_M_tokens = energy_cost_per_year / (tokens_per_gpu_year / 1e6)
- depr_per_year = capex / useful_life_years
- depr_per_M_tokens = depr_per_year / (tokens_per_gpu_year / 1e6)
- total_per_M_tokens = energy_per_M + depr_per_M
- gross_margin = 1 - (total_per_M_tokens / api_list_price_per_M)
Methodology
This is a unit-economics screen, not a full TCO model. It collapses the stack to two cost lines, energy and depreciation, because those two drive 70 to 85 percent of inference unit cost on modern accelerators. Network, cooling beyond PUE, racks, optical interconnect, RAM, and staff are excluded; they typically add 15 to 30 percent on top of the number printed here. Throughput scaling across model sizes is treated as inverse-linear in parameters, which holds for dense decoder-only models in the 8B to 405B range under FP8 with continuous batching; MoE architectures and speculative decoding can outperform this curve by 1.5x to 3x.
Hardware throughput multipliers, relative to H100 FP8 70B, are set from MLPerf Inference v4.1 and SemiAnalysis benchmarks: H200 1.35x (memory-bandwidth bound), B200 2.5x, GB200 per-GPU 3.1x (NVLink-72 scaling), TPU v5p 0.95x, Trillium 1.40x, Trainium 2 1.10x. These are order-of-magnitude defaults; production numbers vary with model architecture, batch size, sequence length, and software stack.
For the full Compute Cost Curve framework with six-step decomposition across acquisition, power, cooling, network, depreciation, and idle, plus model-architecture overlays and scenario tables, see the methodology one-pager. Adjacent reading: AI infrastructure insights and the AI and data economics practice.
Sources for default values and benchmarks: MLPerf Inference v4.1 (MLCommons, 2024); SemiAnalysis quarterly compute cost notes; EIA Form EIA-861 industrial electricity rates; Uptime Institute Global Data Center Survey 2024; vendor spec sheets (NVIDIA H100, H200, B200, GB200; Google TPU v5p, Trillium; AWS Trainium 2); Microsoft and Meta extended-depreciation 10-K disclosures (2023, 2024).