AI compute and energy 2026-04-26 12 min read

The Custom Silicon Insurgency Against Nvidia in 2026

AWS Trainium 2, Google TPU v5p and Trillium, Microsoft Maia, Meta MTIA, and a possible OpenAI ASIC are reshaping where AI compute margin lives, but the binding constraint sits one layer down at HBM and CoWoS-L.

The hyperscalers spent 2024 and 2025 telling investors that custom silicon would relieve their dependence on Nvidia, and 2026 is the year those claims start meeting empirical scrutiny. AWS has stood up Project Rainier for Anthropic at a publicly disclosed scale of more than one million Trainium 2 chips. Google has placed its TPU v5p generation into broad external availability through Google Cloud and is shipping the inference oriented Trillium (TPU v6e), with Anthropic running a meaningful share of frontier training and inference on TPU alongside Trainium. Microsoft is ramping Maia 100 inside its own data centers for OpenAI and first party workloads, Meta has fielded MTIA v2 for ranking and recommendation, and OpenAI has confirmed silicon design discussions with Broadcom. The financial question is narrower than the headlines suggest: how much of Nvidia's roughly seventy five percent gross margin on data center GPUs (per the FY2025 10-K) can custom ASICs actually capture, and over what time horizon? The answer turns less on chip design than on three chokepoints: HBM3e and HBM4 capacity at SK hynix, Samsung, and Micron; advanced packaging at TSMC under the CoWoS-L roadmap; and the CUDA software estate. This brief sizes the displacement, benchmarks cost per token across six accelerators on a representative inference workload, and identifies where the margin actually moves.

The displacement thesis and what it actually claims

Three claims are usually bundled together in the custom silicon narrative, and untangling them changes the conclusion. The first claim is that hyperscalers can build accelerators that match or exceed Nvidia on performance per dollar for their own workloads. The second is that they will redirect a meaningful share of capex from Nvidia GPUs to internal silicon. The third is that this will compress Nvidia's margin structure. Only the first two have strong evidence in 2026. The third is contingent on whether HBM and packaging supply expand fast enough to make custom ASICs widely deployable rather than capacity rationed.

The capex direction is unambiguous. AWS, Google, Microsoft, and Meta have collectively guided to more than three hundred billion dollars in 2026 capital spending across their most recent quarterly disclosures, with AI infrastructure accounting for the majority of incremental growth. AWS executives at re:Invent 2024 and through 2025 framed Trainium 2 as the default accelerator for Anthropic's frontier training, with Project Rainier scaled at more than one million chips across multiple sites. Google has moved TPU v5p into external general availability and is selling Trillium TPU v6e capacity to enterprise customers. Microsoft has publicly committed to a multi generation Maia roadmap and is operating Maia 100 alongside Nvidia H100 and B200 in production.

What is not yet visible in the financial statements is significant Nvidia margin compression. Nvidia's data center revenue continued to grow through fiscal 2025 and 2026 reporting, and Wells Fargo and Morgan Stanley analyst notes in early 2026 still model data center gross margin in the low to mid seventy percent range. The plausible explanation is that custom silicon is taking the marginal workload rather than the installed base, that the AI compute market is growing fast enough to absorb both, and that HBM and CoWoS-L capacity constraints have throttled how quickly the hyperscalers can substitute even when they want to.

Hyperscaler silicon: what each program actually ships

AWS Trainium 2 is the most aggressively scaled custom training accelerator in production. Disclosed specifications place each chip at roughly 1.3 petaflops of dense FP8 with 96 GB of HBM, packaged into UltraServers of 64 chips with NeuronLink for collective operations. Project Rainier, the Anthropic dedicated cluster announced in late 2024 and built out through 2025 and 2026, is publicly described by AWS as more than one million Trainium 2 chips across multiple sites, structured to support frontier scale pretraining without a hard dependency on Nvidia allocation. Inferentia 3, the inference oriented sibling, is positioned at lower memory bandwidth but higher chips per server for cost sensitive serving.

Google's TPU program is the longest running, and it is the only one whose software stack, built on XLA and JAX, is comparable to CUDA in maturity. TPU v5p, the training tier, ships with 95 GB of HBM per chip and scales to pods of 8,960 chips with 4,800 Gbps optical interconnect on the inter chip network. Trillium, marketed as TPU v6e, is Google's inference and mid scale training tier, with substantially better performance per watt than v5e and broad availability on Google Cloud through 2026. Anthropic's disclosed dual stack approach uses TPU v5p and Trillium for a meaningful share of training and inference workloads alongside Trainium 2, a deliberate hedge against single supplier risk.
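The chip counts and per chip HBM sizes quoted above imply aggregate memory per scale unit. The following is a minimal Python sketch of that arithmetic; the class and field names are illustrative, and the use of decimal terabytes (1 TB = 1,000 GB) is an assumption rather than a vendor convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """One scale-up domain, described only by figures quoted in the text."""
    name: str
    chips: int
    hbm_gb_per_chip: int

    @property
    def total_hbm_gb(self) -> int:
        # Aggregate HBM is simply chips multiplied by per chip capacity.
        return self.chips * self.hbm_gb_per_chip


UNITS = [
    ScaleUnit("Trainium 2 UltraServer", chips=64, hbm_gb_per_chip=96),
    ScaleUnit("TPU v5p pod", chips=8960, hbm_gb_per_chip=95),
]

for unit in UNITS:
    # Decimal terabytes are an assumption made for readability.
    print(f"{unit.name}: {unit.total_hbm_gb:,} GB HBM (~{unit.total_hbm_gb / 1000:.1f} TB)")
```

The two units are not like for like: an UltraServer is a single scale up server, while a TPU pod is a full interconnect domain, so the totals show the scale each program is designed around and do not rank the chips.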

Microsoft Maia 100 is in production for OpenAI and Microsoft first party workloads, with Maia 200 publicly previewed at Microsoft Build 2025 as the next generation. Meta's MTIA v2 powers ranking, recommendation, and content understanding workloads at fleet scale across Facebook, Instagram, and Threads, and Meta's Q4 2025 earnings call confirmed expanded MTIA deployment for generative inference. OpenAI's silicon work with Broadcom, first reported in 2024 and progressively confirmed through 2025 and 2026, targets a custom inference accelerator on TSMC advanced nodes with first silicon expected in the 2026 to 2027 window. Cerebras and Groq, both startups, are not displacing Nvidia at hyperscale but have carved a defensible niche in low latency inference where their architectural choices, wafer scale at Cerebras and deterministic compilation at Groq, deliver step function improvements on specific workloads.

Cost per token: a representative inference workload

The most useful unit of comparison for hyperscaler economics is cost per million output tokens on a representative inference workload, holding model size and quality target constant. The table below compiles vendor disclosures and SemiAnalysis published estimates for a 70 billion parameter dense transformer running batched serving at 1024 token context with 256 token output, deployed in production class clusters with realistic utilization. The numbers are wide ranges rather than point estimates because amortization assumptions, power cost, and utilization swing the result by a factor of two or more. The structural pattern matters more than any single cell.
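To see why amortization, power cost, and utilization can swing the result by a factor of two or more, it helps to write the unit cost out. The sketch below is one plausible formulation, not the method behind the vendor or SemiAnalysis figures: it divides the hourly cost of one accelerator by the tokens it actually serves per hour. Every input value in the example is a hypothetical placeholder.

```python
def cost_per_million_tokens(
    capex_usd: float,
    amortization_years: float,
    power_kw: float,
    power_usd_per_kwh: float,
    peak_tokens_per_second: float,
    utilization: float,
) -> float:
    """Hourly accelerator cost divided by tokens served per hour, per million tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    hours = amortization_years * 365 * 24
    hourly_capex = capex_usd / hours               # straight line amortization
    hourly_power = power_kw * power_usd_per_kwh    # energy only, no cooling overhead
    tokens_per_hour = peak_tokens_per_second * utilization * 3600
    return (hourly_capex + hourly_power) / tokens_per_hour * 1_000_000


# Hypothetical inputs: the same chip under favourable and unfavourable assumptions.
favourable = cost_per_million_tokens(30_000, 4, 1.0, 0.08, 2_000, 0.6)
unfavourable = cost_per_million_tokens(30_000, 3, 1.0, 0.12, 2_000, 0.3)
print(f"favourable:   ${favourable:.2f} per million output tokens")
print(f"unfavourable: ${unfavourable:.2f} per million output tokens")
print(f"spread:       {unfavourable / favourable:.1f}x")
```

Utilization sits in the denominator, so halving it doubles the cost on its own. That is the main reason the figures are better read as ranges than as point estimates.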

Three patterns are robust across the range. First, custom inference ASICs (Trainium 2, Trillium TPU v6e, MTIA v2) come in materially below H100 on cost per token at hyperscaler scale, with the gap widest where the workload is well matched to the accelerator's memory bandwidth profile. Second, Nvidia B200, the Blackwell training and inference part, narrows the gap meaningfully against custom silicon because the 192 GB HBM3e configuration and improved FP8 throughput shift the bottleneck back toward compute density where Nvidia retains a process and architecture lead. Third, the differentiated B300 part, with higher HBM3e per package and a refresh cadence aimed squarely at long context inference, is positioned to defend the most profitable inference segment through 2026 and into 2027.

The crossover where custom ASICs beat Nvidia on TCO is real but narrower than headlines imply. It is sharpest for inference workloads with stable token mix, predictable demand, and a tenant with the engineering budget to optimize against the ASIC's compiler. It is weakest for training workloads with rapidly changing model architectures, where CUDA's kernel ecosystem and the breadth of community implementations still save engineering quarters. This is why even Anthropic, with deep TPU and Trainium investment, continues to consume Nvidia GPU capacity for specific workloads.

Accelerator	HBM per chip	Process node	Cost per million output tokens (USD)	Primary deployment
Nvidia H100 SXM	80 GB HBM3	TSMC 4N	0.55 to 0.95	Broad: every hyperscaler and neocloud
Nvidia B200	192 GB HBM3e	TSMC 4NP, CoWoS-L	0.30 to 0.55	Hyperscalers and frontier labs in 2026
Google TPU v5p	95 GB HBM3	TSMC N5	0.25 to 0.45	Google internal, Google Cloud, Anthropic
AWS Trainium 2	96 GB HBM3	TSMC N5 family	0.22 to 0.42	AWS internal, Anthropic Project Rainier
Meta MTIA v2	128 GB HBM3	TSMC N5	0.20 to 0.40	Meta ranking, recommendation, gen inference
Microsoft Maia 100	64 GB HBM2e	TSMC N5	0.35 to 0.60	Azure first party, OpenAI

Estimated cost per million output tokens, 70B dense inference, 2026 production clusters (USD, range)
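The ranges in the table can be compared mechanically. The sketch below copies the six ranges from the table, computes each midpoint relative to H100, and checks which ranges overlap. Using the midpoint as a summary statistic is an assumption made here for illustration; the table itself gives only ranges.

```python
# Cost per million output tokens (USD), low and high ends, copied from the table.
RANGES = {
    "Nvidia H100 SXM": (0.55, 0.95),
    "Nvidia B200": (0.30, 0.55),
    "Google TPU v5p": (0.25, 0.45),
    "AWS Trainium 2": (0.22, 0.42),
    "Meta MTIA v2": (0.20, 0.40),
    "Microsoft Maia 100": (0.35, 0.60),
}


def midpoint(r: tuple[float, float]) -> float:
    return (r[0] + r[1]) / 2


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed ranges share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


h100 = RANGES["Nvidia H100 SXM"]
b200 = RANGES["Nvidia B200"]

for name, r in sorted(RANGES.items(), key=lambda kv: midpoint(kv[1])):
    saving = 1 - midpoint(r) / midpoint(h100)
    print(
        f"{name:<20} midpoint {midpoint(r):.2f}  "
        f"vs H100 {saving:+.0%}  "
        f"overlaps H100: {overlaps(r, h100)}  overlaps B200: {overlaps(r, b200)}"
    )
```

On these ranges, TPU v5p, Trainium 2, and MTIA v2 sit entirely below the H100 range, while all three overlap the B200 range. That matches the first two patterns described above.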

The HBM bottleneck: why memory not logic is the gating factor

Every accelerator in the table above lives or dies on high bandwidth memory. HBM3e in 24 GB stack form factors entered volume production in 2024, with 36 GB stacks ramping through 2025 and into 2026. SK hynix retains the largest share of qualified HBM3e supply, particularly for Nvidia, with Samsung accelerating qualification through 2025 and Micron now in commercial supply for B200 class systems. SK hynix and Samsung have both publicly committed to HBM4 sampling in 2025 and volume in 2026, with the new generation introducing logic die customization and base die process upgrades that materially change the supplier interface.

The constraint has shifted from a memory shortage to a packaging coupled shortage. HBM die alone are not the bottleneck in 2026. The bottleneck is HBM that can be co packaged with logic on TSMC CoWoS-L substrates at the volumes the hyperscalers need. CoWoS-L, the L variant using a local silicon interconnect rather than a full silicon interposer, is what enables the largest reticle sized accelerators with eight or more HBM stacks per package. TSMC has guided to roughly doubling CoWoS-L capacity from 2024 to 2026 and to further expansion in 2027, but every incremental wafer is being allocated quarters in advance.

This is the reason the displacement thesis cannot move at narrative speed. Even if AWS, Google, and Microsoft wanted to halve their Nvidia orders tomorrow, the alternative paths route through the same HBM and CoWoS-L queue. Nvidia has the longest standing commercial relationships with both SK hynix and TSMC and the deepest committed orderbook, which means it absorbs supply expansions ahead of new entrants. The custom silicon programs are growing fast precisely because the hyperscalers are willing to take longer lead times and second source HBM in exchange for not paying Nvidia's gross margin.
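The packaging coupled constraint can be expressed as a simple minimum: the number of finished accelerators is capped by whichever runs out first, HBM stacks or packaging slots. The sketch below is a toy model with entirely hypothetical volumes; the function name and its inputs are illustrative and do not describe any supplier's actual capacity.

```python
def deployable_packages(
    hbm_stacks: int,
    stacks_per_package: int,
    packaging_slots: int,
) -> tuple[int, str]:
    """Return (finished packages, binding constraint) for one supply period."""
    hbm_limited = hbm_stacks // stacks_per_package
    if hbm_limited <= packaging_slots:
        return hbm_limited, "HBM"
    return packaging_slots, "packaging"


# Hypothetical volumes, chosen only to show the mechanism.
base = deployable_packages(hbm_stacks=8_000_000, stacks_per_package=8, packaging_slots=700_000)
more_hbm = deployable_packages(hbm_stacks=8_800_000, stacks_per_package=8, packaging_slots=700_000)
more_slots = deployable_packages(hbm_stacks=8_000_000, stacks_per_package=8, packaging_slots=900_000)

print("base:           ", base)
print("10% more HBM:   ", more_hbm)
print("more packaging: ", more_slots)
```

When packaging is the binding constraint, adding HBM supply does not change output, which is why a custom silicon program cannot scale simply by second sourcing memory.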

The CUDA moat in 2026: still real, narrower than 2023

CUDA remains Nvidia's most durable structural advantage, but the moat is narrower than it was at the GPT-4 launch. Three forces have eroded it. First, the major model frameworks (PyTorch and JAX) now compile to multiple backends with workable performance, and the engineering effort to port a frontier model to TPU or Trainium has fallen from quarters to weeks for teams with sufficient depth. Second, the inference runtime layer, where most production AI cost lives, has commoditized faster than the training stack: vLLM and SGLang both run on multiple accelerators with competitive performance, and ASIC vendors have invested heavily in compiler quality. Third, hyperscalers can absorb porting cost in a way that smaller customers cannot, which means custom silicon eats the workloads with the highest absolute compute spend first.

The moat is widest where the workload is small, the team is generalist, and the model architecture is changing rapidly. This describes most enterprise AI deployment in 2026, which is why Nvidia's broad customer base outside hyperscale continues to consume H100 and L40S class hardware at high prices. The moat is narrowest where the workload is at hyperscale, the engineering team is specialized, and the model architecture is stable across many serving quarters. This describes Anthropic on Trainium 2, OpenAI on Microsoft Maia and (prospectively) Broadcom silicon, and Meta on MTIA. The intermediate zone, neoclouds like CoreWeave, Lambda, and Crusoe, remains structurally Nvidia loyal because their differentiation is exactly the breadth of CUDA software support.

MLPerf v4.1 and v5.0 inference results released in late 2025 and early 2026 show TPU v5p, Trainium 2, and B200 within roughly a factor of two of each other on most submitted benchmarks, with B200 retaining an edge on the largest models and the others closing on inference oriented submissions. The benchmark understates the production gap because it measures peak performance under ideal conditions; real workload mix, software maturity, and operations cost all widen it. But the direction is consistent: the silicon performance gap is no longer the dominant variable. Software, supply, and contracts are.

Where the margin actually moves

Three margin pools are in play. The first is Nvidia's gross margin on incremental data center GPU revenue, which was roughly seventy five percent in fiscal 2025 per the 10-K. Custom silicon does compress this at the margin, but the path runs through pricing on next generation parts (B300, Rubin) rather than a write down on installed H100 and B200 capacity. Wells Fargo and Morgan Stanley analyst models in early 2026 broadly assume that Nvidia data center gross margin holds in the low seventy percent range through 2026 and 2027, with downside scenarios closer to mid sixty percent if HBM and CoWoS-L supply normalize faster than demand.

The second margin pool is the hyperscaler operating margin on AI services. AWS, Google Cloud, and Azure each disclose AI revenue contributions in growing detail, and the share of that revenue served on custom silicon directly raises gross margin on the service. Goldman Sachs and Morgan Stanley have separately estimated that running a workload on Trainium 2 versus an H100 reservation can lift hyperscaler gross margin on that workload by 15 to 25 percentage points at current pricing, before any second order effect on retention through differentiated pricing. This is the single largest reason the hyperscalers are willing to spend on silicon teams.
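The mechanics of that margin lift are straightforward: at a fixed selling price, gross margin rises by the cost saving divided by the price. The sketch below takes the H100 and Trainium 2 midpoints from the cost table as serving costs and pairs them with a hypothetical selling price. The price is an assumption chosen for illustration, so the output is not an independent check on the analyst estimates.

```python
def gross_margin(price: float, cost: float) -> float:
    """Gross margin as a fraction of price."""
    if price <= 0:
        raise ValueError("price must be positive")
    return (price - cost) / price


def margin_lift_points(price: float, cost_before: float, cost_after: float) -> float:
    """Change in gross margin, in percentage points, at a fixed selling price."""
    return (gross_margin(price, cost_after) - gross_margin(price, cost_before)) * 100


PRICE = 2.00            # hypothetical USD per million output tokens charged to the customer
COST_H100 = 0.75        # midpoint of the H100 range in the cost table
COST_TRAINIUM2 = 0.32   # midpoint of the Trainium 2 range in the cost table

lift = margin_lift_points(PRICE, COST_H100, COST_TRAINIUM2)
print(f"gross margin on H100:       {gross_margin(PRICE, COST_H100):.1%}")
print(f"gross margin on Trainium 2: {gross_margin(PRICE, COST_TRAINIUM2):.1%}")
print(f"lift: {lift:.1f} percentage points")
```

Because the lift equals the cost difference divided by the price, the same absolute saving produces a larger margin gain on cheaper services, and the result is highly sensitive to the assumed price.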

The third margin pool is at TSMC and the HBM suppliers. CoWoS-L is the rate limiting step on every accelerator in the market, and TSMC's pricing power on advanced packaging has expanded materially since 2023. SK hynix HBM3e and HBM4 ASPs have followed a similar trajectory, with disclosures pointing to per gigabyte pricing well above legacy DRAM and tight bilateral pricing on HBM4 customization. The structural conclusion for 2026 and 2027 is that the cleanest exposure to the AI compute buildout is upstream of the accelerator itself, and that whether Nvidia or a hyperscaler captures any given dollar of margin matters less than whether that dollar shows up at TSMC and SK hynix on the way through.

Implications for buyers, builders, and policymakers

For enterprise buyers and builders, the practical implication is that single accelerator strategies are riskier in 2026 than in 2024. The leading frontier labs (Anthropic, OpenAI) are explicitly multi accelerator, and Anthropic's TPU plus Trainium dual stack is a deliberate operational hedge against capacity allocation and pricing risk at any single supplier. Enterprises building production AI stacks should plan for at least two accelerator backends in the inference path within three years, write cloud contracts that preserve optionality rather than locking into instance families, and avoid CUDA specific kernels in the hot path of new inference systems. The compiler maturity gap between CUDA and the leading alternatives has closed enough that targeting PyTorch, JAX, vLLM, or SGLang directly preserves the ability to migrate workloads as TCO inverts. For training, CUDA remains the path of least resistance for most teams, but Anthropic's experience shows that frontier scale training on non Nvidia silicon is operationally feasible with sufficient engineering investment.
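One way to picture two accelerator backends in the inference path is a thin routing layer that treats each backend as interchangeable capacity. The sketch below is a toy illustration in plain Python, not a production scheduler and not tied to any real serving framework: the `Backend` class, the `route` function, the pool names, and all costs and capacities are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    """A pool of inference capacity on one accelerator family (hypothetical)."""
    name: str
    usd_per_million_tokens: float
    free_capacity_tokens: int


def route(request_tokens: int, backends: list[Backend]) -> Optional[Backend]:
    """Pick the cheapest backend with enough free capacity and reserve it."""
    candidates = [b for b in backends if b.free_capacity_tokens >= request_tokens]
    if not candidates:
        return None  # no backend can take the request; the caller must queue or shed load
    chosen = min(candidates, key=lambda b: b.usd_per_million_tokens)
    chosen.free_capacity_tokens -= request_tokens
    return chosen


pools = [
    Backend("asic-pool", usd_per_million_tokens=0.30, free_capacity_tokens=1_000),
    Backend("gpu-pool", usd_per_million_tokens=0.50, free_capacity_tokens=10_000),
]

first = route(800, pools)    # fits on the cheaper pool
second = route(800, pools)   # cheaper pool is now too full, so this falls back
print(first.name, second.name)
```

The second pool serves as both a cost hedge and a capacity hedge. The design only works if the model can actually run on both backends, which is the practical argument for keeping CUDA specific kernels out of the hot path.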

For policymakers and capital allocators, the binding constraint is energy and packaging, not silicon design. OpenAI Stargate Phase 1, the Texas data center buildout funded by Oracle, SoftBank, MGX, and OpenAI itself, is structured around securing power, land, and HBM coupled packaging slots rather than a particular accelerator vendor. The United States, Korea, Japan, and Taiwan are the four jurisdictions where the upstream supply expansion is concentrated, and policy attention focused exclusively on the visible accelerator layer misses where the actual bottleneck and the actual margin currently sit.


Cite this brief

@misc{hossen2026customsiliconvsnvidia2026,
  author = {Hossen, Md Deluair},
  title  = {The Custom Silicon Insurgency Against Nvidia in 2026},
  year   = {2026},
  url    = {https://deluair.com/consultancy/insights/custom-silicon-vs-nvidia-2026},
  note   = {Deluair Consultancy briefs}
}

Hossen, M. D. (2026). The Custom Silicon Insurgency Against Nvidia in 2026. Deluair Consultancy briefs. https://deluair.com/consultancy/insights/custom-silicon-vs-nvidia-2026

Hossen, Md Deluair. "The Custom Silicon Insurgency Against Nvidia in 2026." Deluair Consultancy briefs, 2026-04-26. https://deluair.com/consultancy/insights/custom-silicon-vs-nvidia-2026.
