Deep dive · 2026-06-29 · OpenAI taped out a custom inference chip in nine months. The question isn’t whether it works — it’s when owning the silicon actually pays, and who keeps the savings.
On June 24, OpenAI and Broadcom unveiled Jalapeño, a custom chip built from scratch for one job: running large language models. OpenAI says it designed the thing in nine months — the fastest high-performance ASIC cycle it’s aware of — and that early silicon delivers performance per watt “substantially better” than the state of the art. First deployment is end of 2026, at gigawatt scale. Microsoft has reportedly committed to buy 40% of the first production run.
The official framing is “owning the full stack.” OpenAI’s argument is that controlling everything from chip to product lets it run models faster, more reliably, and cheaper. That’s true, and it’s also the polite version. The blunt version is a single number.
Nvidia keeps roughly 70 cents of gross margin on every dollar of GPU it sells. If you are buying GPUs to serve inference at gigawatt scale, that margin is a tax you pay forever, on the workload that defines your cost base. Building your own chip is how you stop paying it.
So here’s the thesis. Every serious AI lab is becoming a chip company, and they are all making the same bet: that inference is now a big enough, stable enough workload to justify freezing it into silicon and clawing back Nvidia’s margin. The bet is sound for the handful of players with the volume to fill a custom chip. But it is still a bet — on the workload not changing, on having enough demand, and on routing around the part of Nvidia’s moat that was never the GPU. And even when the bet pays, the savings accrue to the platform, not automatically to you.
Why inference, why now
Start with where the money goes. Training a frontier model is a spike — enormous, but finite, and it happens a few times a year. Inference is the opposite: a small cost per request, multiplied by a billion users running it all day, every day, forever. In aggregate spend, inference now dwarfs training. The meter runs on the serving side.
We’ve made this argument from the business angle for a month. The labs metered their APIs and split their subscriptions because inference is the product, not a free sample. The channel dive argued the moat is the rail you serve on, not the weights. The chip is simply the deepest layer of that rail. If inference is the business, the cost of inference is the business’s gross margin — and right now a 70%-margin slice of that cost belongs to Nvidia.
That’s the structural pressure. Every token you serve runs on a GPU you bought at a markup designed to make Nvidia one of the most profitable companies on earth. At small scale you eat it; the GPU’s flexibility is worth the premium. At gigawatt scale, the premium becomes the single largest line item you don’t control. Custom silicon is the move that turns Nvidia’s margin into your own.
This has happened before, and it worked
The template is a decade old. Google built the first Tensor Processing Unit not to compete with Nvidia but to survive its own success.
The story Google tells is a 2013 calculation. If Android users adopted voice search at the scale Google expected — three minutes a day, each — serving the speech models would have required doubling the company’s entire global data center footprint. You cannot buy your way out of that with general-purpose hardware. So Google built a chip that did one thing. TPU v1 went into production in 2015 and was announced at I/O 2016, inference-only. It was, in the most literal sense, built to save Google from its own AI demand.
The numbers explain why specialization wins. A GPU is a general-purpose parallel processor carrying what one analysis calls “architectural baggage” — it was designed to render game textures and run scientific simulations, and that generality costs silicon. On the original TPU, more than 90% of the silicon did useful matrix math, versus roughly 30% on a contemporary GPU. Google reported 15–30× higher performance and 30–80× better performance per watt than the CPUs and GPUs of the day.
The catch is in the same sentence. A TPU’s systolic array can only multiply matrices efficiently. It cannot render graphics, browse the web, or run a spreadsheet. Google accepted total inflexibility because neural-network inference is matrix multiplication repeated billions of times, and it bet that would stay true. It did. Ten years later, Google’s TPUs are good enough that Anthropic — Google’s own model competitor — agreed to run on up to a million of them: the latest-generation “Ironwood” chip became the first custom ASIC to reach seven-figure deployment at a single customer.
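To make the inflexibility concrete, here is a toy model of a systolic array. It assumes an output-stationary layout, in which each cell of the grid accumulates one entry of the result as operands sweep across diagonally. It is a teaching sketch, not a description of Google's actual design: the function name and the timing model are illustrative, and the code simulates the wavefront schedule rather than the physical wiring.

```python
def systolic_matmul(a, b):
    """Multiply a (n x k) by b (k x m) on a simulated n x m grid of cells.

    Assumed output-stationary schedule: cell (i, j) owns c[i][j] and, on
    clock tick t, receives the operand pair with inner index s = t - i - j.
    Cells further from the top-left corner start later, which is the
    diagonal "wavefront" that gives systolic arrays their name.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    cycles = n + m + k - 2  # ticks until the last cell sees its last operand
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    # The only operation a cell can perform: multiply-accumulate.
                    c[i][j] += a[i][s] * b[s][j]
    return c, cycles


if __name__ == "__main__":
    product, ticks = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, ticks)
```

The grid does nothing except multiply and accumulate on a fixed schedule, which is both why so much of the silicon can be arithmetic and why it cannot do anything else.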
The economics that make it rational
The reason this is suddenly everyone’s strategy, not just Google’s, is that the math now closes quickly.
A custom AI ASIC trades flexibility for roughly 3–5× better performance per watt on its target workload. Google claims its TPUs deliver 4.7× better performance per dollar on inference and 67% lower power. AWS makes a similar pitch for Trainium and Inferentia: up to 50% savings versus Nvidia inference, in a deliberate two-chip split between training and serving.
Against that, the development cost is smaller than it sounds. A custom chip program runs on the order of $300–500 million, and a hyperscaler running billions of inferences a day recoups that in under a year, purely by not paying Nvidia’s 70% margin on the GPUs it would otherwise buy. When the alternative is buying tens of billions of dollars of GPUs at full markup, half a billion in NRE is a rounding error with a fast payback. Morgan Stanley estimates custom ASICs will take 25% of the AI inference market this year, up from under 5% in 2023.
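The payback claim can be checked with back-of-envelope arithmetic. The sketch below is illustrative only. The $10 billion annual GPU spend is an assumption, not a figure from any company. So is the `capture_fraction` parameter, which stands for the share of Nvidia's margin actually recovered once the design partner's fee and manufacturing costs are paid. Only the $300–500 million program cost and the roughly 70% gross margin come from the text above.

```python
def markup_from_margin(gross_margin):
    """Markup over cost implied by a gross margin (margin is a share of price)."""
    return gross_margin / (1.0 - gross_margin)


def payback_months(nre, annual_gpu_spend, gross_margin, capture_fraction):
    """Months for avoided vendor margin to repay the one-time design cost (NRE)."""
    annual_savings = annual_gpu_spend * gross_margin * capture_fraction
    if annual_savings <= 0:
        raise ValueError("no savings, so the program never pays back")
    return 12.0 * nre / annual_savings


if __name__ == "__main__":
    ASSUMED_ANNUAL_GPU_SPEND = 10e9  # hypothetical, for illustration only
    GROSS_MARGIN = 0.70              # from the article
    print(f"implied markup over cost: {markup_from_margin(GROSS_MARGIN):.0%}")
    for nre in (300e6, 500e6):       # program cost range from the article
        for capture in (1.0, 0.5, 0.25):  # hypothetical capture fractions
            months = payback_months(nre, ASSUMED_ANNUAL_GPU_SPEND,
                                    GROSS_MARGIN, capture)
            print(f"NRE ${nre / 1e6:.0f}M, capture {capture:.0%}: "
                  f"{months:.1f} months")
```

Under these assumptions, even capturing a quarter of the margin repays the top-end program cost within a year. The conclusion is only as good as the spend figure plugged in: shrink the annual GPU bill by a factor of ten and the same program takes years to pay back.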
There’s a quiet winner in all of this: Broadcom. It is the design partner behind Google’s TPU, Meta’s MTIA, Microsoft’s Maia, and now OpenAI’s Jalapeño. The labs are going vertical, but they’re all hiring the same arms dealer to do it. Broadcom sells the picks during the gold rush, and its custom-silicon business is a large part of why its networking-and-ASIC revenue is climbing so quickly.
Two strategies, and the interesting one is Anthropic’s
Watch how the labs split, because the divide tells you what they each believe.
OpenAI and Google build. Google has a decade and seven generations of TPUs and now previews an eighth that splits training and inference at TSMC’s 2nm node. OpenAI just taped out its first chip and is buying down its volume risk by pre-selling 40% to Microsoft. Both are betting that owning the silicon is worth the capital and the inflexibility.
Anthropic does not build a chip. It rents three. It has a $40 billion, up-to-5-gigawatt deal for up to a million Google TPUs; it will run more than a million Amazon Trainium2 chips via Project Rainier; and it still uses Nvidia. Anthropic went multi-silicon instead of vertical — the same model-portability logic this column has pushed all quarter, applied one layer down, to the hardware.
This is the genuinely interesting fork. Building your own chip is the maximalist margin play: capture all of Nvidia’s markup, eat all of the inflexibility risk. Renting across three silicon vendors is the hedge: you never capture the full margin, but you never bet your serving capacity on one frozen architecture or one supplier’s roadmap, and you can play TPU against Trainium against Nvidia on price. Given how badly Anthropic got burned this quarter by depending on a single point of failure it didn’t control — its own government — the multi-silicon posture looks less like indecision and more like a company that has internalized supplier risk the hard way.
The bear case, steelmanned
If building a chip were obviously correct, everyone would have done it years ago. Here is the case against, taken seriously, because it’s where the risk actually lives.
Inflexibility is a bet on the future of the workload. A custom ASIC is frozen matrix-multiplication hardware. It pays off only if the workload it was designed for is still the workload three years from now — because a nine-month design plus the ramp to volume means you are committing today to the shape of inference in 2028. If the dominant architecture shifts — diffusion-style language models, a new attention mechanism, a routing scheme that changes the compute profile of mixture-of-experts — a purpose-built chip is expensive and slow to retool, and Nvidia’s flexible GPU absorbs the change for free. The labs are betting that the transformer is stable infrastructure. So far it has been. That is not the same as a guarantee.
The chip was never the moat — CUDA was. Nvidia’s real lock-in is twenty years of software: CUDA, the libraries, the millions of developers who write against it. You can match Nvidia’s silicon and still lose because nothing runs on your silicon without a compiler stack to match. The hyperscalers route around this because they control their own software top to bottom — Google has XLA, OpenAI has Triton, everyone is converging on MLIR — and they only have to run their own models on their own chips, which is a far smaller compatibility surface than “every workload on earth.” But that’s exactly why this is a giants-only game. A startup can’t build a chip, because it can’t build the compiler, the kernels, and the captive workload to justify either. Custom silicon is not a democratizing force. It concentrates inference in the few companies big enough to own the whole stack.
The moat is migrating to networking. Nvidia saw the ASIC threat coming and moved the fight to where it’s still ahead: the interconnect. Modern inference doesn’t run on a chip; it runs on thousands of chips wired into one coherent fabric, and Nvidia’s NVLink lets GPUs share memory as if they were a single device. Its networking revenue hit $10.98 billion in a single quarter, up 263% year over year. A custom ASIC that’s brilliant per-chip still needs a fabric to scale, and Nvidia is now licensing NVLink to third parties precisely to stay the connective tissue even in a world of custom accelerators. You can beat the GPU and still be renting Nvidia’s network.
A chip only pays if you fill it. The whole economic case rests on volume: billions of inferences a day to amortize the design cost and keep the deployed chips busy. Below that scale, the GPU’s flexibility wins and the ASIC is a stranded asset. OpenAI clears the bar with a billion users. The tell is Microsoft taking 40% of the first run: that’s OpenAI de-risking its own volume by selling the capacity it can’t yet guarantee it will fill. Even the company best positioned to do this is hedging the demand side.
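The volume argument is a fixed-versus-variable cost comparison, and a minimal sketch makes the threshold visible. Every number here is hypothetical: the per-million-token serving costs, the three-year amortization window, and the function names are assumptions chosen for illustration, not reported figures.

```python
def annual_cost_gpu(tokens, gpu_cost_per_mtok):
    """GPU path: no design cost, higher cost per million tokens served."""
    return tokens / 1e6 * gpu_cost_per_mtok


def annual_cost_asic(tokens, asic_cost_per_mtok, nre, amortization_years):
    """ASIC path: design cost spread over its useful life, lower unit cost."""
    return nre / amortization_years + tokens / 1e6 * asic_cost_per_mtok


def break_even_tokens(gpu_cost_per_mtok, asic_cost_per_mtok, nre,
                      amortization_years):
    """Annual token volume at which both paths cost the same.

    Returns None when the ASIC is not cheaper per token, because then
    no volume ever justifies the design cost.
    """
    saving_per_mtok = gpu_cost_per_mtok - asic_cost_per_mtok
    if saving_per_mtok <= 0:
        return None
    return nre / amortization_years / saving_per_mtok * 1e6


if __name__ == "__main__":
    # All inputs below are hypothetical placeholders.
    GPU, ASIC, NRE, YEARS = 0.50, 0.20, 400e6, 3
    threshold = break_even_tokens(GPU, ASIC, NRE, YEARS)
    print(f"break-even volume: {threshold:.3e} tokens per year")
    for volume in (threshold / 10, threshold * 10):
        gpu = annual_cost_gpu(volume, GPU)
        asic = annual_cost_asic(volume, ASIC, NRE, YEARS)
        cheaper = "ASIC" if asic < gpu else "GPU"
        print(f"{volume:.3e} tokens/yr: GPU ${gpu:,.0f}, "
              f"ASIC ${asic:,.0f}, cheaper: {cheaper}")
```

Below the threshold the amortized design cost dominates and the chip is a stranded asset. Selling part of the first production run to a second buyer is one way of pushing effective volume over that line.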
What would make Jalapeño a success — and what would make it a press release
The honest read is that we don’t know yet, and the announcement is structured to make us feel like we do.
The perf-per-watt figures are self-reported, with no independent benchmarks and no detailed comparison. First-generation custom silicon has a long history of slipping its dates and under-delivering against launch specs: Google’s early TPUs and Amazon’s early Trainium both took generations to become competitive. Nine months to tape-out is genuinely fast, but tape-out is not production, and production is not production-at-the-perf-you-promised.
So watch three things, in order. First, does Jalapeño actually carry meaningful production traffic at scale by its end-2026 target, or does that date slip into 2027? Second, when independent numbers appear, do they hold up against the self-reported claims? Third — the real one — does OpenAI move serious inference off Nvidia and onto its own chip, or does Jalapeño end up as a negotiating chip that gets it a better price from Nvidia while Nvidia still serves the traffic? The threat of a custom ASIC is worth something even if the ASIC never ships at volume. Some of these programs are real silicon; some are leverage.
So what, if you ship products on these APIs
You don’t buy chips. But this changes the ground under your cost line in three ways worth holding onto.
The price of inference is going to keep falling for a structural reason. Not a promotion: a margin transfer. The companies selling you tokens are systematically removing Nvidia’s roughly 70% gross margin from their own cost base. That puts durable downward pressure under inference pricing, the same direction as DeepSeek’s permanent floor but driven by hardware economics rather than strategic dumping. Plan your unit economics assuming the floor keeps dropping.
But the savings are the platform’s to keep. The lab captures Nvidia’s margin; whether it passes that to you depends on the same test as the price-cut dive: is inference the seller’s product or its complement? For OpenAI and Anthropic the token is the business, so expect them to bank a chunk of the chip savings as restored margin rather than hand it all to you. The lab that cuts inference cost is not your benefactor; it’s repairing its own gross margin, and you get the leftovers.
Multi-silicon is the new multi-cloud, and it has a portability tax. If you ever run your own inference — fine-tuned open weights on rented capacity — the chip is now part of the lock-in. Code tuned for CUDA doesn’t run free on a TPU or a Trainium pod; kernels, quantization, and serving stacks differ. Anthropic’s three-silicon hedge is the enterprise template, and it isn’t free: portability across accelerators costs real engineering, the same way model portability does. Budget for it before you need it, not after your one vendor raises prices.
And don’t confuse a chip announcement with a chip. The thing that matters — cheaper tokens in your bill — shows up when the silicon ships at volume and the savings get competed down, not when the press release runs. That’s a 2027 story at the earliest. The announcement is June 2026. Mind the gap.
The deeper pattern is the one this column keeps returning to. The AI business is integrating downward — model to harness to rail to chip — because every layer that someone else owns is a margin someone else collects. Two weeks ago the government decided it wanted a say in who gets the model. This week the labs decided they want the chip the model runs on. The frontier is being walled at the top and owned at the bottom, and the open middle — open weights, portable harnesses, rented silicon — is where whatever competition survives will have to live.
What would change my mind: if a non-hyperscaler — a lab or startup without a billion-user workload and its own compiler team — fields a custom inference chip that wins on independent benchmarks and meaningfully displaces Nvidia in production, then this isn’t a giants-only margin play and the moat I’ve described (volume + software + networking) is weaker than I think. And if the transformer-shaped workload these chips freeze into silicon gets displaced by a fundamentally different architecture within two years, the whole bet ages badly, the GPU’s flexibility turns out to have been worth the 70% after all, and the labs that rented three silicons will have been the smart ones.