
Signed reporting on research turns, product fights, policy pressure, and infrastructure bets worth paying attention to after the frenzy burns off.

Infrastructure · Byline / INFRA_03
Published March 22, 2026

FlashAttention-4 makes Blackwell kernel work an economics story

FlashAttention-4 shows Blackwell-era AI economics will be shaped by attention kernel optimization and non-tensor bottlenecks, not FLOPs headlines alone.

Lena Ortiz · Infrastructure Correspondent · 7 min read
When tensor cores scale faster than everything around them, the money moves to the kernel layer.
Cover / INFRA_03 (AI-generated editorial illustration): a Blackwell server aisle where wide tensor-compute lanes narrow into shared-memory and softmax bottlenecks before a tuned attention pipeline opens the flow again. The loud number is throughput. The strategic story is who can turn Blackwell's non-tensor choke points back into useful work.

Benchmark headlines are going to miss the point on FlashAttention-4.

The loud claim from Together AI's launch post is easy to repeat: on Blackwell B200 with BF16, FlashAttention-4 reaches up to 1,605 TFLOPS, or 71% utilization, with forward-pass performance up to 1.3x faster than cuDNN 9.13 and as much as 2.7x faster than Triton. Those are real numbers, and they matter. But if you stop there, you reduce the story to a benchmark flex.

The more interesting line in the post is the one most launch coverage will rush past. The authors argue that from Hopper H100 to Blackwell B200, BF16 tensor-core throughput jumps from 1 to 2.25 PFLOPS, while shared-memory bandwidth and the number of special function units for operations like exponentials do not scale in step. That changes the shape of the problem. Attention is no longer governed mainly by how fast tensor cores can chew through its two GEMMs. More of the fight now sits in the plumbing around them.
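
The asymmetry is easy to see with a back-of-envelope roofline. The tensor throughputs below come from the post; the exponential throughputs are invented placeholders, chosen only to illustrate how a roughly flat exp rate flips the bottleneck between generations:

```python
# Back-of-envelope roofline for attention on Hopper vs Blackwell.
# Each element of the score matrix S = QK^T costs ~4*d tensor FLOPs
# (the QK^T GEMM plus the PV GEMM) and one softmax exponential.
# Tensor throughput is from the Together post (1 -> 2.25 PFLOPS BF16);
# the exp throughputs below are ILLUSTRATIVE assumptions, not measured
# SFU rates.

HEAD_DIM = 128  # d

chips = {
    # name: (tensor PFLOPS BF16, assumed exp throughput in Tera-exp/s)
    "H100": (1.00, 3.8),   # assumed SFU exp rate
    "B200": (2.25, 4.0),   # tensor up 2.25x, exp rate nearly flat (assumption)
}

bound_by_chip = {}
for name, (pflops, texp) in chips.items():
    t_gemm = 4 * HEAD_DIM / (pflops * 1e15)  # s per score element, tensor path
    t_exp = 1.0 / (texp * 1e12)              # s per score element, exp path
    bound_by_chip[name] = "gemm" if t_gemm > t_exp else "exp"
    print(f"{name}: gemm {t_gemm:.2e}s  exp {t_exp:.2e}s  "
          f"-> {bound_by_chip[name]}-bound")
```

With these placeholder rates the H100 stays GEMM-bound while the B200 tips over into being exp-bound, which is exactly the shape of the argument in the post.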

That is why FlashAttention-4 matters. Not because it proves Blackwell is fast. Everyone already expected Blackwell to be fast. It matters because it shows Blackwell-era AI economics will be shaped by teams that can reclaim the non-tensor bottlenecks that raw hardware marketing leaves behind.

FlashAttention was always about the memory bill

The original FlashAttention paper in 2022 made its case through IO-awareness. Tri Dao and colleagues argued that ordinary attention wastes time and memory shuttling data between high-bandwidth memory and on-chip SRAM, and that better tiling could reduce those reads and writes without approximating the operation away. That was already a clue about where the project was heading. FlashAttention was never just a faster kernel. It was a reminder that the useful unit of performance is not advertised FLOPs but achieved work under real memory constraints.
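
The core trick is easy to sketch outside a GPU. The minimal NumPy version below computes exact attention one key/value tile at a time with an online softmax, so the full N x N score matrix never materializes; it is a didactic sketch of the 2022 paper's idea, not the kernel itself:

```python
import numpy as np

# Exact softmax attention, computed one key/value tile at a time. Only
# O(tile) scores exist at any moment; running max (m) and running softmax
# denominator (l) let earlier tiles be rescaled as later ones arrive.
def tiled_attention(Q, K, V, tile=64):
    N, d = Q.shape
    out = np.zeros_like(V, dtype=np.float64)  # unnormalized numerator
    m = np.full(N, -np.inf)                   # running row max (stability)
    l = np.zeros(N)                           # running denominator
    for start in range(0, K.shape[0], tile):
        Kt, Vt = K[start:start + tile], V[start:start + tile]
        S = Q @ Kt.T                          # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)             # rescale accumulated state
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vt
        m = m_new
    return out / l[:, None]

def naive_attention(Q, K, V):
    S = Q @ K.T
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
assert np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V))
```

The point of the rescaling step is that the result is bit-for-bit an exact softmax, not an approximation; the savings come entirely from never writing the full score matrix to slow memory.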

FlashAttention-4 keeps the same instinct and updates it for Blackwell. In Together's telling, the forward pass is constrained by exponential throughput, while the backward pass is constrained by shared-memory traffic. So the kernel does not simply accelerate matrix math harder. It tries to overlap matrix math with those other bottlenecks, route more work through otherwise underused resources, and keep intermediate state closer to where the next stage needs it.
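
A toy timing model shows why overlap, rather than harder acceleration of any one stage, is the lever. The per-tile stage costs below are invented for illustration:

```python
# Toy model of kernel pipelining: if GEMM work, softmax exponentials, and
# memory traffic run back-to-back, their costs add; if the kernel overlaps
# them across tiles, the steady-state cost per tile approaches the slowest
# stage alone. Stage times are ILLUSTRATIVE, not profiled numbers.

stage_us = {"gemm": 1.0, "exp": 0.9, "mem": 0.7}  # assumed cost per tile, us

serial = sum(stage_us.values())      # no overlap: stages queue behind each other
pipelined = max(stage_us.values())   # perfect overlap: bottleneck stage dominates

print(f"serial {serial:.1f}us/tile vs pipelined {pipelined:.1f}us/tile "
      f"({serial / pipelined:.1f}x)")
```

In this cartoon the pipelined kernel is 2.6x faster without any stage getting faster, which is the sense in which the non-tensor stages, not the matmuls, set the ceiling.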

The concrete tricks matter here. The post describes a forward pipeline that overlaps tensor-core work, softmax, and memory operations; a software-emulated exponential path that spreads work across FMA units instead of leaning entirely on hardware MUFU instructions; and a backward pass that uses Blackwell tensor memory and 2-CTA MMA mode to reduce shared-memory pressure and cut atomic reduction overhead. The upstream flash-attention repository now presents FlashAttention-4 as optimized for Hopper and Blackwell GPUs, which tells you this is not a one-off internal stunt. It is becoming part of the operating stack.
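
The software-emulated exponential is the most transferable of these ideas, and its shape is familiar from any libm: range reduction plus a short polynomial whose Horner steps are each one fused multiply-add, so the work lands on FMA pipes instead of the special function units. The sketch below uses generic Taylor coefficients, not FlashAttention-4's actual polynomial:

```python
import math

# exp(x) = 2**(x * log2(e)) = 2**k * 2**f with k an integer and f in [0, 1).
# 2**f is then a short polynomial evaluated by Horner's rule, where every
# step (p = p * f + c) maps to one FMA. Coefficients are a plain Taylor
# expansion for illustration, not the kernel's tuned fit.
LOG2E = math.log2(math.e)
LN2 = math.log(2.0)
COEFFS = [LN2**n / math.factorial(n) for n in range(5, -1, -1)]  # high->low

def exp_fma(x):
    t = x * LOG2E
    k = math.floor(t)
    f = t - k                      # range-reduced fraction in [0, 1)
    p = 0.0
    for c in COEFFS:               # each iteration is one FMA on a GPU
        p = p * f + c
    return math.ldexp(p, int(k))   # exact multiply by 2**k

# accuracy check against the libm exponential over [-5, 5]
err = max(abs(exp_fma(x) - math.exp(x)) / math.exp(x)
          for x in [i / 10 for i in range(-50, 51)])
print(f"max relative error: {err:.2e}")
```

Even this untuned degree-5 polynomial lands around 1e-4 relative error, which is why trading MUFU throughput for FMA throughput is a plausible bargain when the FMA units would otherwise sit idle.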

Figure / 01 (editorial diagram): wide tensor-compute lanes on Blackwell narrow into shared-memory and softmax bottlenecks before a tuned attention pipeline restores flow. FlashAttention-4's point is simple: Blackwell made tensor math faster than several neighboring resources, so the performance fight moved into the spaces between the matmuls.

This is the piece many GPU-buying narratives still flatten. The architecture headline is enormous compute. The operating question is whether the rest of the kernel can keep up. If the exponential path or shared-memory traffic stalls the lane, those shiny tensor cores sit there waiting. And idle premium hardware is not a technical embarrassment first. It is a cost problem.

Why this is an inference-economics story

That last point is where FlashAttention-4 becomes more than a kernel story. For inference teams, the unit that hurts is not theoretical peak performance. It is latency, throughput, and utilization at the serving layer. If a better attention kernel lets the same Blackwell fleet push more useful tokens, absorb higher concurrency, or hit latency targets without overprovisioning, then the economics change even if the hardware bill does not.
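
A toy cost model makes the point concrete. Every number in it is an assumption (the post publishes kernel speedups, not serving prices), but the structure holds: with the hardware bill fixed, cost per useful token moves inversely with sustained throughput:

```python
# Toy serving-economics model: the hardware bill is fixed, so cost per
# useful token scales inversely with sustained throughput. All numbers are
# ILLUSTRATIVE assumptions, not vendor pricing or measured serving rates.

GPU_HOUR_USD = 5.00       # assumed fully loaded cost per GPU-hour
BASE_TOK_PER_S = 12_000   # assumed sustained tokens/s per GPU, pre-speedup

def usd_per_million_tokens(tok_per_s, gpu_hour_usd=GPU_HOUR_USD):
    tokens_per_hour = tok_per_s * 3600
    return gpu_hour_usd / tokens_per_hour * 1e6

before = usd_per_million_tokens(BASE_TOK_PER_S)
after = usd_per_million_tokens(BASE_TOK_PER_S * 1.3)  # a 1.3x kernel win

print(f"${before:.3f} -> ${after:.3f} per million tokens "
      f"({1 - after / before:.0%} cheaper at identical hardware spend)")
```

A 1.3x kernel-level speedup, if it survives into production serving, is about a 23% cut in cost per token on the same fleet, which is why a kernel release can read like a pricing event.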

That is why this story belongs beside our earlier look at open-weight inference economics. Hardware choice matters, but the realized cost curve is shaped by stack discipline: scheduler quality, batching, KV-cache handling, library maturity, and now increasingly by kernel teams that know how to squeeze around architectural asymmetries. The same rack can look expensive or efficient depending on how much non-matmul waste survives in the path.

It also fits with the broader infrastructure pattern behind Meta's custom-silicon inference play. The strategic edge is moving lower in the stack. You can no longer assume that buying better accelerators ends the conversation. If Blackwell's tensor cores keep scaling faster than nearby resources, then whoever solves the surrounding bottlenecks gets a second bite at the same capex. That is powerful. It means kernel work can behave like infrastructure leverage, not just like optimization garnish.

Figure / 02 (editorial diagram): model requests flow through a kernel layer that reshapes latency, utilization, and cost before workloads reach Blackwell GPU racks. The hardware line item does not change, but a better kernel can still move latency, utilization, and cost per useful token.

There is an uncomfortable implication here for anyone still reading GPU markets through pure FLOPs headlines. The marginal winner may not be the company with the biggest benchmark chart. It may be the team that can turn awkward hardware asymmetry into higher sustained utilization in production. That team might be an accelerator vendor. It might be a library team. It might be an inference platform vendor. But the locus of value is clearly drifting toward software-hardware co-design, which is why so much of the publication's best work now ends up under the infrastructure desk.

FlashAttention-4 also says something about platform power

Another revealing detail in the blog post is that the authors say they worked with the cuDNN team to incorporate some FlashAttention-4 techniques into cuDNN starting with versions 9.13 and 9.14. That makes the release harder to read as simple open-source one-upmanship. It looks more like a preview of how Blackwell performance will actually be competed over: open kernels, vendor libraries, framework teams, and inference-stack builders all trying to close the same gap from different sides.

That matters for procurement and platform strategy. If the value is moving into co-design, then access to hardware is only the opening move. Real advantage comes from how quickly your software stack can absorb new kernel ideas, how much profiling talent sits near production workloads, and how dependent you are on a single vendor library cadence. Readers of our NVIDIA telecom AI-grid analysis have already seen the rack-scale version of this argument. Availability and interconnect matter. But so does the ability to convert installed hardware into economically useful service.

There is also a caution flag. FlashAttention-4's most dramatic claims are benchmark claims from the authors themselves, and the strongest evidence in the post is on B200 BF16. Operators should want to see more data on real serving traces, shorter sequence lengths, mixed batch shapes, and the speed at which these gains filter into mainstream stacks. The economics story is strongest when the optimization survives ordinary mess, not just a carefully chosen benchmark lane.

What to watch after the excitement fades

The next proof point is not another giant utilization number. It is whether FlashAttention-4 style gains show up broadly enough to change fleet planning, pricing assumptions, or deployment choices for LLM inference performance. If they do, this analysis will age well. If they do not, FlashAttention-4 remains an impressive piece of kernel work with a narrower blast radius.

Either way, the release already clarified something important about the Blackwell era. The easy story says faster tensor cores will decide the market. The harder, more useful story says the market will be shaped by who can keep those tensor cores from waiting on everything around them. FlashAttention-4 is a sharp piece of evidence for the harder story. And in infrastructure, the harder story is usually the one that ends up on the bill.


Public source trail

These links anchor the package to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.

Primary source: Together AI (together.ai)
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Primary source for the FlashAttention-4 bottleneck analysis, benchmark claims on B200, and the forward/backward kernel design changes.

Primary source: GitHub (github.com)
Dao-AILab/flash-attention

Upstream repository showing FlashAttention-4 availability and its stated optimization focus for Hopper and Blackwell GPUs.

Primary source: NVIDIA (nvidia.com)
NVIDIA Blackwell Architecture

Official architecture overview establishing Blackwell's positioning around inference scale, tensor-core advances, and rack-level AI infrastructure.

Primary source: arXiv (arxiv.org)
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

The original FlashAttention paper, useful for showing that the project has always been about memory movement and IO constraints rather than benchmark theater alone.


About the author

Lena Ortiz

Infrastructure Correspondent


Lena tracks the economics and mechanics of AI infrastructure: GPU constraints, serving architecture, open-weight deployment, latency pressure, and cost discipline. Her reporting is aimed at builders deciding what to run, not spectators picking sides.

Published stories: 4
Latest story: Mar 22, 2026
Base: Berlin · Systems desk

Reporting lens: Operating leverage beats ideological posturing. Signature: If the cost curve moves, the product strategy moves with it.

Related reads

More reporting on the same fault line.

Infrastructure/Mar 21, 2026/5 min read

Meta’s custom-silicon sprint is really an inference power play

Meta’s four-chip MTIA roadmap and its 6GW AMD pact point to the same goal: cheaper inference, tighter stack control, and less dependence on one GPU supplier.

Infrastructure/Mar 20, 2026/6 min read

NVIDIA AI grids turn telcos into inference resellers

NVIDIA's AI grid pitch turns telecom networks into distributed inference sellers, but operators still need products developers and buyers will actually use.

Infrastructure/Mar 13, 2026/7 min read

Open-weight model inference economics for lean teams

Open-weight inference economics hinge on utilization, latency, privacy, and operational control, not just sticker price or ideology about self-hosting.
