FlashAttention-4 turns Blackwell kernels into economics
FlashAttention-4 shows Blackwell-era AI economics will be shaped by attention kernel optimization and non-tensor bottlenecks, not FLOPs headlines alone.

When tensor cores scale faster than everything around them, the money moves to the kernel layer.
I think benchmark coverage is going to flatten FlashAttention-4 into the least interesting possible version of itself.
Yes, Together AI's launch post gives the headline numbers everyone wants to tweet: up to 1,605 TFLOPS on Blackwell B200 with BF16, 71% utilization, forward-pass performance up to 1.3x faster than cuDNN 9.13, and as much as 2.7x faster than Triton. Those numbers matter. But the point is not that Blackwell is fast. We already had that memo.
The point is that Blackwell's tensor cores are scaling faster than nearby resources like shared-memory bandwidth and the special function units used for exponentials. Together frames the jump from Hopper H100 to Blackwell B200 as going from 1 to 2.25 PFLOPS of BF16 tensor-core throughput. When one part of the engine gets that much stronger while the surrounding plumbing does not, the bottleneck moves. Fast.
FlashAttention-4 shows where Blackwell actually slows down
That shift is the real story here. Attention is no longer limited mainly by how fast tensor cores can chew through the two GEMMs. More of the fight now happens in the ugly hallway around them: softmax, memory movement, and the parts of the kernel that are not blessed with giant marketing banners.
The original FlashAttention paper already pointed in this direction. Tri Dao and colleagues built the project around IO-awareness, not benchmark theater, and the argument was simple: the useful number is achieved work under real memory constraints, not the shiny peak number on the box. FlashAttention-4 keeps that same instinct and updates it for Blackwell's asymmetry.
The bottleneck moved into the hallway around the matmuls
Together says the forward pass is constrained by exponential throughput while the backward pass is constrained by shared-memory traffic. So FlashAttention-4 overlaps tensor-core work, softmax, and memory operations; it uses a software-emulated exponential path that spreads work across FMA units instead of leaning entirely on hardware MUFU instructions; and it uses Blackwell tensor memory plus 2-CTA MMA mode in the backward pass to reduce shared-memory pressure and cut atomic reduction overhead.
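To make the FMA-based exponential idea concrete, here is a minimal sketch of the general technique only, not FlashAttention-4's actual kernel code: range-reduce the input, evaluate a short polynomial whose steps map onto fused multiply-adds, then rescale by a power of two. The function name, polynomial degree, and coefficients below are illustrative assumptions.

```python
import math

# Degree-6 Taylor coefficients for 2**r = exp(r * ln 2), ordered for Horner's
# rule. Each Horner step is one multiply-add, the shape of work an FMA unit
# executes. Hypothetical illustration, not FlashAttention-4's implementation.
LN2 = math.log(2.0)
COEFFS = [LN2**k / math.factorial(k) for k in range(6, -1, -1)]

def exp_via_fma(x: float) -> float:
    """Approximate exp(x) without calling a hardware exponential unit."""
    y = x / LN2                # exp(x) = 2 ** (x / ln 2)
    n = round(y)               # integer part becomes an exponent shift
    r = y - n                  # remainder stays in roughly [-0.5, 0.5]
    acc = COEFFS[0]
    for c in COEFFS[1:]:
        acc = acc * r + c      # one multiply-add per coefficient
    return math.ldexp(acc, n)  # acc * 2**n

print(exp_via_fma(1.0), math.exp(1.0))  # ~2.71828 vs. 2.718281828...
```

On the GPU, each of those Horner steps maps to a fused multiply-add, which is how a software exponential path can spread the work across FMA units instead of queuing on the special function units Together identifies as the forward-pass constraint.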

That may sound like kernel-insider business, but the economic consequence is simple enough: if the surrounding pipeline stalls, your expensive hardware waits. I keep coming back to the same analogy. Buying Blackwell and then starving it with weak neighboring resources is like buying a race car and driving it through school pickup traffic. The horsepower is real. The line is also real.
The upstream flash-attention repository now positions FlashAttention-4 as optimized for Hopper and Blackwell GPUs, which makes this look less like a one-off lab flex and more like operating-stack infrastructure.
Kernel work changes the bill, not just the chart
This is where the story turns from technical curiosity into economics. If a better attention kernel lets the same Blackwell fleet push more useful tokens, absorb more concurrency, or hit latency goals without overprovisioning, the hardware line item does not change but the business outcome does. That is why the piece belongs next to open-weight inference economics, not just next to benchmark recaps.
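A back-of-the-envelope sketch makes the point. Every number below is a hypothetical placeholder, not a measured figure from Together's post or anyone's bill:

```python
# Hypothetical fleet numbers -- placeholders chosen only to show the shape
# of the calculation, not real prices or throughput.
gpu_hours_per_month = 10_000          # assumed fleet usage
cost_per_gpu_hour = 4.00              # assumed $/GPU-hour
baseline_tokens_per_gpu_hour = 1.0e7  # assumed serving throughput

monthly_cost = gpu_hours_per_month * cost_per_gpu_hour
for kernel_speedup in (1.0, 1.3):     # 1.3x stands in for a kernel-level gain
    tokens = gpu_hours_per_month * baseline_tokens_per_gpu_hour * kernel_speedup
    cost_per_million = monthly_cost / (tokens / 1e6)
    print(f"{kernel_speedup:.1f}x kernel -> ${cost_per_million:.2f} per million tokens")
```

The hardware spend in the numerator never moves; only the token denominator does, which is the entire argument.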

It also matches the broader pattern behind Meta's custom-silicon inference play. The leverage is drifting lower in the stack. Better accelerators still matter, obviously, but the teams that reclaim the non-tensor choke points get a second chance to improve the return on the same capex. That is serious leverage for something people still lazily file under "kernel optimization."
A useful extra tell: Together says some FlashAttention-4 techniques are being incorporated into cuDNN 9.13 and 9.14 with NVIDIA's help. That makes the release harder to dismiss as open-source chest thumping. It looks more like a preview of how Blackwell performance will actually be contested: by kernel teams, vendor libraries, framework maintainers, and inference vendors all crowding into the same narrow performance gap.
My take on the Blackwell economics race
I would still keep a cold eye on the biggest claims. These are author-reported benchmarks, with the strongest evidence on B200 BF16. Operators should want more data on mixed batches, shorter contexts, real serving traces, and the speed at which gains reach ordinary stacks. Otherwise this can age into a very impressive lab result with a narrower production footprint.
But I do think FlashAttention-4 clarifies something important. In the Blackwell era, the money is moving toward whoever keeps the tensor cores from standing around with their hands in their pockets. That is not as glamorous as a giant FLOPs headline. It is much closer to the bill.
Public source trail
These links anchor the package to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.
- Primary source for the FlashAttention-4 bottleneck analysis, benchmark claims on B200, and the forward/backward kernel design changes.
- Upstream repository showing FlashAttention-4 availability and its stated optimization focus for Hopper and Blackwell GPUs.
- Official architecture overview establishing Blackwell's positioning around inference scale, tensor-core advances, and rack-level AI infrastructure.
- The original FlashAttention paper, useful for showing that the project has always been about memory movement and IO constraints rather than benchmark theater alone.

About the author
Lena Ortiz
Lena tracks the economics and mechanics behind AI systems, from serving architecture and open-weight deployment to developer tooling, platform shifts, product decisions, and the operational tradeoffs that shape what teams actually run. Her reporting is aimed at builders and operators deciding what to trust, adopt, and maintain.
- Apr 10, 2026
- Berlin
Archive signal
Reporting lens: Operating leverage beats ideological posturing. Signature: If the cost curve moves, the product strategy moves with it.
Article details
- Category
- AI Infrastructure
- Last updated
- April 11, 2026
- Lead illustration
- The loud number is throughput. The strategic story is who can turn Blackwell's non-tensor choke points back into useful work.
- Public sources
- 4 linked source notes
Byline

Covers the economics, tooling, and operating realities that shape how AI gets built, shipped, and run.



