FlashAttention-4 makes Blackwell kernel work an economics story
FlashAttention-4 shows Blackwell-era AI economics will be shaped by attention kernel optimization and non-tensor bottlenecks, not FLOPs headlines alone.
When tensor cores scale faster than everything around them, the money moves to the kernel layer.

Benchmark headlines are going to miss the point on FlashAttention-4.
The loud claim from Together AI's launch post is easy to repeat: on Blackwell B200 with BF16, FlashAttention-4 reaches up to 1,605 TFLOPS, or 71% utilization, with forward-pass performance up to 1.3x faster than cuDNN 9.13 and as much as 2.7x faster than Triton. Those are real numbers, and they matter. But if you stop there, you reduce the story to benchmark flex.
The more interesting line in the post is the one most launch coverage will rush past. The authors argue that from Hopper H100 to Blackwell B200, BF16 tensor-core throughput jumps from 1.0 to 2.25 PFLOPS, while shared-memory bandwidth and the number of special function units for operations like exponentials do not scale in step. That changes the shape of the problem. Attention is no longer governed mainly by how fast tensor cores can chew through the two GEMMs. More of the fight now sits in the plumbing around them.
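The asymmetry is easiest to feel as back-of-the-envelope arithmetic. The tensor-core peaks below are the figures cited in the launch post; holding special-function throughput and shared-memory bandwidth flat across generations is a simplifying assumption for illustration, not a measured claim:

```python
# Back-of-the-envelope only. Tensor-core peaks are the figures cited in
# the launch post; a flat SFU/SMEM baseline is an illustrative assumption.
h100_bf16_pflops = 1.00
b200_bf16_pflops = 2.25

tensor_scaling = b200_bf16_pflops / h100_bf16_pflops  # 2.25x more matmul rate

# If exp throughput and shared-memory bandwidth scale by ~1x, each
# exponential and each byte of SMEM traffic must now amortize 2.25x more
# FLOPs: the non-tensor budget per matmul FLOP shrinks to ~44% of Hopper's.
relative_budget = 1.0 / tensor_scaling
print(f"tensor throughput scaling: {tensor_scaling:.2f}x")
print(f"non-tensor budget per FLOP: {relative_budget:.0%} of Hopper's")
```

In other words, even if nothing else degraded, the softmax exponentials and shared-memory traffic got relatively twice as expensive in one generation.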
That is why FlashAttention-4 matters. Not because it proves Blackwell is fast. Everyone already expected Blackwell to be fast. It matters because it shows Blackwell-era AI economics will be shaped by teams that can reclaim the non-tensor bottlenecks that raw hardware marketing leaves behind.
FlashAttention was always about the memory bill
The original FlashAttention paper in 2022 made its case through IO-awareness. Tri Dao and colleagues argued that ordinary attention wastes time and memory shuttling data between high-bandwidth memory and on-chip SRAM, and that better tiling could reduce those reads and writes without approximating the operation away. That was already a clue about where the project was heading. FlashAttention was never just a faster kernel. It was a reminder that the useful unit of performance is not advertised FLOPs but achieved work under real memory constraints.
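The tiling idea is easiest to see in code. Below is a minimal NumPy sketch of attention with an online softmax in the spirit of the original FlashAttention paper: only one tile of scores is ever materialized, and running row-max and row-sum statistics replace the full score matrix. The function name and tile size are illustrative, and this is written for clarity, not performance:

```python
import numpy as np

def flash_attention_forward(Q, K, V, tile=64):
    """Attention with an online softmax over key/value tiles.

    Clarity-first sketch of the FlashAttention idea: keep only a tile of
    scores live at a time, and maintain running row-max and row-sum
    statistics so the softmax never needs the full score matrix.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=float)   # unnormalized output accumulator
    m = np.full(n, -np.inf)             # running row maximum of scores
    l = np.zeros(n)                     # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated row maximum
        P = np.exp(S - m_new[:, None])         # tile-local exponentials
        corr = np.exp(m - m_new)               # rescale older contributions
        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]               # final normalization
```

The point of the structure is that HBM traffic scales with the tiles you stream, not with the full N-by-N score matrix, which is exactly the IO argument the 2022 paper made.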
FlashAttention-4 keeps the same instinct and updates it for Blackwell. In Together's telling, the forward pass is constrained by exponential throughput, while the backward pass is constrained by shared-memory traffic. So the kernel does not simply accelerate matrix math harder. It tries to overlap matrix math with those other bottlenecks, route more work through otherwise underused resources, and keep intermediate state closer to where the next stage needs it.
The concrete tricks matter here. The post describes a forward pipeline that overlaps tensor-core work, softmax, and memory operations; a software-emulated exponential path that spreads work across FMA units instead of leaning entirely on hardware MUFU instructions; and a backward pass that uses Blackwell tensor memory and 2-CTA MMA mode to reduce shared-memory pressure and cut atomic reduction overhead. The upstream flash-attention repository now presents FlashAttention-4 as optimized for Hopper and Blackwell GPUs, which tells you this is not a one-off internal stunt. It is becoming part of the operating stack.
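The software-emulated exponential deserves a sketch of its own. The snippet below shows the standard range-reduction-plus-polynomial recipe: reduce the argument, then evaluate a short polynomial as a chain of fused multiply-adds. It illustrates the general technique of moving exp onto FMA-shaped work; the actual FlashAttention-4 emulation and its constants are not published in this form, and every number here is a textbook value:

```python
import numpy as np

def exp_fma_sketch(x):
    """Software exp via range reduction plus a Horner polynomial.

    exp(x) = 2**n * exp(r) with |r| <= ln(2)/2, so a short polynomial in
    r suffices, and each Horner step maps to one fused multiply-add.
    General technique only, not Together's kernel code.
    """
    LOG2E = 1.4426950408889634
    LN2 = 0.6931471805599453
    n = np.round(x * LOG2E)          # integer part: exp(x) = 2**n * exp(r)
    r = x - n * LN2                  # reduced argument, |r| <= ln(2)/2
    # Degree-5 Taylor polynomial for exp(r), one FMA per Horner step
    p = 1.0 / 120.0
    for c in (1.0 / 24.0, 1.0 / 6.0, 0.5, 1.0, 1.0):
        p = p * r + c
    return np.ldexp(p, n.astype(np.int64))   # exact scaling by 2**n
```

On hardware, each of those Horner steps lands on the plentiful FMA pipes instead of queueing behind the scarce MUFU units, which is the whole trade.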

This is the piece many GPU-buying narratives still flatten. The architecture headline is enormous compute. The operating question is whether the rest of the kernel can keep up. If the exponential path or shared-memory traffic stalls the lane, those shiny tensor cores sit there waiting. And idle premium hardware is not a technical embarrassment first. It is a cost problem.
Why this is an inference-economics story
That last point is where FlashAttention-4 becomes more than a kernel story. For inference teams, the unit that hurts is not theoretical peak performance. It is latency, throughput, and utilization at the serving layer. If a better attention kernel lets the same Blackwell fleet push more useful tokens, absorb higher concurrency, or hit latency targets without overprovisioning, then the economics change even if the hardware bill does not.
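A rough Amdahl-style calculation shows how a kernel speedup translates into serving economics. The 1.3x figure is the post's forward-pass claim; the 40% attention share of end-to-end serving time is an assumed, workload-dependent number used purely for illustration:

```python
def effective_speedup(kernel_speedup, attention_fraction):
    """Amdahl-style uplift: only the attention share of serving time
    gets faster; everything else is unchanged. Inputs are illustrative.
    """
    return 1.0 / ((1.0 - attention_fraction)
                  + attention_fraction / kernel_speedup)

# 1.3x is the post's forward-pass claim; 40% attention share is an
# assumption that varies heavily with sequence length and batch shape.
uplift = effective_speedup(1.3, 0.40)
print(f"end-to-end throughput uplift: {uplift:.3f}x")
print(f"cost per token relative to baseline: {1.0 / uplift:.1%}")
```

Under those assumptions the fleet serves about 10% more tokens per dollar, which is exactly the kind of margin that shows up in capacity planning rather than benchmark charts.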
That is why this story belongs beside our earlier look at open-weight inference economics. Hardware choice matters, but the realized cost curve is shaped by stack discipline: scheduler quality, batching, KV-cache handling, library maturity, and now increasingly by kernel teams that know how to squeeze around architectural asymmetries. The same rack can look expensive or efficient depending on how much non-matmul waste survives in the path.
It also fits with the broader infrastructure pattern behind Meta's custom-silicon inference play. The strategic edge is moving lower in the stack. You can no longer assume that buying better accelerators ends the conversation. If Blackwell's tensor cores keep scaling faster than nearby resources, then whoever solves the surrounding bottlenecks gets a second bite at the same capex. That is powerful. It means kernel work can behave like infrastructure leverage, not just like optimization garnish.

There is an uncomfortable implication here for anyone still reading GPU markets through pure FLOPs headlines. The marginal winner may not be the company with the biggest benchmark chart. It may be the team that can turn awkward hardware asymmetry into higher sustained utilization in production. That team might be an accelerator vendor. It might be a library team. It might be an inference platform vendor. But the locus of value is clearly drifting toward software-hardware co-design, which is why so much of the publication's best work now ends up under the infrastructure desk.
FlashAttention-4 also says something about platform power
Another revealing detail in the blog post is that the authors say they worked with the cuDNN team to incorporate some FlashAttention-4 techniques into cuDNN starting with versions 9.13 and 9.14. That makes the release harder to read as simple open-source one-upmanship. It looks more like a preview of how Blackwell performance will actually be competed over: open kernels, vendor libraries, framework teams, and inference-stack builders all trying to close the same gap from different sides.
That matters for procurement and platform strategy. If the value is moving into co-design, then access to hardware is only the opening move. Real advantage comes from how quickly your software stack can absorb new kernel ideas, how much profiling talent sits near production workloads, and how dependent you are on a single vendor library cadence. Readers of our NVIDIA telecom AI-grid analysis have already seen the rack-scale version of this argument. Availability and interconnect matter. But so does the ability to convert installed hardware into economically useful service.
There is also a caution flag. FlashAttention-4's most dramatic claims are benchmark claims from the authors themselves, and the strongest evidence in the post is on B200 BF16. Operators should want to see more data on real serving traces, shorter sequence lengths, mixed batch shapes, and the speed at which these gains filter into mainstream stacks. The economics story is strongest when the optimization survives ordinary mess, not just a carefully chosen benchmark lane.
What to watch after the excitement fades
The next proof point is not another giant utilization number. It is whether FlashAttention-4 style gains show up broadly enough to change fleet planning, pricing assumptions, or deployment choices for LLM inference performance. If they do, the article will age well. If they do not, then it remains an impressive kernel paper with a narrower blast radius.
Either way, the release already clarified something important about the Blackwell era. The easy story says faster tensor cores will decide the market. The harder, more useful story says the market will be shaped by who can keep those tensor cores from waiting on everything around them. FlashAttention-4 is a sharp piece of evidence for the harder story. And in infrastructure, the harder story is usually the one that ends up on the bill.
Public source trail
These links anchor the package to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.
Primary source for the FlashAttention-4 bottleneck analysis, benchmark claims on B200, and the forward/backward kernel design changes.
Upstream repository showing FlashAttention-4 availability and its stated optimization focus for Hopper and Blackwell GPUs.
Official architecture overview establishing Blackwell's positioning around inference scale, tensor-core advances, and rack-level AI infrastructure.
The original FlashAttention paper, useful for showing that the project has always been about memory movement and IO constraints rather than benchmark theater alone.

Lena Ortiz
Lena tracks the economics and mechanics of AI infrastructure: GPU constraints, serving architecture, open-weight deployment, latency pressure, and cost discipline. Her reporting is aimed at builders deciding what to run, not spectators picking sides.
- Published stories: 4
- Latest story: Mar 22, 2026
- Base: Berlin · Systems desk
Reporting lens: Operating leverage beats ideological posturing. Signature: If the cost curve moves, the product strategy moves with it.


