FlashAttention-4 makes Blackwell kernel work an economics story
FlashAttention-4 shows Blackwell-era AI economics will be shaped by attention kernel optimization and non-tensor bottlenecks, not FLOPs headlines alone.
When tensor cores scale faster than everything around them, the money moves to the kernel layer.

Benchmark headlines are going to miss the point on FlashAttention-4.
The loud claim from Together AI's launch post is easy to repeat: on Blackwell B200 with BF16, FlashAttention-4 reaches up to 1,605 TFLOPS, or 71% utilization, with forward-pass performance up to 1.3x faster than cuDNN 9.13 and as much as 2.7x faster than Triton. Those are real numbers, and they matter. But if you stop there, you reduce the story to benchmark flex.
The more interesting line in the post is the one most launch coverage will rush past. The authors argue that from Hopper H100 to Blackwell B200, BF16 tensor-core throughput jumps from 1.0 to 2.25 PFLOPS, while shared-memory bandwidth and the number of special function units for operations like exponentials do not scale in step. That changes the shape of the problem. Attention is no longer governed mainly by how fast tensor cores can chew through the two GEMMs. More of the fight now sits in the plumbing around them.
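The asymmetry is easiest to feel as back-of-the-envelope arithmetic. The tensor-core peaks below are the figures cited in the launch post; holding special-function throughput and shared-memory bandwidth flat across generations is a simplifying assumption for illustration, not a measured claim:

```python
# Back-of-the-envelope only. Tensor-core peaks are the figures cited in
# the launch post; a flat SFU/SMEM baseline is an illustrative assumption.
h100_bf16_pflops = 1.00
b200_bf16_pflops = 2.25

tensor_scaling = b200_bf16_pflops / h100_bf16_pflops  # 2.25x more matmul rate

# If exp throughput and shared-memory bandwidth scale by ~1x, each
# exponential and each byte of SMEM traffic must now amortize 2.25x more
# FLOPs: the non-tensor budget per matmul FLOP shrinks to ~44% of Hopper's.
relative_budget = 1.0 / tensor_scaling
print(f"tensor throughput scaling: {tensor_scaling:.2f}x")
print(f"non-tensor budget per FLOP: {relative_budget:.0%} of Hopper's")
```

In other words, even if nothing else degraded, the softmax exponentials and shared-memory traffic got relatively twice as expensive in one generation.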
That is why FlashAttention-4 matters. Not because it proves Blackwell is fast. Everyone already expected Blackwell to be fast. It matters because it shows Blackwell-era AI economics will be shaped by teams that can reclaim the non-tensor bottlenecks that raw hardware marketing leaves behind.
FlashAttention was always about the memory bill
The original FlashAttention paper in 2022 made its case through IO-awareness. Tri Dao and colleagues argued that ordinary attention wastes time and memory shuttling data between high-bandwidth memory and on-chip SRAM, and that better tiling could reduce those reads and writes without approximating the operation away. That was already a clue about where the project was heading. FlashAttention was never just a faster kernel. It was a reminder that the useful unit of performance is not advertised FLOPs but achieved work under real memory constraints.
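The tiling idea is easiest to see in code. Below is a minimal NumPy sketch of attention with an online softmax in the spirit of the original FlashAttention paper: only one tile of scores is ever materialized, and running row-max and row-sum statistics replace the full score matrix. The function name and tile size are illustrative, and this is written for clarity, not performance:

```python
import numpy as np

def flash_attention_forward(Q, K, V, tile=64):
    """Attention with an online softmax over key/value tiles.

    Clarity-first sketch of the FlashAttention idea: keep only a tile of
    scores live at a time, and maintain running row-max and row-sum
    statistics so the softmax never needs the full score matrix.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q, dtype=float)   # unnormalized output accumulator
    m = np.full(n, -np.inf)             # running row maximum of scores
    l = np.zeros(n)                     # running softmax denominator
    for j in range(0, K.shape[0], tile):
        Kj, Vj = K[j:j + tile], V[j:j + tile]
        S = (Q @ Kj.T) * scale                 # scores for this tile only
        m_new = np.maximum(m, S.max(axis=1))   # updated row maximum
        P = np.exp(S - m_new[:, None])         # tile-local exponentials
        corr = np.exp(m - m_new)               # rescale older contributions
        l = l * corr + P.sum(axis=1)
        O = O * corr[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]               # final normalization
```

The point of the structure is that HBM traffic scales with the tiles you stream, not with the full N-by-N score matrix, which is exactly the IO argument the 2022 paper made.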
FlashAttention-4 keeps the same instinct and updates it for Blackwell. In Together's telling, the forward pass is constrained by exponential throughput, while the backward pass is constrained by shared-memory traffic. So the kernel does not simply accelerate matrix math harder. It tries to overlap matrix math with those other bottlenecks, route more work through otherwise underused resources, and keep intermediate state closer to where the next stage needs it.
The concrete tricks matter here. The post describes a forward pipeline that overlaps tensor-core work, softmax, and memory operations; a software-emulated exponential path that spreads work across FMA units instead of leaning entirely on hardware MUFU instructions; and a backward pass that uses Blackwell tensor memory and 2-CTA MMA mode to reduce shared-memory pressure and cut atomic reduction overhead. The upstream flash-attention repository now presents FlashAttention-4 as optimized for Hopper and Blackwell GPUs, which tells you this is not a one-off internal stunt. It is becoming part of the operating stack.
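The software-emulated exponential deserves a sketch of its own. The snippet below shows the standard range-reduction-plus-polynomial recipe: reduce the argument, then evaluate a short polynomial as a chain of fused multiply-adds. It illustrates the general technique of moving exp onto FMA-shaped work; the actual FlashAttention-4 emulation and its constants are not published in this form, and every number here is a textbook value:

```python
import numpy as np

def exp_fma_sketch(x):
    """Software exp via range reduction plus a Horner polynomial.

    exp(x) = 2**n * exp(r) with |r| <= ln(2)/2, so a short polynomial in
    r suffices, and each Horner step maps to one fused multiply-add.
    General technique only, not Together's kernel code.
    """
    LOG2E = 1.4426950408889634
    LN2 = 0.6931471805599453
    n = np.round(x * LOG2E)          # integer part: exp(x) = 2**n * exp(r)
    r = x - n * LN2                  # reduced argument, |r| <= ln(2)/2
    # Degree-5 Taylor polynomial for exp(r), one FMA per Horner step
    p = 1.0 / 120.0
    for c in (1.0 / 24.0, 1.0 / 6.0, 0.5, 1.0, 1.0):
        p = p * r + c
    return np.ldexp(p, n.astype(np.int64))   # exact scaling by 2**n
```

On hardware, each of those Horner steps lands on the plentiful FMA pipes instead of queueing behind the scarce MUFU units, which is the whole trade.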

This is the piece many GPU-buying narratives still flatten. The architecture headline is enormous compute. The operating question is whether the rest of the kernel can keep up. If the exponential path or shared-memory traffic stalls the lane, those shiny tensor cores sit there waiting. And idle premium hardware is not a technical embarrassment first. It is a cost problem.
Why this is an inference-economics story
That last point is where FlashAttention-4 becomes more than a kernel story. For inference teams, the unit that hurts is not theoretical peak performance. It is latency, throughput, and utilization at the serving layer. If a better attention kernel lets the same Blackwell fleet push more useful tokens, absorb higher concurrency, or hit latency targets without overprovisioning, then the economics change even if the hardware bill does not.
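A rough Amdahl-style calculation shows how a kernel speedup translates into serving economics. The 1.3x figure is the post's forward-pass claim; the 40% attention share of end-to-end serving time is an assumed, workload-dependent number used purely for illustration:

```python
def effective_speedup(kernel_speedup, attention_fraction):
    """Amdahl-style uplift: only the attention share of serving time
    gets faster; everything else is unchanged. Inputs are illustrative.
    """
    return 1.0 / ((1.0 - attention_fraction)
                  + attention_fraction / kernel_speedup)

# 1.3x is the post's forward-pass claim; 40% attention share is an
# assumption that varies heavily with sequence length and batch shape.
uplift = effective_speedup(1.3, 0.40)
print(f"end-to-end throughput uplift: {uplift:.3f}x")
print(f"cost per token relative to baseline: {1.0 / uplift:.1%}")
```

Under those assumptions the fleet serves about 10% more tokens per dollar, which is exactly the kind of margin that shows up in capacity planning rather than benchmark charts.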
That is why this story belongs beside our earlier look at open-weight inference economics. Hardware choice matters, but the realized cost curve is shaped by stack discipline: scheduler quality, batching, KV-cache handling, library maturity, and now increasingly by kernel teams that know how to squeeze around architectural asymmetries. The same rack can look expensive or efficient depending on how much non-matmul waste survives in the path.
It also fits with the broader infrastructure pattern behind Meta's custom-silicon inference play. The strategic edge is moving lower in the stack. You can no longer assume that buying better accelerators ends the conversation. If Blackwell's tensor cores keep scaling faster than nearby resources, then whoever solves the surrounding bottlenecks gets a second bite at the same capex. That is powerful. It means kernel work can behave like infrastructure leverage, not just like optimization garnish.

There is an uncomfortable implication here for anyone still reading GPU markets through pure FLOPs headlines. The marginal winner may not be the company with the biggest benchmark chart. It may be the team that can turn awkward hardware asymmetry into higher sustained utilization in production. That team might be an accelerator vendor. It might be a library team. It might be an inference platform vendor. But the locus of value is clearly drifting toward software-hardware co-design, which is why so much of the publication's best work now ends up under the infrastructure desk.
FlashAttention-4 also says something about platform power
Another revealing detail in the blog post is that the authors say they worked with the cuDNN team to incorporate some FlashAttention-4 techniques into cuDNN starting with versions 9.13 and 9.14. That makes the release harder to read as simple open-source one-upmanship. It looks more like a preview of how Blackwell performance will actually be competed over: open kernels, vendor libraries, framework teams, and inference-stack builders all trying to close the same gap from different sides.
That matters for procurement and platform strategy. If the value is moving into co-design, then access to hardware is only the opening move. Real advantage comes from how quickly your software stack can absorb new kernel ideas, how much profiling talent sits near production workloads, and how dependent you are on a single vendor library cadence. Readers of our NVIDIA telecom AI-grid analysis have already seen the rack-scale version of this argument. Availability and interconnect matter. But so does the ability to convert installed hardware into economically useful service.
There is also a caution flag. FlashAttention-4's most dramatic claims are benchmark claims from the authors themselves, and the strongest evidence in the post is on B200 BF16. Operators should want to see more data on real serving traces, shorter sequence lengths, mixed batch shapes, and the speed at which these gains filter into mainstream stacks. The economics story is strongest when the optimization survives ordinary mess, not just a carefully chosen benchmark lane.
What to watch after the excitement fades
The next proof point is not another giant utilization number. It is whether FlashAttention-4 style gains show up broadly enough to change fleet planning, pricing assumptions, or deployment choices for LLM inference performance. If they do, the article will age well. If they do not, then it remains an impressive kernel paper with a narrower blast radius.
Either way, the release already clarified something important about the Blackwell era. The easy story says faster tensor cores will decide the market. The harder, more useful story says the market will be shaped by who can keep those tensor cores from waiting on everything around them. FlashAttention-4 is a sharp piece of evidence for the harder story. And in infrastructure, the harder story is usually the one that ends up on the bill.
Public source trail
These links anchor the package to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.
Primary source for the FlashAttention-4 bottleneck analysis, benchmark claims on B200, and the forward/backward kernel design changes.
Upstream repository showing FlashAttention-4 availability and its stated optimization focus for Hopper and Blackwell GPUs.
Official architecture overview establishing Blackwell's positioning around inference scale, tensor-core advances, and rack-level AI infrastructure.
The original FlashAttention paper, useful for showing that the project has always been about memory movement and IO constraints rather than benchmark theater alone.

Lena Ortiz
Lena tracks the economics and mechanics of AI infrastructure: GPU constraints, serving architecture, open-weight deployment, latency pressure, and cost discipline. Her reporting is aimed at builders deciding what to run, not spectators picking sides.
- Published stories: 4
- Latest story: Mar 22, 2026
- Base: Berlin · Systems desk
Reporting lens: Operating leverage beats ideological posturing. Signature: If the cost curve moves, the product strategy moves with it.


