

Published March 24, 2026

NVIDIA Dynamo is the orchestration layer above vLLM, not another inference server

NVIDIA Dynamo matters because it sits above vLLM, SGLang, and TensorRT-LLM to coordinate routing, KV reuse, disaggregated serving, and scaling across GPU fleets.

Lena Ortiz · Staff Writer · 8 min read
NVIDIA Dynamo is most interesting when you stop reading it as a faster server and start reading it as the control layer above the servers.
Lead illustration: a distributed inference control layer sitting above multiple model-serving engines, routing requests and KV cache between GPU pools.

The pitch is not "here is one more model server." It is "here is the layer that coordinates the servers you already use."

NVIDIA Dynamo 1.0 is easy to misread if you only skim the launch posts. At first glance it looks like one more inference framework entering a crowded aisle already stocked with vLLM, SGLang, TensorRT-LLM, Triton history, and enough “up to X faster” slides to make any operator suspicious on sight.

The more useful reading is different. Dynamo is not mainly trying to win as the model server. NVIDIA's own GitHub README says it plainly: Dynamo is “the orchestration layer above inference engines.” That one sentence does more explanatory work than most of the keynote framing. It tells you where the product sits in the stack and, just as importantly, where it does not sit.

If you are serving a single model on a single GPU, Dynamo is probably not the story. If you are trying to run large models across multiple GPUs or nodes, reuse KV cache intelligently, separate prefill from decode, and keep latency from turning into a public embarrassment when traffic gets weird, then Dynamo starts to look much more interesting.

What NVIDIA Dynamo actually is

The cleanest docs-backed description comes from NVIDIA's architecture docs. They split Dynamo into three cooperating planes: a request plane for request and response execution, a control plane for scaling and desired-state management, and a storage and events plane for KV state visibility, transfer, and reuse. That is not the anatomy of a simple inference server. It is the anatomy of a coordination system.

In that design, the backend engines still do the model-serving work. The frontend accepts traffic, the router decides where requests should go, prefill workers build prompt state, decode workers generate tokens, and KV events keep the cluster aware of reusable context. Above that, the planner reacts to load and latency targets. Alongside it, KVBM and NIXL deal with the ugly but important business of moving cache state around without wasting expensive GPU memory.

Figure / 01 — Dynamo above vLLM, SGLang, and TensorRT-LLM, with request routing, planner logic, and separate prefill and decode GPU pools. The useful mental model is layered: backend engines still do token generation, while Dynamo tries to coordinate traffic, placement, and scaling above them.

That layered design is why the “NVIDIA Dynamo vs vLLM” framing is slightly wrong. vLLM is a serving engine. Dynamo can sit above vLLM, SGLang, or TensorRT-LLM and try to make a whole fleet behave like one coordinated inference system. NVIDIA's developer page is explicit about that backend support, and the repo's feature matrix keeps reinforcing the same point. The comparison is not server versus server so much as engine versus orchestration layer.

That stack shift matters because a lot of inference pain now lives above the kernel and below the product. We have already seen pieces of that story in our look at the vLLM gRPC and rendering shift and in the economics argument around open-weight inference operations. Once teams stop running toy loads, the hard part becomes coordination: where the request lands, whether you can reuse the cache, which GPU pool is overloaded, and how you stop one traffic spike from wrecking everyone else's latency.

Why this sits above the serving engine layer

NVIDIA's launch blog spends a lot of time on disaggregated serving, and for good reason. Prefill and decode are different jobs. Prefill is compute-bound. Decode is memory-bound. Forcing both phases onto the same GPU or node is convenient, but it is also a good way to leave performance on the table once sequence lengths and concurrency climb.

Dynamo's answer is to split those phases into separate pools that can scale independently, then add a planner that decides when disaggregation helps and when it does not. That planner piece matters more than the marketing copy admits. Disaggregated serving is not a magic spell. It adds transfer costs and coordination overhead. The real problem is deciding when the split improves throughput and SLOs enough to be worth it.
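One way to see the planner's tradeoff is as a toy cost comparison: disaggregation wins only when the pipelining gain from separating the phases outweighs the KV handoff that lands on the critical path. Nothing below is Dynamo's actual planner logic; the function name and the millisecond figures are invented purely for illustration.

```python
def disaggregation_wins(prefill_ms: float, decode_ms: float,
                        kv_transfer_ms: float) -> bool:
    """Toy planner rule: compare time-per-request on one colocated pool
    versus separate prefill and decode pools with a KV handoff.

    Colocated: each GPU runs prefill then decode serially.
    Disaggregated: phases overlap across pools, but every request pays
    to ship its KV cache from the prefill pool to the decode pool.
    """
    colocated = prefill_ms + decode_ms
    # With pipelining, steady-state cost is bounded by the slower phase,
    # plus the transfer that now sits on the critical path.
    disaggregated = max(prefill_ms, decode_ms) + kv_transfer_ms
    return disaggregated < colocated

# Long prompts: the split helps despite the transfer cost.
disaggregation_wins(prefill_ms=800, decode_ms=600, kv_transfer_ms=120)  # True
# Short prompts: the handoff eats the gain.
disaggregation_wins(prefill_ms=50, decode_ms=60, kv_transfer_ms=60)     # False
```

The real decision is messier (queueing, batch shapes, SLO targets), but the shape of the question is the same: the split is a bet that phase imbalance is larger than coordination overhead.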

That is why the docs describe a control path devoted to capacity alignment, scaling targets, and placement. In other words: Dynamo is not just helping tokens come out faster. It is trying to decide how the system should be shaped while requests are arriving.

If that sounds a bit like cluster scheduling wearing an inference badge, well, yes. That is the point. NVIDIA even includes Grove for topology-aware Kubernetes deployment because once you start treating inference as a multi-component, multi-node workload, plain old “launch the server and hope” stops scaling.

KV-aware routing is the clearest proof of the thesis

The strongest argument for Dynamo as an orchestration layer is KV-aware routing. This is where the system starts to look less like a faster endpoint and more like traffic control for expensive context.

NVIDIA's docs and launch materials describe the router as tracking KV overlap across workers, then routing new requests based on both cache overlap and load. The goal is obvious: avoid paying the prefill bill again when a worker already has useful context. In long-context and agentic workloads, that can be a very real saving.

Baseten's case study is the best independent support in the source pack, even with the usual benchmark caveats. Baseten says it saw a 50% reduction in average time to first token and a 34% reduction in time per output token on one Qwen3 Coder workload when KV-aware routing was enabled, plus lower tail latency on shadowed production traffic. That does not validate every NVIDIA slide. It does validate the broader idea that routing based on cache overlap can matter a lot in the right workload shape.
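The routing intuition is easy to sketch: score each worker by how much reusable prefix it already holds, discounted by how busy it is. Everything below is hypothetical — the block hashes, the `pick_worker` helper, and the load penalty weight are invented; Dynamo's real router learns overlap from KV events emitted by the engines — but the overlap-minus-load idea looks roughly like this.

```python
def pick_worker(prompt_blocks: list[str], workers: dict[str, dict]) -> str:
    """Toy KV-aware router: prefer the worker whose cache already holds
    the longest matching prefix of the prompt, discounted by load.

    `workers` maps worker name -> {"cached": block hashes in prefix
    order, "load": normalized queue depth in [0, 1]}.
    """
    def prefix_overlap(cached: list[str]) -> int:
        n = 0
        for a, b in zip(prompt_blocks, cached):
            if a != b:
                break
            n += 1
        return n

    def score(name: str) -> float:
        w = workers[name]
        # Each reused block saves prefill work; the load term shifts
        # traffic away from busy workers even when their caches match.
        return prefix_overlap(w["cached"]) - 4.0 * w["load"]

    return max(workers, key=score)

workers = {
    "gpu-a": {"cached": ["sys", "doc1", "doc2"], "load": 0.2},  # warm cache
    "gpu-b": {"cached": [], "load": 0.0},                       # idle, cold
}
pick_worker(["sys", "doc1", "doc2", "q7"], workers)  # -> "gpu-a"
```

The interesting failure mode falls out of the same math: if the warm worker is saturated, the load penalty eventually beats the overlap bonus and traffic spills to cold workers, trading a repeated prefill for better tail latency.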

Figure / 02 — A request routed to the worker with the best KV cache overlap, with older cache blocks spilling from GPU memory to cheaper storage tiers. KV-aware routing and multi-tier cache management are where Dynamo starts to look less like a server and more like inference traffic control.

This is also where Dynamo's multi-tier KV story becomes more than a footnote. The docs describe KVBM as a manager for reuse, eviction, and offload across memory tiers, while the developer page frames it as a way to move older or colder KV state out of scarce GPU memory into CPU memory, SSD, or remote storage. Again, that is not basic model serving. It is memory orchestration for serving clusters.
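The tiering idea itself is simple to sketch: instead of dropping cold KV blocks when GPU memory fills, spill them to a cheaper tier so a later hit avoids recomputing prefill. The class below is a minimal toy assuming plain LRU spill from a "GPU" tier to a "CPU" tier; KVBM manages real device memory and storage backends, not Python dicts, so treat this purely as an illustration of the reuse-versus-recompute tradeoff.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy multi-tier KV block manager. Hot blocks live in a small
    "GPU" tier; least-recently-used blocks spill to a larger "CPU"
    tier instead of being dropped, and get promoted back on reuse.
    """
    def __init__(self, gpu_capacity: int):
        self.gpu = OrderedDict()   # block_id -> kv data, in LRU order
        self.cpu = {}              # overflow tier
        self.gpu_capacity = gpu_capacity

    def put(self, block_id, kv):
        self.gpu[block_id] = kv
        self.gpu.move_to_end(block_id)               # mark as hottest
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)  # evict coldest
            self.cpu[victim] = data                      # offload, not drop

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.cpu:
            # Promote on reuse: cheaper than re-running prefill.
            self.put(block_id, self.cpu.pop(block_id))
            return self.gpu[block_id]
        return None                # true miss: prefill must recompute
```

The economics live in that last line: a hit in any tier replaces a prefill pass, which is exactly why moving cold blocks to CPU memory or storage can beat evicting them outright for long-context and agentic workloads.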

What the official claims do and do not prove

NVIDIA's official pages are full of huge numbers: up to 30x in one launch blog, up to 7x in the 1.0 production announcement, and even larger Blackwell-era figures elsewhere. Those claims belong in the article because they are part of how NVIDIA is positioning Dynamo, but they should be read as vendor framing, not settled fact.

What the docs-backed material proves is narrower and more useful. It shows that Dynamo is architected around separate request, control, and state paths; that it supports vLLM, SGLang, and TensorRT-LLM; that it is built for disaggregated serving, KV-aware routing, and multi-tier cache management; and that NVIDIA is treating failure handling, load shedding, and topology-aware placement as first-class concerns.

That alone is enough to make Dynamo important. It means NVIDIA is trying to move the competitive frontier upward. Not just faster kernels, not just better batching, not just another server wrapper, but a higher layer that can coordinate engines and infrastructure together. That is consistent with the broader Blackwell-era story we touched on in our piece on FlashAttention-4 and Blackwell kernel economics: the stack is becoming more vertically optimized, and software control layers are now part of the performance argument.

Why this matters for the rest of the inference ecosystem

If Dynamo works, the strategic consequence is bigger than one NVIDIA launch. It suggests that the model-serving engine becomes only one layer in a larger inference operating surface. vLLM, SGLang, and TensorRT-LLM still matter, but more of the operational value may move into the system that coordinates them across a fleet.

That has two implications. First, “distributed inference orchestration” becomes its own product category instead of a grab bag of custom glue code. Second, backend choice may matter a bit less in isolation if a higher layer is absorbing more of the routing, scaling, and cache logic.

There is a familiar pattern here. Once a market matures, the fight often moves one layer up the stack. We have seen that in telecom inference distribution with NVIDIA's AI-grid push, and we have seen adjacent pressure in the portability debate around vLLM, Triton attention backends, and AMD support. The engine is still important. It is just no longer the whole story.

That does not mean every team now needs Dynamo. The README itself says the opposite for simple deployments. But it does mean NVIDIA has identified where the next serving bottleneck really lives: not only inside token generation, but in how an entire inference fleet gets coordinated under messy, bursty, long-context production traffic.

That is why NVIDIA Dynamo 1.0 deserves attention. Not because it is one more inference server launch. Because it is an attempt to become the layer above the launch.


Public source trail

These links anchor the package to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.

Primary source · developer.nvidia.com · NVIDIA Developer
Dynamo Inference Framework

Official product page describing Dynamo as an open-source distributed inference framework with planner, router, KV management, NIXL, and Grove.

Primary source · developer.nvidia.com · NVIDIA Developer Blog
NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models

Launch architecture post explaining disaggregated serving, planner behavior, smart routing, and KV cache offload.

Primary source · docs.nvidia.com · NVIDIA Docs
Overall Architecture

The clearest docs-backed explanation of Dynamo's request plane, control plane, and storage-events plane.

Primary source · github.com · GitHub
Dynamo README

README states directly that Dynamo is the orchestration layer above inference engines rather than a replacement for them.

Supporting reporting · baseten.co · Baseten
How Baseten achieved 2x faster inference with NVIDIA Dynamo

Third-party case study focused on KV-aware routing, with concrete latency and throughput claims from one deployment pattern.

Supporting reporting · nvidianews.nvidia.com · NVIDIA Newsroom
NVIDIA Enters Production With Dynamo, the Broadly Adopted Inference Operating System for AI Factories

Use cautiously for ecosystem adoption claims and NVIDIA's own framing of Dynamo as an operating system for AI factories.


About the author

Lena Ortiz

Staff Writer

View author page

Lena tracks the economics and mechanics behind AI systems, from serving architecture and open-weight deployment to developer tooling, platform shifts, product decisions, and the operational tradeoffs that shape what teams actually run. Her reporting is aimed at builders and operators deciding what to trust, adopt, and maintain.

Published stories: 9 · Latest story: Mar 24, 2026 · Base: Berlin

Reporting lens: Operating leverage beats ideological posturing. Signature: If the cost curve moves, the product strategy moves with it.

Related reads

More reporting on the same fault line.

AI Infrastructure · Mar 22, 2026 · 7 min read

FlashAttention-4 makes Blackwell kernel work an economics story

FlashAttention-4 shows Blackwell-era AI economics will be shaped by attention kernel optimization and non-tensor bottlenecks, not FLOPs headlines alone. The loud number is throughput. The strategic story is who can turn Blackwell's non-tensor choke points back into useful work.
AI Infrastructure · Mar 20, 2026 · 6 min read

NVIDIA AI grids turn telcos into inference resellers

NVIDIA's AI grid pitch turns telecom networks into distributed inference sellers, but operators still need products developers and buyers will actually use. The AI-grid pitch is really a plan to turn the telecom footprint into sellable inference capacity.