vLLM 0.18.0 points to a split serving stack for multimodal inference
vLLM 0.18.0 signals a split multimodal serving stack, with render, transport, and GPU inference starting to separate into cleaner infrastructure tiers.
vLLM's most interesting 0.18.0 move is not one faster endpoint. It is the idea that multimodal serving can be split into cleaner tiers.

The easy way to read vLLM 0.18.0 is as a release-note bundle: gRPC serving support, GPU-less render serving, and scheduler gains for PD disaggregation all landed in one release. That reading is accurate, but it misses the more interesting pattern.
The real signal is architectural. vLLM is starting to make it practical to split multimodal serving into cleaner layers, where request rendering, transport, and GPU inference do not have to live on the same machine or even inside the same process boundary.
That does not mean most teams will suddenly deploy a beautifully disaggregated serving mesh next week. It does mean an important open-source serving stack is now exposing exactly the seams operators need if they want to stop treating multimodal inference as one monolithic box.
That is the story here, and it is different from our recent piece on vLLM's Triton attention backend and AMD portability. That article was about backend credibility across hardware vendors. This one is about where the serving stack itself gets cut apart.
gRPC matters here, but not for the usual reason
The release notes frame gRPC serving as a new option alongside the existing HTTP path, and that is how it should be read. The current OpenAI-compatible server docs still describe the standard HTTP interface as a first-class way to run vLLM. Nothing in 0.18.0 says HTTP is suddenly obsolete.
The more revealing detail sits in the gRPC PR. vLLM adds a --grpc flag to vllm serve, but the actual servicer is lazy-imported from a separate smg-grpc-servicer package, with proto stubs pulled from smg-grpc-proto. In other words, the transport layer is being loosened from the core runtime rather than fused more tightly into it.
That matters because it points to a serving stack that is getting more modular at the edges. Operators who want HTTP can keep HTTP. Teams that want RPC-style integration can add gRPC without waiting for the whole runtime to behave like one permanent in-tree endpoint. The strategic point is not that gRPC is inherently superior. It is that transport is becoming more replaceable.
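In practice, the opt-in shape looks something like the following. The `--grpc` flag and the optional `vllm[grpc]` extra are named in the release material; the model name and the exact invocation details here are illustrative placeholders, not a verified recipe.

```shell
# Hedged sketch: flag and extra names come from the 0.18.0 release
# material; the model name below is an illustrative placeholder.

# Install the optional gRPC extra, which pulls in the separately
# versioned servicer and proto packages rather than in-tree code.
pip install "vllm[grpc]"

# Default path: the existing OpenAI-compatible HTTP server, unchanged.
vllm serve Qwen/Qwen2.5-7B-Instruct

# Opt-in path: the same serve command with the new transport flag.
vllm serve Qwen/Qwen2.5-7B-Instruct --grpc
```

The point of the lazy-imported servicer is visible even in this sketch: teams that never pass `--grpc` never pay for, or depend on, the gRPC packages at all.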
That sounds mundane. It is not. Monolithic serving stacks usually hide transport assumptions deep enough that every new interface becomes a platform commitment. Here, vLLM is moving in the opposite direction. The transport layer is starting to look like a choice.

For teams already thinking about service boundaries, that is useful. A preprocessing tier, an ingress layer, and a GPU inference tier do not need to share the same client contract forever. Some environments will still prefer the plain OpenAI-compatible HTTP surface. Others may want gRPC for internal service-to-service calls. The important thing is that the runtime is no longer pushing quite as hard toward one serving shape.
The bigger change is vllm launch render
If the gRPC addition loosens the transport boundary, the new vllm launch render path loosens something even more expensive: the assumption that preprocessing has to sit next to GPU inference.
The render server is explicitly CPU-only. According to the PR, it can run the request preprocessing pipeline — chat template rendering, tokenization, tool parsing, reasoning parsing, and input preparation — without a GPU and without an inference engine. That is a sharper shift than it may sound at first glance.
Multimodal systems have been moving in exactly the wrong operational direction for years. They keep accumulating more work before the model proper ever starts generating: media fetching, image decoding, prompt shaping, tool schema handling, tokenizer quirks, and more. vLLM's own multimodal inputs documentation is a reminder that modern request payloads can involve images, video frames, audio extraction, and media-domain controls before you even reach the token loop.
That kind of preprocessing does not always belong on your expensive GPU node.
In fact, it often very obviously does not. CPU-heavy media handling, request sanitation, and template rendering can compete with inference for memory headroom, operational simplicity, and failure isolation. Once vLLM exposes a formal render tier, the stack starts to look less like one model server and more like a pipeline: prepare requests here, move them cleanly across the network, and spend GPU time only on the part that truly needs the accelerator.
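The boundary the render tier draws can be sketched in a few lines. Everything below is a hypothetical illustration of the split, not vLLM's actual API: the `render`/`generate` names, the stand-in template, and the byte-level "tokenizer" are all invented to show what crosses the network between tiers.

```python
# Hypothetical sketch of the render/inference split: all names here are
# illustrative, not vLLM's APIs. The point is the boundary, not the code.

from dataclasses import dataclass


@dataclass
class PreparedRequest:
    # What crosses the network from the CPU render tier to the GPU tier:
    # token ids and generation params, not raw media or chat payloads.
    token_ids: list
    max_new_tokens: int


def render(chat, max_new_tokens=64):
    """CPU-only tier: template the chat and tokenize it (stand-in logic)."""
    # Stand-in chat template; a real render tier applies the model's own
    # template, tool parsing, and multimodal preprocessing.
    prompt = "".join(f"<{m['role']}>{m['content']}" for m in chat)
    # Stand-in tokenizer: raw byte values in place of a real vocabulary.
    token_ids = list(prompt.encode("utf-8"))
    return PreparedRequest(token_ids=token_ids, max_new_tokens=max_new_tokens)


def generate(req):
    """GPU tier: consumes prepared token ids only (stand-in echo model)."""
    return req.token_ids[: req.max_new_tokens]


prepared = render([{"role": "user", "content": "hi"}])
out = generate(prepared)
```

The GPU tier in this sketch never sees a chat payload, an image, or a template, which is exactly the failure-isolation and memory-headroom argument for pulling preprocessing off the accelerator node.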
Readers of our piece on open-weight inference economics will recognize why this matters. GPU costs are not only decided by list price. They are decided by how much useless neighboring work you force those machines to carry.
PD disaggregation stops looking like an edge case
The third change is easier to overlook because it sounds narrower: scheduler overhead improvements for PD disaggregation. But the details are revealing.
The PR targets async remote KV loads in prefill/decode disaggregation. Instead of repeatedly popping and re-prepending blocked requests in the main waiting queue, the scheduler now keeps a separate queue for remote-KV waits and promotes requests when finished_recving fires. The author reports around 5% end-to-end performance improvement in that setup.
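The queueing idea is simple enough to sketch. This is a minimal illustration of the described mechanism, not vLLM's scheduler code: requests blocked on remote KV transfer sit in their own structure and are promoted when their transfer completes, instead of being popped from and re-prepended to the main waiting queue on every scheduling pass.

```python
# Minimal sketch of the two-queue idea described in the PR, not vLLM's
# actual scheduler. Requests waiting on remote KV loads never churn
# through the main waiting queue; they are promoted on completion.

from collections import deque


class TwoQueueScheduler:
    def __init__(self):
        self.waiting = deque()            # requests ready to schedule
        self.waiting_for_remote_kv = {}   # request id -> request

    def add(self, req_id, needs_remote_kv):
        if needs_remote_kv:
            self.waiting_for_remote_kv[req_id] = req_id
        else:
            self.waiting.append(req_id)

    def finished_recving(self, req_id):
        # Remote KV load completed: promote to the main waiting queue.
        req = self.waiting_for_remote_kv.pop(req_id)
        self.waiting.append(req)

    def schedule(self):
        # The hot path only ever sees schedulable requests; blocked ones
        # are not popped and re-prepended each pass.
        batch = list(self.waiting)
        self.waiting.clear()
        return batch


sched = TwoQueueScheduler()
sched.add("a", needs_remote_kv=False)
sched.add("b", needs_remote_kv=True)
first = sched.schedule()    # only "a"; "b" is parked, not churned
sched.finished_recving("b")
second = sched.schedule()   # "b" arrives once its KV transfer is done
```

The reported ~5% end-to-end gain plausibly comes from exactly this kind of removed churn: the main queue stops paying per-pass costs for requests that cannot be scheduled yet.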
You do not need to take the exact number on faith to see why it matters. This is scheduler work aimed at a system where prefill and decode are already split enough that remote KV transfer is a first-order concern. That is not the kind of optimization you prioritize if you think one-box serving is the long-term center of gravity.

Put the three changes together and the pattern gets hard to ignore. One change loosens transport. Another peels preprocessing away from GPU inference. A third reduces control-plane friction in a disaggregated prefill/decode setup. None of these changes alone proves a new industry standard. Together, they look like a stack getting ready for split serving topologies.
That should also sound familiar if you read our FlashAttention-4 analysis. The performance fight keeps moving away from the obvious headline layer and toward the plumbing around it. In Blackwell-land that meant kernel work around non-tensor bottlenecks. In vLLM 0.18.0 it means the serving runtime is getting more explicit about where preprocessing, transport, and cache movement belong.
Operators should be interested, but not romantic
There is an easy way to oversell this release, and it would be sloppy.
First, do not confuse interface diversity with adoption. The presence of --grpc does not tell us how many production fleets will choose it. Second, do not confuse CPU-only render serving with a turnkey multimodal control plane. Plenty of ugly integration work still sits between a neat render endpoint and a real split-stack deployment. Third, do not assume disaggregated prefill automatically wins. It adds its own network, queueing, and observability headaches.
Still, the operator case is real.
If you are serving multimodal workloads, there is a strong argument for isolating request rendering and media preparation from accelerator-heavy generation. If you are experimenting with internal service meshes, transport optionality matters even when HTTP remains perfectly fine at the edge. And if you are pushing toward split prefill/decode architectures, scheduler work around remote KV movement is exactly the kind of boring improvement that ends up determining whether the idea survives contact with production.
This is also why the piece belongs beside our OpenAI agents platform analysis, even though the subject is different. The larger infrastructure trend is the same: the valuable layer is increasingly the orchestration layer, not just the endpoint somebody demos on stage.
What to watch after 0.18.0
The next proof point is not whether people tweet about gRPC. It is whether vLLM and its surrounding ecosystem keep deepening the split-stack path.
Watch for three things. One: whether render serving becomes a normal staging tier for multimodal deployments rather than a niche utility route. Two: whether transport diversity expands without turning the runtime into a fragmented mess. Three: whether disaggregated prefill/decode work keeps receiving scheduler, caching, and deployment attention instead of sitting as an advanced-user side alley.
If those things happen, vLLM 0.18.0 will read less like a grab bag of quality-of-life features and more like an early architectural marker. Open-model serving will still have plenty of one-box deployments. It should. Simplicity is a feature. But the interesting frontier is no longer hard to see.
The stack is starting to split.
Public source trail
These links anchor this piece to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.
- Release notes confirming the 2026-03-20 ship date and the three user-facing changes at the center of this piece: gRPC serving, GPU-less render serving, and PD-disaggregation scheduler gains.
- Primary source for the new `vllm serve --grpc` flag, the optional `vllm[grpc]` install path, and the decision to move the servicer into a separately versioned package.
- Primary source for the CPU-only render server that handles chat template rendering, tokenization, and request preprocessing without a GPU or inference engine.
- Primary source for the scheduler changes that reduce queue churn around async remote KV loads in prefill/decode disaggregation.
- Confirms the existing HTTP OpenAI-compatible serving path remains a first-class interface.
- Grounds the article's claims about multimodal preprocessing complexity and the operational implications of separating media handling from GPU inference.

Lena Ortiz
Lena tracks the economics and mechanics behind AI systems, from serving architecture and open-weight deployment to developer tooling, platform shifts, product decisions, and the operational tradeoffs that shape what teams actually run. Her reporting is aimed at builders and operators deciding what to trust, adopt, and maintain.
- Published stories: 8
- Latest story: Mar 24, 2026
- Base: Berlin
Reporting lens: Operating leverage beats ideological posturing. Signature: If the cost curve moves, the product strategy moves with it.


