
Netflix VOID model removes the object and the aftermath

Netflix has released VOID, a video-editing model that removes objects and their physical aftermath, but using it still means masks, two passes, and 40GB-plus VRAM.

Filed Apr 4, 2026 · 6 min read
Editorial illustration of a removed person leaving behind a physically updated scene, with a guitar dropping through empty space and a quadmask-style overlay marking object, overlap, affected regions, and background.
The clever part is not deleting the person. It is deleting the trouble they caused.

Netflix has publicly released VOID, and it is one of those rare model drops where the interesting part is not a shinier demo clip. It is the causal mess underneath. VOID is a video-editing model that removes an object from a scene and tries to remove the physical interactions that object caused, according to the Hugging Face release, GitHub repo, project page, and paper.

That sounds small until you picture the standard failure mode. Normal video object removal is pretty good at patching over the space where something used to be. Shadows, reflections, missing background, fine. But if you remove the person who was holding the guitar, a normal system often leaves the guitar floating there like it has hired its own lawyer. VOID is trying to fix that part too.

I find that more interesting than yet another generic video-model launch. This is not a consumer Netflix feature wedged between "Skip Intro" and "Next Episode." It is a public research release with code, checkpoints, a demo Space, and a Colab notebook. Public, yes. Easy, no.

What the Netflix VOID model actually does

The core pitch is interaction-aware deletion. The paper says existing methods handle appearance-level cleanup, but they fall apart when the removed object had meaningful physical effects on the scene. VOID is built for those cases. Remove the object, then update the downstream action so the shot still behaves like the world remembers gravity.

Physics keeps receipts.

The project page gives a simple example: if you remove a person holding a guitar, VOID also removes the person's effect on the guitar, so it can fall naturally instead of hanging in the air like a continuity error with attitude. The GitHub samples make the same point with smaller scenes, like a lime falling on a table or a ball rolling after another object disappears.

That difference matters because ordinary inpainting is a bit like repainting the wall after you remove a picture frame. VOID is trying to repaint the wall and explain why the vase never got knocked off the shelf. That is a much harder job, and also a much more useful one if video editing models are supposed to do more than cosmetic cleanup.

Editorial explainer figure showing VOID's quadmask logic across four regions: object to remove, overlap area, affected interaction region, and preserved background.
Figure / 01: VOID does not just mark the object. It also marks the places where the object's existence changed the rest of the shot.

Why Netflix VOID video object removal is different

The clever mechanism is the quadmask. Instead of the usual "delete this blob and pray" setup, VOID uses four values: the primary object to remove, overlap regions, affected regions, and background to keep. The Hugging Face model card and GitHub repo spell this out clearly.

That affected-region lane is the whole story. During inference, a vision-language step identifies what else in the frame should change because the object is gone. The repo's mask pipeline uses SAM2 for segmentation and Gemini for reasoning about those interaction-affected regions. In other words, the model is not just painting over pixels with a larger mop. It is getting a structured hint about where causality spread through the scene.
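To make the quadmask idea concrete, here is a minimal sketch of how those four regions could be packed into one label map per frame. This is an illustration, not VOID's actual file format: the four category names come from the model card, but the integer encoding, the function name, and the order in which the masks are combined are all assumptions.

```python
import numpy as np

# Hypothetical encoding of the four quadmask categories.
# The categories are from the release; these integer values are assumed.
BACKGROUND, OBJECT, OVERLAP, AFFECTED = 0, 1, 2, 3

def build_quadmask(object_mask, affected_mask):
    """Combine two boolean masks into one four-valued label map.

    object_mask:   pixels of the thing to remove (e.g. the person)
    affected_mask: pixels the object physically influenced (e.g. the guitar)
    """
    quad = np.full(object_mask.shape, BACKGROUND, dtype=np.uint8)
    quad[affected_mask] = AFFECTED                # regions that must change, not vanish
    quad[object_mask] = OBJECT                    # the thing being deleted
    quad[object_mask & affected_mask] = OVERLAP   # where the two regions meet
    return quad
```

The point of the sketch is the third lane: the affected region is kept separate from the deletion target, so the model is told "repaint this differently" rather than "erase this too."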

I would not oversell that into "the model understands physics," because that is how you end up back in the swamp described in our piece on the benchmark trust recession. But it is a real step beyond standard object erasure. The system is being trained to generate a counterfactual video, not merely a cleaner patch.

How to use the Netflix VOID model right now

The good news is that this is a genuine public package. As of April 4, the release surfaces are live across Hugging Face, GitHub, the project page, a demo Space, and a Colab notebook. That is already more useful than a paper with a Vimeo link and a prayer.

The less-good news is the hardware bill. The quick-start notebook says it requires a GPU with 40GB+ VRAM, with an A100 given as the example. So yes, it is public. It is just public in the way a professional pizza oven is public: you can absolutely use it if you happen to own the restaurant.

The easiest route is the notebook, but the full workflow is still fairly serious. You install dependencies, download the base CogVideoX-Fun-V1.5-5b-InP model plus VOID's checkpoints, prepare a source video, generate a quadmask, and provide a prompt describing the scene after removal. If you want the full mask pipeline, the repo also expects SAM2 and a Gemini API key for the reasoning stage.
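As a rough illustration of how many moving parts that checklist involves, here is a hypothetical preflight helper. Every path in it is an assumption made for illustration; only the base model name (CogVideoX-Fun-V1.5-5b-InP) comes from the release, and the repo's actual layout will differ.

```python
from pathlib import Path

# Hypothetical artifact checklist for the workflow described above.
# All directory and file names here are assumptions, not the repo's layout.
REQUIRED = [
    "models/CogVideoX-Fun-V1.5-5b-InP",  # base model named in the release
    "models/void-pass1",                 # VOID checkpoint for pass one (assumed path)
    "input/source.mp4",                  # your source video (assumed path)
    "input/quadmask.mp4",                # the four-valued mask video (assumed path)
]

def missing_artifacts(root="."):
    """Return the workflow inputs that are not yet in place under root."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]
```

If you also want the full SAM2-plus-Gemini mask pipeline rather than hand-made masks, a real version of this check would additionally verify the SAM2 weights and that a Gemini API key is set.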

Editorial diagram of the VOID workflow moving from clicked object, to mask generation, to pass-one video inpainting, to optional pass-two temporal refinement.
Figure / 02: The public package is real, but it is still a workflow: click, mask, prompt, run pass one, then optionally run a cleanup pass for consistency.

Netflix VOID model limits: 40GB VRAM, masks, and two passes

This is where the launch stops sounding magical and starts sounding honest, which I appreciate. VOID has two checkpoints. Pass 1 is the base inpainting model and is enough for most videos. Pass 2 is optional and uses optical-flow-warped latent initialization to improve temporal consistency on longer clips or clean up morphing artifacts. That is smart engineering, but it is still a two-step workflow, not a magic eraser.

There are other limits too. The default resolution is 384x672, max length is 197 frames, and the training setup described in the release used 8x A100 80GB GPUs. The repo also notes that licensing constraints prevent Netflix from shipping the prebuilt training data, so it releases the data-generation code instead. This is a serious research package, not a frictionless creator app.
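Those defaults are concrete enough to sanity-check before burning A100 hours. A minimal sketch, assuming you just want to flag inputs outside the stated limits; the function and its names are mine, not part of the repo:

```python
# Published defaults from the release: 384x672 resolution, 197-frame max.
MAX_FRAMES = 197
DEFAULT_RES = (384, 672)  # (height, width)

def check_clip(height, width, n_frames):
    """Return a list of ways the input clip violates the stated limits."""
    problems = []
    if (height, width) != DEFAULT_RES:
        problems.append(
            f"resolution {height}x{width} is not the default "
            f"{DEFAULT_RES[0]}x{DEFAULT_RES[1]}"
        )
    if n_frames > MAX_FRAMES:
        problems.append(f"{n_frames} frames exceeds the {MAX_FRAMES}-frame max")
    return problems
```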

That does not make the release less important. If anything, it makes it more useful for the people who actually care about reproducibility and access, which is the same practical frame we keep coming back to in open-weight inference economics. A public package with annoying constraints is still far more valuable than a glossy demo nobody can inspect.

One caution is obvious and worth saying plainly: cleaner video object removal also sharpens misuse risk. Better deletion tools make it easier to falsify scenes, not just tidy them. So the real upside of a public release is not "problem solved." It is that researchers and developers can inspect the method, test limits, and argue about safeguards in the open instead of squinting at a teaser trailer.

The bottom line is not that Netflix has built the next Sora. It has done something narrower and, honestly, more interesting. VOID tries to remove the shove, not just the shover. For video editing, that is a real leap.



Public source trail

These links anchor the package to the underlying reporting trail. They are not a substitute for judgment, but they do show where the reporting starts.

Hugging Face: netflix/void-model
Confirms the public model release, quick-start notebook, 40GB-plus VRAM guidance, quadmask format, and the two-pass checkpoint structure.

GitHub: Netflix/void-model
Main implementation source for setup, the SAM2-plus-Gemini mask pipeline, quadmask semantics, inference commands, and optional manual refinement.

Project page: void-model.github.io (Netflix / VOID authors)
Best concise explanation of the method, especially the causal interaction angle and the optional second pass for morphing artifacts.

arXiv: VOID: Video Object and Interaction Deletion
The paper abstract anchors the main claim: VOID targets physically plausible counterfactual video editing, not standard appearance-only removal.

Hugging Face: VOID demo Space
Confirms there is a public demo surface associated with the release.

Google Colab: VOID Colab notebook
Confirms there is a public notebook entry point for trying the released project, even though the hardware requirement remains heavy.


About the author

Idris Vale

Staff Writer


Idris writes about the institutional machinery around AI, but the lens is broader than policy alone: procurement frameworks, public-sector buying rules, platform leverage, compliance burdens, workflow risk, and the market structure hiding beneath product or infrastructure headlines. The through-line is practical power, not abstract theater.


Article details

Last updated
April 4, 2026
Public sources
6 linked source notes

