When a cinematographer shoots a scene, they rarely get everything right in a single take. A crew member walks into frame. A prop lands in the wrong spot. A production logo needs to be wiped from a background for international distribution. Post-production fixes for these problems have historically required either a costly reshoot or hours of painstaking frame-by-frame editing. An AI model called VOID, developed jointly by Netflix and INSAIT at Sofia University, is designed to automate that work and do it better than existing tools.
The model was announced on April 8, 2026, and is immediately available as open-source software. The paper is on arXiv, the code is on GitHub under Netflix's repository, and a live demo runs on Hugging Face. For a research collaboration that produced something genuinely production-relevant, open-sourcing the whole stack on day one is a notable choice.
What VOID Actually Does
Most video inpainting tools work by treating the removed object's region as a hole that needs to be filled with plausible-looking background pixels. The system looks at surrounding frames and neighboring areas, then generates pixels that blend in visually. The problem is that this approach is purely cosmetic. It fills the gap without understanding what caused it.
VOID takes a different approach: it models physical causality. If you remove a person who is holding a glass, a purely visual fill would simply erase the person and paint in the background as if neither person nor glass had ever existed. VOID simulates what would actually happen next: the glass falls, hits the table, slides, rolls. The model tracks the physical interaction between the removed object and everything it was touching or affecting, then generates frames that show those downstream consequences playing out naturally.
Think of the difference this way. A standard inpainting tool is like a photo editor who paints over a blemish. VOID is more like a visual effects artist who understands the physics of the scene well enough to reconstruct what the footage would have looked like if the removed element had never been there in the first place.
The practical use cases are immediate. A character holding a prop needs to be cut from a licensed clip for redistribution. A brand logo needs to be removed from background signage for regulatory compliance in certain markets. A background actor crossed into a foreground shot. A cable rig is visible in an action sequence. These are exactly the kinds of fixes that currently require either expensive VFX time or impractical reshoots, and VOID is designed to handle them automatically.
The Architecture: CogVideoX, Quadmask, and Synthetic Data
VOID is built on top of CogVideoX, a large-scale video generation model. But the interesting engineering is in what INSAIT and Netflix added on top of that foundation.
The team developed a masking technique they call quadmask, which distinguishes between four types of regions in any given frame: the object being removed, the interaction zone around that object (where it touches or influences other elements), the background that would be visible after removal, and the temporal continuity regions that need to remain consistent across frames. By explicitly labeling these four zones rather than treating the removal area as a single undifferentiated hole, the model can apply different generation strategies to each type of region.
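To make the four-zone idea concrete, here is a minimal sketch of what a per-frame quadmask could look like as a label map. The region names, numeric encoding, and `make_quadmask` helper are all invented for illustration; the paper's actual representation may differ, and the continuity zone (which depends on temporal tracking across frames) is omitted from this single-frame toy.

```python
import numpy as np
from enum import IntEnum

# Hypothetical label values -- the paper's actual encoding may differ.
class Region(IntEnum):
    BACKGROUND = 0   # background revealed after removal
    OBJECT = 1       # the object being removed
    INTERACTION = 2  # zone where the object touches or influences other elements
    CONTINUITY = 3   # regions that must stay consistent across frames (not built here)

def make_quadmask(h, w, object_box, interaction_margin):
    """Build a toy per-frame quadmask: the object's bounding box, a dilated
    ring around it as the interaction zone, everything else as background."""
    mask = np.full((h, w), int(Region.BACKGROUND), dtype=np.uint8)
    y0, y1, x0, x1 = object_box
    m = interaction_margin
    # Interaction zone: the dilated box, later overwritten by the object itself.
    mask[max(0, y0 - m):min(h, y1 + m),
         max(0, x0 - m):min(w, x1 + m)] = Region.INTERACTION
    mask[y0:y1, x0:x1] = Region.OBJECT
    return mask

mask = make_quadmask(64, 64, object_box=(20, 40, 20, 40), interaction_margin=5)
print(np.bincount(mask.ravel(), minlength=4))  # pixel count per region label
```

The point of the explicit label map is that downstream generation can branch on it: interaction-zone pixels get physics-aware synthesis while plain background pixels get conventional fill.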
The training data problem was harder to solve. You cannot easily collect real video footage of objects being removed from scenes, because such footage does not exist in the wild at scale. What does exist in abundance is a tool for simulating it: Blender, the open-source 3D modeling software. Netflix and INSAIT built a synthetic dataset by generating video scenes in Blender, rendering them with specific objects present, then removing those objects and re-rendering to create paired training examples showing exactly what each scene should look like before and after removal. The model learned the physics of removal from simulated environments and then generalized to real footage.
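The paired-rendering idea can be sketched as simple bookkeeping: for each scene and each removable object, render the clip twice, once with the object present and once without. The file naming, field names, and `paired_examples` generator below are invented for illustration; the actual pipeline drives Blender's rendering API and its details are in the released code.

```python
import itertools
import json

def paired_examples(scene_ids, object_ids):
    """Yield metadata records pairing a 'with object' render against the
    identical re-render with that object deleted -- the ground-truth target."""
    for scene, obj in itertools.product(scene_ids, object_ids):
        yield {
            "scene": scene,
            "removed_object": obj,
            "input_clip": f"{scene}/with_{obj}.mp4",     # object present
            "target_clip": f"{scene}/without_{obj}.mp4", # re-rendered after removal
        }

pairs = list(paired_examples(["scene_001", "scene_002"], ["glass", "chair"]))
print(json.dumps(pairs[0], indent=2))
```

Because both clips come from the same simulated scene, the target is not a guess at plausible background but the exact footage the physics engine produces without the object, which is what lets the model learn removal as a causal operation.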
| Component | Role in VOID |
|---|---|
| CogVideoX backbone | Video generation foundation; handles temporal consistency across frames |
| Quadmask technique | Distinguishes object, interaction zone, background, and continuity regions |
| Blender synthetic training data | Provides physically accurate paired before/after removal video examples |
| Physical causality modeling | Simulates how removed objects affect surrounding scene dynamics |
This approach sidesteps one of the central bottlenecks in video AI research: the scarcity of labeled training data. Real-world video data is plentiful, but curated data showing specific editing operations with ground-truth outputs is not. Synthetic data generation through simulation is increasingly how teams at the frontier are solving this problem, and the Netflix-INSAIT pipeline offers a clear example of how to do it for physical object removal.
INSAIT: Bulgaria's Unlikely AI Research Powerhouse
The institutional context here is worth understanding. INSAIT is a research institute at Sofia University in Bulgaria, founded in 2022 with backing from the Bulgarian government and partnerships with ETH Zurich and EPFL. By most conventional metrics, Sofia would not appear on a short list of global AI research centers. The institute has been changing that perception systematically.
In early 2026, INSAIT announced eight papers accepted at ICLR 2026, ranking first in Eastern Europe. The institute has ongoing joint programs with MIT's Computer Science and Artificial Intelligence Laboratory and has produced research that appeared in Nature Reviews Electrical Engineering. Google has provided over $1.5 million in grants to support the work. Its founder, Prof. Martin Vechev, previously held a chair at ETH Zurich and has built the institute with an explicit mandate to recruit world-class researchers to Eastern Europe.
The Netflix partnership fits that pattern. Rather than simply licensing research from a Western institution, Netflix brought its production and engineering problems to INSAIT and co-developed a solution. The result is a model that reflects real-world use cases rather than academic benchmarks, and that is now free for anyone to build on.
> "VOID highlights the role of INSAIT and the Bulgarian research community in creating globally significant technologies that could transform how video content is produced and edited."
>
> INSAIT Institute statement on VOID release, April 8, 2026
Why Netflix Built This With a Bulgarian University
The collaboration raises a natural question: why partner with INSAIT rather than build this in-house or commission it from a larger established AI lab?
Part of the answer is INSAIT's specific depth in video AI and computer vision. The institute had previously published the StateSpaceDiffuser model for video generation and the Physics-IQ benchmark for evaluating whether AI models understand physical behavior in videos. That second project is directly relevant to VOID: if you want to build a model that correctly simulates what happens when an object is removed from a scene, you need researchers who think carefully about physical plausibility in video generation. INSAIT had done that work already.
Netflix, for its part, has a continuous need for exactly the kind of editing capabilities VOID provides. The platform serves over 300 million subscribers globally across wildly different regulatory environments, licensing requirements, and content standards. Automated object removal at production quality has genuine financial value for a company that constantly adapts content for different markets and distribution windows.
The decision to open-source the result also makes strategic sense. Netflix is not primarily in the AI tools business, and open-sourcing VOID allows the broader research community to improve it faster than Netflix could internally. INSAIT gains credibility and visibility. The field gains a new capability. The arrangement is mutually beneficial in a way that purely commercial partnerships often are not.
What Production Workflows Could Change
For film and television editors, VOID's most immediate value is in reducing the back-and-forth between editorial and visual effects departments. A significant portion of VFX work in production is not creative spectacle but corrective maintenance: removing unwanted elements, cleaning up errors, and fixing continuity problems. Those tasks are time-consuming and expensive, but they are also highly systematic, exactly the kind of work that benefits from automation.
Consider a typical scenario in episodic television. A background actor appears in a scene they were not meant to be in. Under current workflows, this might require a VFX supervisor to review the footage, scope the fix, assign it to a compositor, wait for the work to be done, and review the output before locking the edit. With a tool like VOID running on internal infrastructure, the same fix could potentially be generated automatically for editorial review, cutting the cycle time from days to minutes.
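That review loop can be sketched in a few lines. Everything here is hypothetical: `FixRequest`, the status values, and the removal callable are invented names, and no such workflow API ships with the VOID release; the sketch only illustrates the "generate automatically, commit only after human review" pattern.

```python
from dataclasses import dataclass

# Hypothetical workflow sketch -- these names are invented for illustration.
@dataclass
class FixRequest:
    shot_id: str
    mask_path: str          # where the removal mask for the shot lives
    status: str = "pending"

def process(queue, run_removal):
    """Run automated removal on each request, then park the result for
    editorial review rather than committing it straight to the edit."""
    for req in queue:
        succeeded = run_removal(req.shot_id, req.mask_path)
        req.status = "awaiting_review" if succeeded else "failed"
    return queue

queue = [FixRequest("ep04_sc12_sh03", "masks/extra_actor.png")]
done = process(queue, run_removal=lambda shot, mask: True)  # stub for the model call
print(done[0].status)
```

The design choice worth noting is the terminal state: the automated pass never locks the edit itself, so the cycle-time saving comes from eliminating the compositor round-trip, not the human sign-off.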
The implications extend to archival and licensing workflows too. Studios sitting on decades of content face ongoing challenges in clearing footage for new distribution contexts. An actor whose contract expired. A product placement that no longer applies. A network bug that was on-screen during the original broadcast. VOID-class tools offer a path to updating archival content without manual frame-by-frame intervention.
The technology is not production-ready in the sense that a Netflix editor can drop it into their timeline tomorrow. It requires computational resources, integration with existing editing pipelines, and quality validation for specific use cases. But the underlying capability is now publicly available, and the path from research to production tool is shorter than it used to be.
Limitations and Open Questions
VOID is a research release, and the paper is appropriately honest about its limitations. The model performs best on objects with clear physical interactions and well-defined interaction zones. Complex scenes with multiple overlapping objects, fast motion, or unusual lighting conditions present harder problems that the current version does not fully solve.
There is also the question of computational cost. Large video generation models are expensive to run, and VOID's quality comes partly from the scale of the underlying CogVideoX backbone. Running the model at production speed and scale requires meaningful GPU infrastructure, which limits who can use it outside of well-resourced organizations.
The open-source release will help address some of these issues over time. Other researchers can now build on the quadmask approach, extend the training data, and optimize the inference pipeline. The Hugging Face demo gives non-technical users a way to test the model and generate feedback about failure modes that the researchers may not have anticipated.
What This Means for AI-Assisted Production
VOID is one data point in a larger trend: the steady movement of AI capabilities from research demonstrations into tools that address specific, high-value production problems. The model is not the first to tackle video inpainting, and it will not be the last. What distinguishes it is the physical causality modeling, the open-source release, and the institutional collaboration that produced it.
For anyone building video production tools, the paper and code are worth reading carefully. The quadmask approach to distinguishing interaction zones from simple background fill is a generalizable insight that could apply beyond object removal to other kinds of video editing tasks. The synthetic data generation pipeline using Blender is a practical template for any team facing the labeled-data scarcity problem in video AI.
And for anyone watching where serious AI research is happening, INSAIT is now on the map in a way that should inform how the industry thinks about talent and collaboration. The institute is not a novelty. It is producing work that Netflix thought worth co-authoring and open-sourcing to the world.