Verification does not come for free

We watch movies to feel things we have not felt. We play games to live lives we will not live. We do it to see the world from somewhere outside ourselves, to enrich how we understand it. And long before any of us learned from a textbook, we learned from stories and play. Children build a picture of the world out of cartoons, novels, and games years before they sit down to study it formally.

If AI is going to understand the world the way humans do, this is where it has to learn too. Not from text alone, but from the dense, multimodal artifacts humans have spent a century making to teach themselves what reality looks and feels like.

That is the bet behind AgenticVBench, the first benchmark we are releasing at Philo Labs. It measures how well today's best AI agents can do the work of a film post-production team. Not because film is the only goal, but because film is one of the hardest, richest, and most honest playgrounds for multimodal AI.

Why AI has gotten so good at math and code

A lot of recent AI progress has come from training in domains where it is easy to check whether an answer is right. Math problems either solve or they don't. Code either runs or it doesn't. When you can grade the output, you can train the model to do better.

Even there, “right” is partial. A proof can be technically correct and mathematically uninteresting. Code can pass every test and still be bad code. But the part that is gradable was enough. Enough to teach models to reason, to solve, to chain steps together in ways that were out of reach three years ago.

The grading was already there. Mathematicians spent two thousand years building formal proofs. Engineers built decades of test infrastructure. AI walked in, used what was already there, and got better.

Creative work does not come with that grading. There is no test suite for a good cut. No formal proof that a film flows. The judgment lives in the eyes and hands of practitioners, passed down by example. To train AI on it, someone has to do the hard work of writing down what good looks like, in a way a computer can check. That is what AgenticVBench is.

Why film, and why now

Films and games are where multimodal intelligence has the furthest to go. Mulholland Drive asks you to hold lighting, music, dialogue, and continuity in your head for two hours and figure out what is real. Elden Ring builds meaning out of the interplay between environment, mechanics, and your own attention. These are not surface-level objects. They reward, and require, deep understanding.

If a model wants to learn the world the way humans do, these are the artifacts it has to read. And right now, no model can read them well. (For the longer argument on why, see our earlier essay, Why verification is harder than generation.)

What AgenticVBench measures

AgenticVBench gives AI agents one hundred real post-production tasks, across four kinds of work. Assembly: given a storyboard and a pile of clips, pick the right clip for each slot. Repair: find the defect in a cut (a frozen frame, a wrong scene, a color drift) and fix it. Sequencing: given a story and a shuffled set of clips, put them back in the right order. Repurpose: given hours of source footage and a brief, cut it down into a short deliverable that tells the story.

Tasks range from thirty minutes to a full week of expert human work. Twenty film professionals, averaging six years of experience, wrote them.

The benchmark tests two things at once. Can the agent see and hear what is actually in the footage? And can it produce something a director would accept? Post-production is where those two questions meet, which is why we started there.

The hardest part was making it gradable

Building the tasks was not the hard part. The hard part was making them gradable. To train AI on creative work, you need a grader. And creative judgment is not, by default, gradable.

Take a question a producer might actually ask: after a stranger watches this short, would they want to watch the full film? Phrase it that way, give it to three film experts, and they will agree less than 40% of the time. Not because the experts are bad, but because the question is. Subjective questions invite subjective answers, and a grader built on top of them is too noisy to train against.

The work was reframing the questions so they have answers. We break the big subjective question into concrete sub-questions, and define each one in advance. Take a short where two characters are dangling from climbing harnesses and the comedy hinges on one threatening to unhook the other. Instead of asking whether the twist lands, the rubric spells out exactly what counts: the climax must show one character actually unhooking the other's safety carabiner, with the unhook visibly on screen for at least two seconds (the gloved hand on the buckle, the rope separating, the character detaching from the harness), or called out by voiceover or subtitle. That is a question a grader can answer by watching. Three experts give the same answer.
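To make the shape of such a rubric concrete, here is a minimal sketch in Python of the harness example above. It is an illustration only, not the benchmark's actual rubric format: the `Observation` fields and the `climax_passes` function are hypothetical names, and the inputs are assumed to be what a grader records while watching the climax.

```python
from dataclasses import dataclass

# Threshold taken from the rubric text: "at least two seconds".
MIN_VISIBLE_SECONDS = 2.0


@dataclass(frozen=True)
class Observation:
    """What a grader noted while watching the climax (hypothetical fields)."""
    unhook_visible_seconds: float   # time the unhook is visibly on screen
    called_out_by_voiceover: bool   # the unhook is stated in voiceover
    called_out_by_subtitle: bool    # the unhook is stated in a subtitle


def climax_passes(obs: Observation) -> bool:
    """Pass if the unhook is shown long enough OR explicitly called out."""
    shown = obs.unhook_visible_seconds >= MIN_VISIBLE_SECONDS
    called_out = obs.called_out_by_voiceover or obs.called_out_by_subtitle
    return shown or called_out
```

The point of the sketch is that every input is something a grader can observe directly, so two graders who watched the same cut should fill in the same values.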

Writing that rubric takes more than craft. A great mathematician is not always a great math teacher. A senior director can see instantly that a cut is wrong without being able to name, in advance, the rule that would catch it. Articulating the rule is a different skill from having the taste. We spent many iterations refining how we work with practitioners in our community, selecting the ones who can sit beside a researcher and jointly turn instinct into tasks and rubrics.

When the rubric can be compressed into code, we compress it. A programmatic verifier is the cleanest grader we know: deterministic, fast, and identical every time it runs. We push every rubric toward that form as far as it will go. When the question resists code, when it requires watching rather than running, we fall back to expert grading. That is where inter-rater agreement matters, and where we measure it.
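As one illustration of what a programmatic verifier can look like, consider a Sequencing-style task, where the agent returns an ordering of clip IDs. The sketch below is an assumption for illustration, not the verifier AgenticVBench ships: `sequencing_score` is a hypothetical function that scores the fraction of clip pairs placed in the correct relative order.

```python
from itertools import combinations
from typing import Sequence


def sequencing_score(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of clip pairs whose relative order matches the reference.

    Returns 1.0 for a perfect ordering and 0.0 for a fully reversed one.
    Deterministic: the same inputs always produce the same score.
    """
    if len(set(reference)) != len(reference):
        raise ValueError("reference contains duplicate clip IDs")
    if sorted(predicted) != sorted(reference):
        raise ValueError("predicted must contain exactly the reference clips")

    position = {clip: i for i, clip in enumerate(predicted)}
    pairs = list(combinations(reference, 2))  # (earlier, later) in the reference
    if not pairs:
        return 1.0  # zero or one clip: nothing to get wrong
    correct = sum(1 for a, b in pairs if position[a] < position[b])
    return correct / len(pairs)
```

A pairwise score gives partial credit for a nearly correct order. A stricter verifier could simply require an exact match.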

Done this way, the same three experts now agree 95% of the time. That is the achievement that took the most work, and it is what makes the rest possible.
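For readers who want to see how an agreement figure like this can be computed, here is a minimal sketch. It assumes "agree" means all raters gave the same answer on an item. The exact definition behind the numbers above is not spelled out here, so treat `unanimous_agreement` as a hypothetical helper, and note that the ratings in the usage example are made-up toy data, not benchmark results.

```python
from typing import Hashable, Sequence


def unanimous_agreement(ratings: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of items on which every rater gave the same answer.

    `ratings` holds one inner sequence per item, with one answer per rater.
    """
    if not ratings:
        raise ValueError("need at least one rated item")
    unanimous = sum(1 for item in ratings if len(set(item)) == 1)
    return unanimous / len(ratings)


# Toy data for illustration only: four items, three raters each.
toy_ratings = [
    ("pass", "pass", "pass"),
    ("pass", "fail", "pass"),
    ("fail", "fail", "fail"),
    ("fail", "fail", "fail"),
]
toy_rate = unanimous_agreement(toy_ratings)  # 3 of 4 items are unanimous
```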

The gap

The best AI agent we tested barely clears 30%. Human experts score 89%. The gap is real.

A grader on which experts agree 95% of the time is not only an eval. It is a reward model.

In plain terms: once you can grade an output reliably, you can train against the grading.
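A rough sketch of that idea, under stated assumptions: a set of rubric checks becomes a scalar reward, and the reward ranks candidate outputs. Nothing here describes an actual training setup. `rubric_reward`, `best_candidate`, and the toy checks are hypothetical, and picking the best of several candidates stands in for whatever learning algorithm would consume the reward.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def rubric_reward(output: T, checks: Sequence[Callable[[T], bool]]) -> float:
    """Fraction of rubric checks the output passes, between 0.0 and 1.0."""
    if not checks:
        raise ValueError("need at least one rubric check")
    return sum(1 for check in checks if check(output)) / len(checks)


def best_candidate(candidates: Sequence[T],
                   checks: Sequence[Callable[[T], bool]]) -> T:
    """Return the candidate with the highest rubric reward (first on ties)."""
    return max(candidates, key=lambda c: rubric_reward(c, checks))


# Toy checks for illustration: a cut is a dict of facts a grader recorded.
toy_checks = [
    lambda cut: cut["duration_s"] <= 60,  # fits the brief's length limit
    lambda cut: cut["shows_climax"],      # the climax is on screen
]
toy_cuts = [
    {"name": "cut_a", "duration_s": 45, "shows_climax": False},
    {"name": "cut_b", "duration_s": 58, "shows_climax": True},
]
```

The reward is only as trustworthy as the checks behind it, which is why the agreement work above comes first.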

We are happy to be the first to push verification for visual AI: to take creative work, make it verifiable, and let AI learn it through agentic reinforcement learning. The path from here runs toward what we think of as self-improving creative superintelligence. Systems that can both perceive and make at the level of human craft, and get better at it on their own. Film is where we are starting. Games and interactive media are next.

That is the world we want to help build. One where productivity work is automated, and the rest of us are free to create, to consume, and to ponder.

See the benchmark