The generator's advantage

Generating a video is, at its core, a sampling problem. You draw from a learned distribution, and if the distribution is rich enough, some fraction of your samples will be compelling. You don't need every frame to be perfect. You don't even need most of them to be. You need the gestalt to be convincing - a plausible texture of reality.

This is why video generation has improved so rapidly. The generator has a structural advantage: it only needs to fool, not to understand. A diffusion model does not need to know that a hand has five fingers. It needs to produce pixel patterns that, to a human viewer, read as a hand. These are different problems with very different difficulty curves.

The progress has been extraordinary. In three years, we went from obviously synthetic motion to outputs that are, in many contexts, indistinguishable from camera footage. Every frontier lab has a video model. The generation stack is converging on well-understood architectures. The remaining gains are largely engineering - more compute, cleaner data, better schedules.

The hard part is over for generation. It has barely begun for verification.

What verification actually requires

Verification - determining whether a generated video is correct - sounds like it should be easier than generation. In language, this intuition holds. It is easier to check that a proof is valid than to produce one. It is easier to read code than to write it. Verification in language benefits from the discrete, compositional structure of text. You can check a claim against a fact. You can parse a logical chain. You can diff two strings.

In video, none of this applies.

A video is a high-dimensional continuous signal. A single frame at 1080p contains over two million pixels, each with three color channels - more than six million numbers per frame. At 30 frames per second, one minute of video is over 11 billion numbers. The space of possible videos is not just large - it is combinatorially incomprehensible. And the space of correct videos, given some intended meaning, is an infinitesimally thin manifold within that space.
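The arithmetic behind that figure is worth making explicit (a back-of-the-envelope sketch in Python):

```python
# Back-of-the-envelope dimensionality of one minute of 1080p video.
width, height = 1920, 1080
channels = 3                  # RGB
fps = 30
seconds = 60

pixels_per_frame = width * height                  # 2,073,600
values_per_frame = pixels_per_frame * channels     # ~6.2 million
total_values = values_per_frame * fps * seconds

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{total_values:,} numbers per minute")      # 11,197,440,000
```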

Verification in video requires understanding, not pattern matching. To know whether a generated video of a person walking through a kitchen is correct, you need to understand gravity, occlusion, material properties, lighting consistency, the biomechanics of human gait, the spatial layout of kitchens, the temporal coherence of object permanence. You need a model of physics. You need a model of the world.

The generator only needs to produce a plausible surface. The verifier needs to understand the reality underneath.

This is the fundamental asymmetry. And it gets worse.

The pixel-level problem

In language, errors tend to be discrete and locatable. A factual error is in a specific sentence. A logical fallacy is in a specific argument. You can point to it.

In video, errors are often distributed and subtle. A shadow that falls at the wrong angle. A reflection that doesn't quite match the scene geometry. A texture that drifts imperceptibly over 30 frames. A hand that has the right number of fingers but the wrong articulation dynamics. These errors are not in any single pixel. They are in the relationships between pixels across space and time.

This means verification cannot be decomposed into local checks. You cannot verify a video frame-by-frame, the way you might verify a document paragraph-by-paragraph. Visual correctness is holistic. It emerges from the coherence of the entire spatiotemporal volume. A single frame can look perfect in isolation and be obviously wrong in context - a cup that teleports two inches between frames, a shadow that reverses direction, a face that subtly changes identity.
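The point can be made concrete with a toy sketch. The tracked positions and the velocity threshold below are hypothetical, purely for illustration - real trackers and thresholds are far more involved:

```python
# Toy illustration: every frame passes a per-frame check, but the
# sequence fails a cross-frame one. Values are hypothetical tracked
# x-coordinates (in pixels) of a cup across five consecutive frames.
cup_x = [400, 401, 402, 460, 461]   # the cup "teleports" between frames 2 and 3

def per_frame_ok(x):
    # Any single frame looks fine in isolation: the cup is simply somewhere.
    return 0 <= x < 1920

def temporally_coherent(xs, max_step=10):
    # Coherence lives in the relationships between frames, not in any frame.
    return all(abs(b - a) <= max_step for a, b in zip(xs, xs[1:]))

assert all(per_frame_ok(x) for x in cup_x)   # every frame passes in isolation
assert not temporally_coherent(cup_x)        # the sequence does not
```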

Building systems that can detect these failures requires something close to full visual understanding. You need to reconstruct, at least implicitly, the physical scene that the video purports to depict - and then check whether the pixel-level rendering is consistent with that reconstruction.

This is not an incremental extension of existing RLHF pipelines. It is a different kind of problem entirely.

Why text alignment doesn't transfer

The current alignment stack for language models - RLHF, DPO, constitutional AI, and their variants - was designed for a domain with two key properties: discrete outputs and relatively cheap human evaluation. A human annotator can read two text responses and decide which is better in a few seconds. The preference signal is noisy but fast, and the output space is structured enough that reward models can generalize from a manageable number of comparisons.

Video breaks both assumptions.

Human evaluation of video quality is slow, expensive, and wildly inconsistent. Ask ten people whether a generated video looks “realistic” and you will get ten different answers, because they are each attending to different dimensions of realism. One notices the lighting. Another notices the motion. A third is distracted by an artifact that the others missed. The inter-annotator agreement for video quality assessment is far lower than for text, and it degrades further as the videos get longer or more complex.

More fundamentally, the reward signal for video is not a scalar. A video can be physically accurate but aesthetically poor. It can be visually stunning but temporally incoherent. It can be spatially consistent but semantically wrong - a photorealistic video of a dog that was supposed to be a cat. These dimensions of quality are partially independent, context-dependent, and often in tension. Collapsing them into a single preference ranking destroys the information that a training signal actually needs.
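A toy sketch makes the information loss concrete. The dimension names and the averaging rule below are illustrative assumptions, not any production reward model:

```python
from dataclasses import dataclass

# Illustrative only: dimension names and the uniform average are
# hypothetical, chosen to show what scalar collapse destroys.
@dataclass
class VideoReward:
    physical_accuracy: float   # 0..1
    temporal_coherence: float  # 0..1
    semantic_fidelity: float   # 0..1
    aesthetic_quality: float   # 0..1

    def scalar(self) -> float:
        # Collapsing to one number discards which dimension failed.
        return (self.physical_accuracy + self.temporal_coherence +
                self.semantic_fidelity + self.aesthetic_quality) / 4

# Physically accurate but semantically wrong (the dog that should be a cat):
a = VideoReward(0.9, 0.9, 0.1, 0.9)
# Semantically right but temporally incoherent:
b = VideoReward(0.9, 0.1, 0.9, 0.9)

# Both collapse to (numerically) the same scalar; the training signal can
# no longer distinguish two very different failure modes.
assert abs(a.scalar() - b.scalar()) < 1e-9
```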

This is why naive applications of RLHF to video models produce mediocre results. The reward model is trying to learn a scalar function over a space that requires a structured, multi-dimensional judgment. It's like trying to navigate a city with a compass that only points “generally good.”

Why VLM-based reward is not verifiable

The natural instinct is to use vision-language models as reward models. If a VLM can describe a video, surely it can judge one. This instinct is wrong - and dangerously so.

VLMs have blind spots and they hallucinate. They confidently describe objects that aren't present and miss artifacts that are obvious to a human viewer. They process video by sampling a handful of frames at fixed intervals, which means they can skip the exact frames where errors occur. A hand that sprouts a sixth finger for three frames at 30fps may never appear in the sampled keyframes. A physics violation that lasts 100 milliseconds is invisible to a model that samples at 2fps.
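A toy calculation shows how easily fixed-interval sampling misses brief events (the frame indices and sampling rate below are hypothetical):

```python
# Toy illustration: a 3-frame artifact in a 30fps clip versus a verifier
# that samples keyframes at 2fps (every 15th frame).
fps = 30
sample_rate = 2                     # frames per second the evaluator sees
stride = fps // sample_rate         # -> 15

total_frames = fps * 10             # a 10-second clip
sampled = set(range(0, total_frames, stride))

artifact = {100, 101, 102}          # sixth finger appears for 3 frames (100 ms)

# The artifact falls entirely between sampled keyframes.
print(sorted(sampled & artifact))   # -> []
```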

A reward signal you cannot verify is worse than no signal at all. It does not merely fail to guide training - it actively misguides it. The model learns to optimize for the VLM's blind spots, producing outputs that score well under a flawed evaluator while drifting further from actual correctness. You get videos that are confidently wrong in ways the reward model cannot see. This is reward hacking, but in a domain where the consequences are far harder to detect than in language.

The verifiability of the reward signal is not a nice-to-have. It is the foundation on which everything else rests. If you cannot check whether your evaluator is correct, you cannot know whether your model is improving.

Verifiable reward requires disciplinary expertise

Constructing a verifiable reward signal for video is not a pure ML problem. It requires domain expertise that most machine learning engineers simply do not have - knowledge drawn from physics, art, and cinematography.

Spotting physical flaws in generated video - impossible shadows, violated conservation of momentum, incorrect fluid dynamics, cloth that ignores gravity - requires trained intuition. A physicist sees a ball that decelerates wrong. A cinematographer sees lighting that is inconsistent with the implied camera setup. An animator sees motion that violates the principles of weight and follow-through. These judgments are fast, precise, and extremely difficult to formalize. They come from decades of calibrated visual experience, not from reading papers.

Film directors and cinematographers routinely catch errors that state-of-the-art automated systems miss entirely. They have spent careers developing a sense for what looks wrong and why it looks wrong - the kind of structured visual understanding that verification demands. A director does not just notice that a shot feels off. They can tell you it is because the eyeline is two degrees too high, or because the depth of field implies a focal length that contradicts the apparent lens distortion.

The people building the reward signal need to understand the domain as deeply as the people building the models. This is non-trivial for many teams, but it is essential. You cannot build a verifier for physical correctness without people who understand physics. You cannot build a verifier for visual coherence without people who understand how images are composed and perceived. The verification problem is, at its core, an interdisciplinary problem - and solving it requires assembling expertise that does not naturally cluster in ML research labs.

The convergence makes it urgent

If video generation and video understanding were separate fields with separate models, verification could remain an academic problem. But they are converging - rapidly and inevitably - into unified omni-models that both see and generate.

Every frontier lab is building in this direction. The architecture makes it natural: a single transformer backbone that can condition on video to produce text, or condition on text to produce video, or condition on video to produce video. Understanding and generation become two modes of the same model.

This convergence is the path to visual AGI. A system that can both perceive the world and imagine counterfactuals within it - that can watch a video of a physical process and then generate what would happen if a variable changed - is a system that has begun to build a world model. This is far more powerful, and far more dangerous, than either capability in isolation.

And it cannot be aligned without verification infrastructure that operates at the same level of sophistication as the models themselves.

You cannot align what you cannot evaluate. And we cannot yet evaluate video AI.

The language alignment community learned this lesson painfully: reward hacking, mode collapse, sycophancy, and other failure modes emerged because the reward models were cruder than the systems they were supposed to supervise. In video, we are starting from a much worse position. The verification gap is not a few years behind - it is a paradigm behind.

What a verification layer looks like

Solving this requires a new kind of infrastructure: expert-calibrated reward models that can evaluate video along multiple independent dimensions - physical plausibility, temporal coherence, semantic fidelity, aesthetic quality - and provide structured, dense training signals rather than scalar preferences.

It requires RL training environments purpose-built for video, where models can be trained against rich, multi-dimensional feedback that captures the actual structure of visual correctness.

And it requires doing this in partnership with the labs that are building the omni-models, not in isolation. The verification layer cannot be an afterthought bolted onto a training pipeline. It must be co-designed with the models it evaluates, because the space of failures is model-dependent and evolves as the models improve.

This is what we are building at Philo Labs. We believe verification is the hardest unsolved problem in AI alignment - not because language alignment is easy, but because visual alignment is fundamentally harder, less understood, and more urgent than the field has recognized.

The generation problem attracted billions of dollars and the best minds in the field. The verification problem has received a fraction of that attention. This is the gap we intend to close.