We handed a CVPR-winning AI a game it had never seen

NitroGen got a CVPR best-paper honorable mention for a neat trick: it plays video games from raw pixels. It looks at the screen, outputs gamepad actions, and uses no game-specific code. We had a fun question to fill a long flight. Our own game looks nothing like what NitroGen trained on, but it plays the same way: same twin-stick movement, same aim-and-shoot. Does a pixels-only skill carry over when only the looks change? Short answer: not for free. Zero-shot, it struggles. So we tried teaching it, with Claude as the tutor, and watched it nearly catch up in about seven minutes.

The setup

A game it had never seen, but should recognize

The test game is Arena Breaker: a 3D twin-stick survival arena. Move with the left stick, aim and auto-fire with the right, dash to dodge. Three stages, a handful of enemy types, some saws and spikes to avoid. It is exactly the kind of game NitroGen knows how to play. The only thing that is new is what it looks like. That is the whole point: if the model really learned to play, and did not just memorize specific pixels, this should be easy.

Arena Breaker, shown with an orbiting camera: the three stages, the six-enemy roster, and the saw and spike hazards — the actual Godot assets and JSON-defined stage data. Same twin-stick play as the games NitroGen trained on, just different pixels.

The cold start

Out of the box, the skill doesn't carry over

We dropped NitroGen into Arena Breaker cold. It takes the last rendered frame (256×256 pixels) and outputs a chunk of gamepad actions covering the next 18 frames. On our game it hasn't found its footing yet. It drifts, dashes at odd moments, and barely nudges the aim stick: the magnitude sits around 0.025, nowhere near the 0.3 it takes to fire. So it rarely shoots. Result: 0 kills, out in stage 1.
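To make the 0.025 versus 0.3 gap concrete, here is a minimal sketch of a fire check on the aim stick. The 0.3 threshold comes from the numbers above. The function names, the `>=` comparison, and the example axis values are our own assumptions for illustration, not the game's actual code.

```python
import math

FIRE_THRESHOLD = 0.3  # aim magnitude needed to fire, as quoted in the text


def stick_magnitude(x: float, y: float) -> float:
    """Euclidean length of a 2-D stick deflection."""
    return math.hypot(x, y)


def would_fire(x: float, y: float, threshold: float = FIRE_THRESHOLD) -> bool:
    """Assumed rule: auto-fire triggers once the aim stick passes the threshold."""
    return stick_magnitude(x, y) >= threshold


# A timid nudge (magnitude 0.025) versus a deliberate push (magnitude ~0.85).
# The axis values are made up to reproduce those magnitudes.
print(would_fire(0.02, 0.015))  # False
print(would_fire(0.6, 0.6))     # True
```

Under that rule, a model can be pointing roughly the right way and still never shoot, which matches what we saw.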

So the skill doesn't come for free. The kind of game was familiar, the pixels were not, and that gap was enough to throw it. Fair enough — could we teach it the rest before we landed?

Zero-shot NitroGen. It can see the arena fine. It just doesn't yet recognize this as a game it already knows how to play. 0 kills.

The tutor

We let Claude play, by cheating

To teach NitroGen, we needed someone to show it how. Enter Claude Opus 4.8. Here is the cheat: Claude does not look at the screen at all. It plays from a compact structured state, a JSON snapshot of the world each turn, and replies with a move vector, an aim vector, and a dash flag. This is everything it sees:

{
  "player": { "pos": {"x": -3.1, "z": 4.2}, "hp": 5, "dash_ready": true,
              "aim": {"x": 0.0, "z": 0.0} },
  "enemies": [                          // nearest first, with type + relative offset
    {"type": "swarmer", "dx": 1.8, "dz": -0.4, "dist": 1.8, "hp": 1},
    {"type": "shooter", "dx": -6.0, "dz": 2.1, "dist": 6.4, "hp": 2}
  ],
  "enemy_projectiles": [ {"dx": 2.0, "dz": 0.1, "vx": -8.0, "vz": 0.0} ],
  "hazards":           [ {"type": "saw", "dx": 3.4, "dz": 1.0, "radius": 0.85} ],
  "stage": 2, "wave": 3, "score": 210
}

This is the cheat in full: the game pauses each turn while Claude reads the exact world state and reasons about its next move, so it is not playing in real time. It does not have to be. Its only job is to produce clean demonstrations for the student to copy. And it does: reading exact positions, Claude plays like a calm expert, keeping its distance, kiting enemies, and lining up clean shots. We let it run and recorded every move: 17 episodes across all three stages, about 9 kills each. That is our entire training set, collected in the time it takes to watch a movie.
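The reply side of that turn loop can be sketched in a few lines. The text only says Claude answers with a move vector, an aim vector, and a dash flag. The JSON field names (`move`, `aim`, `dash`), the `Action` type, and the clamping to stick range below are assumptions for illustration, not the harness we actually ran.

```python
import json
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Action:
    move: Tuple[float, float]
    aim: Tuple[float, float]
    dash: bool


def clamp_unit(x: float, z: float) -> Tuple[float, float]:
    """Scale a vector down so its length never exceeds 1 (full stick deflection)."""
    length = math.hypot(x, z)
    if length <= 1.0:
        return (x, z)
    return (x / length, z / length)


def parse_reply(text: str) -> Action:
    """Turn one JSON reply into a gamepad-style action (field names are hypothetical)."""
    data = json.loads(text)
    move = clamp_unit(data["move"]["x"], data["move"]["z"])
    aim = clamp_unit(data["aim"]["x"], data["aim"]["z"])
    return Action(move=move, aim=aim, dash=bool(data["dash"]))


reply = '{"move": {"x": -1.0, "z": 0.0}, "aim": {"x": 3.0, "z": 4.0}, "dash": false}'
action = parse_reply(reply)
print(action)
```

Clamping is a design choice here: a language model can easily emit an aim vector like (3, 4), and a stick cannot go past full deflection, so the sketch normalizes it to (0.6, 0.8) before recording.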

The tutor at work. Claude across stages 1 to 3, driven by structured state rather than pixels. These clips are the recordings NitroGen learns from.

The bootcamp

Seven minutes of homework

A little background on the student first. NitroGen is a vision → gamepad foundation model: a vision encoder feeding an action diffusion transformer, pre-trained by behavior cloning on roughly 40,000 hours of gameplay video across 1,000+ games scraped off the internet. Its whole claim to fame is adaptation — the paper shows that post-training that base on a game it never saw beats training from scratch by up to 52% in the low-data regime. Arena Breaker is exactly that kind of unseen game, so we ran that same post-training, with one twist.

The paper's demonstrations come from human playthroughs. Ours come from Claude. That is what makes this policy distillation rather than plain imitation: a reasoning agent that reads privileged state, teaching a reactive model that only gets pixels. But the mechanics are just “show, don't tell”. We render each recorded moment to the 256×256-pixel frame NitroGen expects and pair it with the next 18 frames of Claude's gamepad actions. From 17 episodes that gives 3,276 small “what you see, what to do” examples.
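The pairing step can be sketched as a sliding window over one recorded episode. The chunk length of 18 is from the text. The function name, the stride of one frame, and the choice to start each chunk at the frame's own index are assumptions, so the sample counts here are not meant to reproduce the 3,276 figure.

```python
CHUNK = 18  # action-chunk length quoted in the text


def make_samples(frames, actions, chunk=CHUNK):
    """Pair frame t with the chunk of actions starting at t.

    Steps too close to the end of the episode to fill a whole chunk are skipped.
    """
    if len(frames) != len(actions):
        raise ValueError("need exactly one action per frame")
    samples = []
    for t in range(len(frames) - chunk + 1):
        samples.append((frames[t], actions[t:t + chunk]))
    return samples


# Toy episode of 20 steps; strings stand in for rendered frames and gamepad actions.
frames = [f"frame_{t}" for t in range(20)]
actions = [f"action_{t}" for t in range(20)]
samples = make_samples(frames, actions)
print(len(samples))  # 3
```

Each sample is one "what you see, what to do" pair: a single frame on the left, 18 consecutive actions on the right.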

We follow the paper's recipe where it counts. We drop the idle stretches — keeping only segments where at least half the frames carry real input, or the model just learns to stand still — and we train on HUD-less frames, hiding the on-screen controller so it can't cheat by reading the keymap. Then we freeze the part that sees (those features already generalize) and retrain only the part that acts, the Action DiT and action head: 127M of 494M parameters. Four passes over the data, a single A10G GPU, about 7 minutes. Roughly one in-flight snack.
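The idle filter is simple enough to sketch. The "at least half the frames carry real input" rule is from the text. The action layout (a dict with `move`, `aim`, `dash`) and the 0.05 deadzone are hypothetical stand-ins, since what counts as "real input" is not spelled out here.

```python
def is_active(action, deadzone=0.05):
    """True if this step carries real input (deadzone value is an assumption)."""
    mx, mz = action["move"]
    ax, az = action["aim"]
    biggest_axis = max(abs(mx), abs(mz), abs(ax), abs(az))
    return bool(action["dash"]) or biggest_axis > deadzone


def keep_segment(actions, min_active_fraction=0.5):
    """Keep a segment only if at least half of its steps are active."""
    if not actions:
        return False
    active = sum(1 for a in actions if is_active(a))
    return active / len(actions) >= min_active_fraction


idle = {"move": (0.0, 0.0), "aim": (0.0, 0.0), "dash": False}
busy = {"move": (1.0, 0.0), "aim": (0.0, 0.9), "dash": False}

print(keep_segment([busy] * 9 + [idle] * 9))   # True: exactly half active
print(keep_segment([busy] * 8 + [idle] * 10))  # False: mostly standing still
```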

base model:  NitroGen — behavior-cloned on ~40,000 h of video, 1,000+ games
adaptation:  post-train on ONE unseen game  (paper: up to +52% vs from-scratch)
demos:       Claude Opus 4.8 (reasoning agent)  →  distilled into pixels→gamepad
data:        17 episodes → 3,276 samples (256px frame + 18-action chunk), idle-filtered
trainable:   127M / 494M params  (Action DiT + action head; vision encoder frozen)
recipe:      behavior cloning, 4 epochs, ~7 min on 1× A10G
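A quick back-of-the-envelope on those numbers, as a sketch. Every input is quoted above, but "about 7 minutes" is approximate, so the throughput line is a rough estimate and not a measurement.

```python
# Numbers quoted in the summary above.
TRAINABLE_PARAMS_M = 127
TOTAL_PARAMS_M = 494
EPISODES = 17
SAMPLES = 3276
EPOCHS = 4
TRAIN_MINUTES = 7  # "about 7 minutes", so treat derived rates as rough

trainable_fraction = TRAINABLE_PARAMS_M / TOTAL_PARAMS_M
samples_per_episode = SAMPLES / EPISODES
samples_seen = SAMPLES * EPOCHS
samples_per_second = samples_seen / (TRAIN_MINUTES * 60)

print(f"trainable share:     {trainable_fraction:.1%}")     # 25.7%
print(f"samples per episode: {samples_per_episode:.1f}")    # 192.7
print(f"samples seen:        {samples_seen}")               # 13104
print(f"throughput:          {samples_per_second:.1f}/s")   # 31.2/s
```

So roughly a quarter of the model is being retrained, on about 193 samples per episode.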

The payoff

It learned to fight

Same pixels, new weights, and suddenly it is aiming and shooting. The aim stick goes from a dead 0.025 to a confident 0.86, and the kills stop being zero. The skill did not carry over for free, but it carried over fast.

metric	zero-shot	fine-tuned
avg kills	0.0	2.7
aim-stick magnitude	0.025	0.86
avg score	10	37
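Read as ratios, the table works out as follows. This is only arithmetic on the figures above. The `fold_change` helper is ours, and it declines to report a ratio for kills because the zero-shot baseline is zero.

```python
def fold_change(before, after):
    """Return after / before, or None when the baseline is zero."""
    if before == 0:
        return None
    return after / before


# (zero-shot, fine-tuned) pairs copied from the table.
results = {
    "avg kills": (0.0, 2.7),
    "aim-stick magnitude": (0.025, 0.86),
    "avg score": (10, 37),
}

for metric, (zero_shot, fine_tuned) in results.items():
    ratio = fold_change(zero_shot, fine_tuned)
    if ratio is None:
        print(f"{metric}: +{fine_tuned - zero_shot:.1f} (no ratio, baseline is zero)")
    else:
        print(f"{metric}: {ratio:.1f}x")
```

That puts the aim stick at roughly 34 times its zero-shot magnitude and the average score at 3.7 times.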

Side by side: zero-shot (left) vs fine-tuned (right). Same model, same pixels. The right one learned to fight from Claude's example.

Then the student did something the teacher never could: it walked straight into a bug we did not know we had. Point-blank shots were quietly passing through overlapping enemies. Claude never noticed, because it keeps its distance and so never fired at point-blank range. But NitroGen gets swarmed, so it found the bug right away. Our model basically filed our first bug report.

We fixed the collision and re-ran the same fine-tuned weights, with no extra training. Average kills jumped from 2.7 to 6.8, and it cleared the first wave every run. Claude, the tutor, averages 7.9. So a model that had never seen this game, taught in seven minutes on a plane, now lands about 86% of its teacher's kills. And unlike its teacher, it plays in real time, straight from pixels, with no pausing to read the world state.

And the habits transfer. Drop it into a tougher, darker arena it was never evaluated on, and it still strafes, fires, and dashes away from incoming shots before it is eventually overwhelmed.

Fine-tuned NitroGen on Stage 2, a tougher, darker arena. Watch the overlay: it repeatedly slams SPACE to dash clear of enemy fire (16 dodges in this run) while keeping the aim stick on target. The learned habits hold up on a level it was never evaluated on.

The view from the window

Earth Online, render distance: forever

We did all of this somewhere over Arizona, watching the desert scroll past the window at 35,000 feet. After a few hours of teaching one model to see a rendered world, the one out the window started to feel like a render too: an open-world map streaming past, sun and shadows updating in real time, the best graphics anyone has ever shipped. We are all just NPCs grinding the long flight in Earth Online, the multiplayer RPG nobody can quit. Our little student learned to play its game in seven minutes. We are still figuring out ours.

Arizona rendering past the plane window at 35,000 feet.

A flight-length experiment, start to finish on one plane ride. The lesson we took home: a pixels-only skill does not transfer to a new-looking game for free, but it does not have to. A reasoning agent that can read the world state plays well enough to be a teacher, and a few minutes of its example is enough to hand that skill to a model that runs in real time from pixels alone.