world models are learning to see

27 Sep, 2025

ChatGPT Image Sept 27 2025 from Reinforcement Learning in Art

video generation was supposed to be about entertainment. making movies. creating content.

turns out we accidentally built something else entirely.

the accidental breakthrough

veo3 can solve mazes. segment objects. understand physics. simulate tool use. complete visual analogies. none of this was in the training objective. the model learned to see by learning to generate.

this shouldn't surprise us, yet it does. when you train a model to predict the next frame of video, you force it to understand:

how objects move through space
what happens when things collide
how materials behave under stress
which transformations preserve identity
what affordances objects provide

generation requires understanding. you can't fake physics at 30 frames per second.

chain-of-frames: thinking through time

language models think in tokens. step by step. each token building on the last.

video models think in frames. moment by moment. each frame constraining the next.

text reasoning:    token₁ → token₂ → token₃ → answer
visual reasoning:  frame₁ → frame₂ → frame₃ → solution

but frames carry more information than tokens. exponentially more.

dimension	tokens	frames
information density	~10 bits	~10⁶ bits
causal constraints	grammatical	physical
verification	semantic consistency	laws of physics
search space	discrete symbols	continuous reality

when a video model generates frames solving a maze, it's not following instructions. it's simulating the solution. the difference matters.

visualization changes everything

the mind's eye paper reveals something profound. when llms generate visual representations of problems, their spatial reasoning improves. not marginally. dramatically.

they're not just thinking about space. they're thinking through space.

problem → visualization → reasoning → solution
           ↑                          ↓
           └──────────────────────────┘

this loop exists in humans. we visualize to understand. now it exists in machines.

but here's the twist: video models don't need to be prompted to visualize. visualization is their native mode. they think in visual sequences by default.

three types of understanding

symbolic understanding (llms)
manipulates tokens. follows rules. produces text that describes understanding without possessing it.

grounded understanding (rl agents)
learns through interaction. builds causal models. understands consequences but only within experienced domains.

visual understanding (video models)
simulates possibilities. explores counterfactuals. understands through generation of plausible futures.

each has limits:

symbolic systems can discuss things they've never experienced
grounded systems understand only what they've done
visual systems understand what they can imagine

the breakthrough comes from combination.

why generation enables reasoning

to generate realistic video, you must model:

physics constraints

momentum conservation
gravity effects
collision dynamics
material properties
fluid behavior

semantic coherence

object permanence
identity preservation
causal relationships
temporal consistency
affordance recognition

spatial relationships

depth ordering
occlusion handling
perspective transforms
lighting consistency
shadow physics

you can't memorize this. the combinatorial space is too large. you must learn the underlying rules.

generation forces understanding in ways recognition never could.

the implicit world model

every video model contains a world model. implicit. distributed. never declared but always present.

when veo3 generates frames of water filling a glass: this knowledge wasn't labeled. wasn't supervised. emerged from predicting pixels.

it knows water flows down
it knows glass is transparent
it knows water takes container shape
it knows the level rises continuously
it knows bubbles might form

emergent capabilities cascade

the pattern repeats across modalities:

model type	trained on	emerges with
language models	next token	reasoning, translation, coding
video models	next frame	physics, planning, tool use
audio models	next sample	harmony, rhythm, structure
protein models	next residue	folding, function, interaction

scale + self-supervision + generation = understanding

the recipe works everywhere.

verification through generation

video models solve a fundamental problem: verification without interaction.

rl agents must try actions to learn consequences. expensive. dangerous. slow.

video models simulate consequences. generate outcomes. explore counterfactuals. all without touching reality.

imagination loop:
state → action → generated_outcome → evaluation
  ↑                                      ↓
  └──────────────────────────────────────┘

this loop runs at inference speed. thousands of possibilities explored per second.

alphago imagined millions of games. video models will imagine millions of futures.

the convergence

three paths to intelligence converging:

path 1: language (symbolic reasoning)
broad but shallow. knows about everything. understands nothing deeply.

path 2: interaction (embodied learning)
narrow but deep. understands specific domains. can't transfer broadly.

path 3: visualization (simulated experience)
broad and deep. imagines possibilities. grounds understanding without interaction.

the union transcends the parts:

language provides goals and context
visualization explores possibilities
interaction validates and refines

what breaks next

robotics
robots that imagine consequences before acting. preview futures. select optimal paths. no training required.

science
models that visualize molecular interactions. predict reactions. design materials. all through generation.

reasoning
systems that think by simulating. solve problems by imagining solutions. debug by visualizing failure modes.

planning
agents that generate possible futures. evaluate outcomes. optimize paths. all without real-world trials.

video understanding isn't auxiliary to intelligence. it's fundamental.

the deeper insight

we've been thinking about world models wrong.

world models aren't databases of facts. aren't collections of rules. aren't symbolic representations.

world models are generators. they produce possible worlds. create plausible futures. simulate realities.

understanding means being able to imagine accurately.

veo3 demonstrates this. it understands physics not because it memorized equations but because it can generate physically plausible futures. it solves mazes not through search algorithms but through visual simulation.

build for imagination

the implications cascade:

architecture design
optimize for generation quality, not recognition accuracy. generation subsumes recognition.

training objectives
predict futures, not labels. simulation teaches more than classification.

evaluation metrics
measure plausibility of generated worlds, not accuracy on benchmarks.

data requirements
video contains more information than text. every frame teaches physics. every sequence teaches causality.

the timeline accelerates

2025: video models solve increasingly complex reasoning tasks through pure generation

2026: robotics revolutionized by models that preview actions before execution

2027: scientific discovery accelerates through molecular simulation at scale

2028: hybrid systems combining language, vision, and interaction achieve general intelligence

2029: reality simulation indistinguishable from observation

the boundaries dissolve. language models learn to see. video models learn to reason. interactive agents learn to imagine.

what everyone's missing

the race isn't for better language models or better rl agents.

the race is for better world simulators.

whoever builds the most accurate generator of possible worlds wins. not because generation is the goal. because generation is understanding.

veo3 isn't just predicting pixels. it's learning the source code of reality. frame by frame. moment by moment.

when ai can imagine as accurately as it can observe, the distinction between intelligence and simulation disappears.

video models aren't learning to generate. they're learning to see.

seeing isn't passive observation. it's active simulation of possible worlds.

the future belongs to systems that can imagine it.