world models are learning to see
video generation was supposed to be about entertainment. making movies. creating content.
turns out we accidentally built something else entirely.
the accidental breakthrough
veo3 can solve mazes. segment objects. understand physics. simulate tool use. complete visual analogies. none of this was in the training objective. the model learned to see by learning to generate.
this shouldn't surprise us, yet it does. when you train a model to predict the next frame of video, you force it to understand:
- how objects move through space
- what happens when things collide
- how materials behave under stress
- which transformations preserve identity
- what affordances objects provide
generation requires understanding. you can't fake physics at 30 frames per second.
chain-of-frames: thinking through time
language models think in tokens. step by step. each token building on the last.
video models think in frames. moment by moment. each frame constraining the next.
text reasoning: token₁ → token₂ → token₃ → answer
visual reasoning: frame₁ → frame₂ → frame₃ → solution
but frames carry far more information than tokens. orders of magnitude more.
dimension | tokens | frames |
---|---|---|
information density | ~10 bits | ~10⁶ bits |
causal constraints | grammatical | physical |
verification | semantic consistency | laws of physics |
search space | discrete symbols | continuous reality |
when a video model generates frames solving a maze, it's not following instructions. it's simulating the solution. the difference matters.
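here's what chain-of-frames looks like as code. a minimal sketch: `predict_next_frame` is a hypothetical stand-in for a model like veo3, and the toy dynamics just walk a bright pixel toward the exit. the control flow is the point: generate a frame, condition on everything generated so far, generate the next.

```python
import numpy as np

def predict_next_frame(frames: list[np.ndarray]) -> np.ndarray:
    """stand-in dynamics: shift the bright pixel one column right.
    a real video model would generate a full frame conditioned on
    all previous frames; only the interface matters here."""
    frame = frames[-1].copy()
    r, c = np.unravel_index(frame.argmax(), frame.shape)
    frame[r, c] = 0.0
    frame[r, min(c + 1, frame.shape[1] - 1)] = 1.0
    return frame

# initial frame: walker at the left edge of an 8x8 grid
frame0 = np.zeros((8, 8))
frame0[4, 0] = 1.0

frames = [frame0]
for _ in range(7):                      # autoregressive rollout, frame by frame
    frames.append(predict_next_frame(frames))

# the "solution" is read off the final frame, not computed symbolically
print(np.unravel_index(frames[-1].argmax(), frames[-1].shape))  # (4, 7)
```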
visualization changes everything
the mind's eye paper reveals something profound. when llms generate visual representations of problems, their spatial reasoning improves. not marginally. dramatically.
they're not just thinking about space. they're thinking through space.
problem → visualization → reasoning → solution
   ↑                                      │
   └──────────────────────────────────────┘
this loop exists in humans. we visualize to understand. now it exists in machines.
but here's the twist: video models don't need to be prompted to visualize. visualization is their native mode. they think in visual sequences by default.
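a toy version of that loop, with no model involved: render the problem as a grid first, then reason over the rendering instead of the raw coordinates. the intermediate visual representation does the work. (in the mind's eye setup, the llm produces and consumes this kind of representation itself.)

```python
# scene described symbolically: object -> (row, col)
problem = {"cup": (1, 4), "book": (1, 1), "lamp": (3, 2)}

# visualization: rasterize the description into a spatial grid
grid = [["." for _ in range(6)] for _ in range(5)]
for name, (r, c) in problem.items():
    grid[r][c] = name[0]                # first letter marks the object
print("\n".join(" ".join(row) for row in grid))

# reasoning: answer "is the book left of the cup?" by scanning the grid,
# not by comparing coordinates
row = grid[1]
print(row.index("b") < row.index("c"))  # True: book appears before cup
```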
three types of understanding
symbolic understanding (llms)
manipulates tokens. follows rules. produces text that describes understanding without possessing it.
grounded understanding (rl agents)
learns through interaction. builds causal models. understands consequences but only within experienced domains.
visual understanding (video models)
simulates possibilities. explores counterfactuals. understands through generation of plausible futures.
each has limits:
- symbolic systems can discuss things they've never experienced
- grounded systems understand only what they've done
- visual systems understand only what they can imagine
the breakthrough comes from combination.
why generation enables reasoning
to generate realistic video, you must model:
physics constraints
- momentum conservation
- gravity effects
- collision dynamics
- material properties
- fluid behavior
semantic coherence
- object permanence
- identity preservation
- causal relationships
- temporal consistency
- affordance recognition
spatial relationships
- depth ordering
- occlusion handling
- perspective transforms
- lighting consistency
- shadow physics
you can't memorize this. the combinatorial space is too large. you must learn the underlying rules.
generation forces understanding in ways recognition never could.
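one way to make that claim testable: check generated trajectories against physical invariants. a minimal sketch, assuming object positions have already been tracked across the generated frames (the tracker is omitted); it verifies a single rule, constant acceleration under gravity.

```python
import numpy as np

# positions of a falling object across generated frames, 1/30 s apart.
# synthetic free-fall data stands in for tracked pixel positions.
dt = 1.0 / 30.0
t = np.arange(10) * dt
y = 2.0 - 0.5 * 9.81 * t**2              # height in meters per frame

accel = np.diff(y, n=2) / dt**2           # second difference ≈ acceleration
print(accel.round(2))                     # ≈ -9.81 in every frame gap
print(np.allclose(accel, -9.81, atol=0.5))  # True: trajectory obeys gravity
```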
the implicit world model
every video model contains a world model. implicit. distributed. never declared but always present.
when veo3 generates frames of water filling a glass:
- it knows water flows down
- it knows glass is transparent
- it knows water takes the shape of its container
- it knows the level rises continuously
- it knows bubbles might form
this knowledge wasn't labeled. wasn't supervised. it emerged from predicting pixels.
emergent capabilities cascade
the pattern repeats across modalities:
model type | trained on | emerges with |
---|---|---|
language models | next token | reasoning, translation, coding |
video models | next frame | physics, planning, tool use |
audio models | next sample | harmony, rhythm, structure |
protein models | next residue | folding, function, interaction |
scale + self-supervision + generation = understanding
the recipe works everywhere.
verification through generation
video models solve a fundamental problem: verification without interaction.
rl agents must try actions to learn consequences. expensive. dangerous. slow.
video models simulate consequences. generate outcomes. explore counterfactuals. all without touching reality.
imagination loop:
state → action → generated_outcome → evaluation
  ↑                                        │
  └────────────────────────────────────────┘
this loop runs at inference speed. thousands of possibilities explored per second.
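here's the loop as code: random-shooting model-predictive control, with a toy point-mass simulator standing in for a learned video model. `rollout`, the goal, and the action space are all assumptions for illustration; a real system would generate outcome frames instead of states.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(state: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """imagine the final state after a sequence of actions.
    trivial dynamics: position += action at every step."""
    for a in actions:
        state = state + a
    return state

goal = np.array([2.0, 1.0])
state = np.zeros(2)

# imagine many futures, keep the one whose outcome lands nearest the goal
candidates = rng.uniform(-1, 1, size=(1000, 8, 2))   # 1000 plans, 8 steps each
outcomes = np.array([rollout(state, plan) for plan in candidates])
scores = np.linalg.norm(outcomes - goal, axis=1)
best = candidates[scores.argmin()]

print(rollout(state, best).round(2))    # ≈ [2. 1.], chosen without acting once
```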
alphago imagined millions of games. video models will imagine millions of futures.
the convergence
three paths to intelligence converging:
path 1: language (symbolic reasoning)
broad but shallow. knows about everything. understands nothing deeply.
path 2: interaction (embodied learning)
narrow but deep. understands specific domains. can't transfer broadly.
path 3: visualization (simulated experience)
broad and deep. imagines possibilities. grounds understanding without interaction.
the union transcends the parts:
- language provides goals and context
- visualization explores possibilities
- interaction validates and refines
what breaks next
robotics
robots that imagine consequences before acting. preview futures. select optimal paths. no real-world trial and error required.
science
models that visualize molecular interactions. predict reactions. design materials. all through generation.
reasoning
systems that think by simulating. solve problems by imagining solutions. debug by visualizing failure modes.
planning
agents that generate possible futures. evaluate outcomes. optimize paths. all without real-world trials.
video understanding isn't auxiliary to intelligence. it's fundamental.
the deeper insight
we've been thinking about world models wrong.
world models aren't databases of facts. aren't collections of rules. aren't symbolic representations.
world models are generators. they produce possible worlds. create plausible futures. simulate realities.
understanding means being able to imagine accurately.
veo3 demonstrates this. it understands physics not because it memorized equations but because it can generate physically plausible futures. it solves mazes not through search algorithms but through visual simulation.
build for imagination
the implications cascade:
architecture design
optimize for generation quality, not recognition accuracy. generation subsumes recognition.
training objectives
predict futures, not labels. simulation teaches more than classification.
evaluation metrics
measure plausibility of generated worlds, not accuracy on benchmarks.
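one existing metric in this spirit is the fréchet video distance: compare the feature statistics of real and generated clips instead of checking labels. a minimal sketch with placeholder features; in practice the embeddings would come from a pretrained video encoder.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet(real: np.ndarray, fake: np.ndarray) -> float:
    """fréchet distance between two gaussian fits of clip embeddings."""
    mu_r, mu_f = real.mean(0), fake.mean(0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f).real   # matrix square root of the product
    return float(((mu_r - mu_f) ** 2).sum()
                 + np.trace(cov_r + cov_f - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(256, 16))    # placeholder embeddings
close = rng.normal(0.1, 1.0, size=(256, 16))   # plausible generations
far = rng.normal(2.0, 3.0, size=(256, 16))     # implausible generations

print(frechet(real, close) < frechet(real, far))  # True: plausibility ranks
```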
data requirements
video contains more information than text. every frame teaches physics. every sequence teaches causality.
the timeline accelerates
2025: video models solve increasingly complex reasoning tasks through pure generation
2026: robotics revolutionized by models that preview actions before execution
2027: scientific discovery accelerates through molecular simulation at scale
2028: hybrid systems combining language, vision, and interaction achieve general intelligence
2029: reality simulation indistinguishable from observation
the boundaries dissolve. language models learn to see. video models learn to reason. interactive agents learn to imagine.
what everyone's missing
the race isn't for better language models or better rl agents.
the race is for better world simulators.
whoever builds the most accurate generator of possible worlds wins. not because generation is the goal. because generation is understanding.
veo3 isn't just predicting pixels. it's learning the source code of reality. frame by frame. moment by moment.
when ai can imagine as accurately as it can observe, the distinction between intelligence and simulation disappears.
video models aren't learning to generate. they're learning to see.
seeing isn't passive observation. it's active simulation of possible worlds.
the future belongs to systems that can imagine it.