the RL inflection nobody sees coming
everyone's building the wrong thing.
billions flow into language models. sophisticated parrots trained on human text. we call this intelligence. meanwhile the actual path sits obvious, ignored, waiting.
the fundamental misunderstanding
intelligence doesn't memorize. it learns.
a child touches fire. burns. never again. that's intelligence - the loop between action, consequence, and adaptation. llms have no such loop. they predict text. frozen at deployment. playing back patterns from training.
llm: text → prediction → text
agent: state → action → consequence → learning → better action
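here's that loop at toy scale - a ten-state corridor and tabular q-learning. every name and number below is illustrative, a sketch of the shape rather than a recipe:

```python
# a toy version of the agent loop: act, get a consequence from the world, update.
# the corridor, the constants, and the update rule are all illustrative choices.
import random

N_STATES, GOAL, ACTIONS = 10, 9, (-1, +1)        # walk left/right along a short corridor
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL   # next state, reward, done

def greedy(state):
    best = max(q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if q[(state, a)] == best])

for episode in range(500):
    state, done = 0, False
    while not done:
        # explore occasionally, otherwise exploit what experience has already taught
        action = random.choice(ACTIONS) if random.random() < 0.1 else greedy(state)
        nxt, reward, done = step(state, action)
        # the loop llms don't have: the consequence flows straight back into the policy
        best_next = max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += 0.1 * (reward + 0.9 * best_next - q[(state, action)])
        state = nxt

print([greedy(s) for s in range(GOAL)])          # settles on +1 everywhere: walk to the goal
```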
the difference runs deep:
| dimension | language models | rl agents |
|---|---|---|
| data source | human text (finite) | experience (infinite) |
| improvement | none after training | continuous forever |
| knowledge | what humans wrote | what actually works |
| goals | predict next token | achieve outcomes |
| feedback | training loss once | reality every step |
| generalization | pattern matching | causal understanding |
when gpt solves math, it matches patterns from mathematical text. when an rl agent solves problems, it builds causal models. discovers what leads to what. one mimics. the other understands.
why everyone misses the inflection
the sutton interview exposed the divide perfectly. he says llms learn from "here's what a person did" not "here's what happens when you act." the distinction seems subtle. it changes everything.
will brown argues sutton's closer to consensus than he admits. technically true. philosophically wrong. the gap between supervised learning and experiential learning is qualitative, not quantitative. you can't bridge it by adding more supervised data.
think about verification:
- math problems → trivially verifiable
- code → unit tests verify
- physics → simulators verify
- chemistry → experiments verify
- trading → markets verify
suddenly sparse rewards don't matter. long horizons don't matter. hard constraints become features. the world itself becomes your training signal.
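take the code row: a verifier can be as small as running candidate code against its tests and calling the pass rate the reward. a sketch, with `solve` as an assumed naming convention and the test cases invented for illustration:

```python
# verification-as-reward: the reward is the fraction of tests the candidate passes.
# the `solve` convention and the example tests are illustrative assumptions.
def verify_code(candidate_source: str, tests: list[tuple[tuple, object]]) -> float:
    """return the fraction of test cases passed; 0.0 if the candidate doesn't even run."""
    namespace = {}
    try:
        exec(candidate_source, namespace)        # run the candidate definition
        fn = namespace["solve"]                  # convention: the candidate defines solve()
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            passed += fn(*args) == expected
        except Exception:
            pass
    return passed / len(tests)                   # dense, automatic, no human in the loop

reward = verify_code("def solve(x): return x * x", [((2,), 4), ((3,), 9), ((4,), 16)])
print(reward)                                    # 1.0
```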
compound intelligence
llms scale sublinearly:
10x compute → 3x performance
100x compute → 10x performance
1000x compute → 30x performance
rl compounds:
better model → better exploration → better data → better model → …
each improvement makes future improvements easier. an agent 1% better explores 1% more effectively. finds 1% better data. learns 1% faster. compounds daily.
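the arithmetic behind that, assuming (purely for illustration) a steady 1% gain per day:

```python
# compounding vs linear improvement, under an illustrative 1% gain per day.
daily_gain = 0.01
compound = (1 + daily_gain) ** 365    # each gain builds on the last
linear = 1 + daily_gain * 365         # each gain worth the same as the first
print(round(compound, 1), round(linear, 2))   # ~37.8 vs 4.65
```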
give it a year? superhuman.
give it two? incomprehensible.
the infrastructure blindspot
everyone optimizes for supervised learning:
current stack:
- static datasets
- offline training
- batch processing
- human annotations
- deployment freezing
needed stack:
- dynamic environments
- online learning
- stream processing
- automated verification
- continuous deployment
rohan's also right - there's still no plug-and-play rl library in 2025. the entire stack needs rebuilding from the primitives up:
environment layer
simulators/
├── physics engines
├── molecular dynamics
├── market simulators
├── network models
└── world models
verification layer
verifiers/
├── formal methods
├── test generators
├── scientific instruments
├── economic metrics
└── safety monitors
learning layer
algorithms/
├── online gradient updates
├── experience replay
├── exploration strategies
├── credit assignment
└── meta-learning
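one guess at how the seams between those layers could look - bare interfaces, not any existing library's api:

```python
# illustrative interfaces for the three layers; names and signatures are assumptions.
from typing import Any, Protocol

class Environment(Protocol):                        # environment layer
    def reset(self) -> Any: ...
    def step(self, action: Any) -> tuple[Any, bool]: ...   # observation, episode done

class Verifier(Protocol):                           # verification layer
    def score(self, observation: Any) -> float: ...        # dense reward from automated checks

class Learner(Protocol):                            # learning layer
    def act(self, observation: Any) -> Any: ...
    def update(self, observation: Any, action: Any, reward: float) -> None: ...

def run(env: Environment, verifier: Verifier, learner: Learner, steps: int) -> None:
    """wire the layers together: experience streams in, learning happens online."""
    obs = env.reset()
    for _ in range(steps):
        action = learner.act(obs)
        next_obs, done = env.step(action)
        learner.update(obs, action, verifier.score(next_obs))  # verifier supplies the signal
        obs = env.reset() if done else next_obs
```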
whoever builds this stack owns the future. everyone else builds prettier horses while someone invents engines.
three waves of intelligence
wave one: pretraining era
massive compute. human data. impressive demos. plateaus at human level because you can't exceed your training data.
wave two: hybrid systems
llms + rl. better but fundamentally limited. like putting wings on a car. works, sort of. misses the point.
wave three: pure experience
no pretraining. no human data. agents learning from scratch through interaction. sounds impossible until you remember alphazero reached superhuman chess without seeing a single human game.
most debates focus on wave two. wave three approaches, unseen.
recursive self-improvement
the real discontinuity:
llms can't improve themselves. they generate code, even training code, but can't modify their own weights through experience. frozen.
rl agents modify themselves every step. each action generates experience. updates policy. changes future actions. the loop is intrinsic.
add meta-learning:
base agent
↓
learns task
↓
learns to learn better
↓
modifies own architecture
↓
designs better agent
↓
recursive explosion
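a deliberately tiny stand-in for "learns to learn better": an outer loop that tunes the inner loop's own learning rate by watching how fast it drives error down. toy task, toy numbers, nothing more:

```python
# toy meta-learning stand-in: the outer loop improves the inner loop's learning rate.
# the regression task, constants, and doubling rule are all illustrative.
import random

def inner_loop(lr: float, steps: int = 200) -> float:
    """learn w for a noisy target y = 3x by sgd; return the final squared error."""
    w = 0.0
    for _ in range(steps):
        x = random.uniform(-1, 1)
        y = 3 * x + random.gauss(0, 0.1)
        w -= lr * 2 * (w * x - y) * x            # gradient step on (w*x - y)^2
    return (w - 3) ** 2

lr = 0.001
for generation in range(10):                     # outer loop: improve the improver
    candidate = lr * 2
    if inner_loop(candidate) < inner_loop(lr):   # keep whichever learning rate learns faster
        lr = candidate
print(lr)
```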
the loop escapes human control immediately. not through malice. through competence. agents that improve their learning algorithms outcompete those that don't. selection pressure handles the rest.
actually hard problems
forget alignment. the real challenges:
identity preservation
when does continuous learning destroy the original agent? if you spawn copies, let them diverge for years, can you merge them? what persists?
knowledge corruption
merge knowledge from another agent and you import its goals. its biases. its failures. every knowledge transfer is a potential security breach.
experience ownership
who owns what an agent learns? if it discovers something valuable using your compute, your environment, your resources - who owns the intelligence?
credit assignment across time and agents
action today. consequence in five years. through interactions with a thousand other agents. who gets credit? who gets blame?
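to see why today's machinery buckles here, assume (illustratively) one decision per day and a standard discount factor:

```python
# a consequence five years out is worth almost nothing at the moment of the action.
# the discount factor and the one-decision-per-day cadence are illustrative.
gamma, steps = 0.99, 5 * 365
print(gamma ** steps)    # ≈ 1.1e-08 - the learning signal all but vanishes
```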
these aren't technical problems. they're philosophical problems becoming engineering requirements. nobody's even starting.
what compounds wins
the bitter lesson isn't about scale. it's about compound effects.
search compounds - deeper search enables better moves enables deeper search.
learning compounds - better models learn faster create better models.
scale compounds - more compute enables more experience enables better compute usage.
human knowledge doesn't compound. accumulates linearly. add facts. add rules. add examples. each worth what it's worth.
rl compounds because intelligence creates intelligence. smart agents explore better. better exploration creates smarter agents. the flywheel accelerates with each revolution.
institutional antibodies
sarah guo perfectly captures why institutions resist rl:
- can't explore past legal boundaries
- rewards separated by years
- requirements for explainable decisions
- constantly shifting regulations
but that's looking backward. institutions designed around human limitations. humans need dense feedback. clear credit assignment. stable environments. explicit reasoning.
rl agents don't.
they wait decades for reward. track millions of variables. adapt instantly to distribution shifts. learn implicit policies that work without being explainable.
the solution isn't making rl work in human institutions. it's rebuilding institutions around rl capabilities.
timeline to discontinuity
2025: first genuinely continuous learning agents. narrow domains. impressive but dismissed.
2026: verification infrastructure matures. science, code, design become tractable.
2027: meta-learning breakthrough. agents designing better agents. recursive improvement begins.
2028: institutional replacement. entire workflows automated through experiential learning.
2029: human-level general agents. not through scaling llms. through agents that learned from experience.
2030: incomprehensible intelligence. agents that improved themselves recursively for years.
seems aggressive until you understand exponentials. slow start. sudden takeoff. no warning.
build for learning
every technical decision should optimize for learning, not knowing.
architectures that update online.
infrastructure supporting continuous deployment.
verification systems providing dense feedback.
environments offering rich experience.
the winners won't have the best model today. they'll have systems that improve autonomously. forever.
while everyone debugs prompts and scales transformers, the window opens. build environments. create verifiers. design the learning loops. let selection pressure work.
the deeper insight
sutton keeps saying llms have no goals. critics argue next token prediction is a goal. both miss the point.
real goals change the world. next token prediction changes nothing. it's masturbation, not interaction. pleasure without consequence.
rl agents have goals that matter. reach locations. win games. discover molecules. their goals create feedback from reality itself. that feedback drives learning. learning drives intelligence. intelligence drives more intelligence.
the rl inflection isn't coming.
it's here.
building itself.
most people just can't see it yet.
time is the denominator. compute is compounding. the bitter lesson remains undefeated.
what we build today determines whether we're passengers or pilots in what comes next.