residue

the cost of knowing


i've spent the last year building AI evaluation systems. the work is unglamorous—designing test suites, measuring agent behavior, figuring out what "correct" means for tasks that don't have clean answers. most of my time goes into a problem that sounds boring until you realize it's the whole game: how do you know if AI did something right?

everyone's focused on capability. METR's time horizon research shows agents completing four-hour tasks, up from nine minutes two years ago. doubling every seven months. the extrapolations write themselves. week-long autonomous work by 2028. month-long projects shortly after.

almost nobody's watching what happens to verification.

verification has its own scaling curve. it grows faster than task length. superlinearly. at some point the cost of knowing whether work was done right approaches the cost of doing the work. when that happens, capability stops mattering.

that crossing is closer than most forecasts assume.


two scores, one task

METR's headline—four hours, doubling every seven months—dominates the discourse. the paper has a second finding that matters more.

they scored outputs two ways. algorithmic: automated checks like compilation, unit tests, format validation. holistic: human contractors evaluating whether work is actually usable. could you merge this code. does this analysis hold up. would this solution work.

short tasks: scores converge. an agent writing a function that passes tests generally produces something a reviewer accepts.

longer tasks: scores diverge. agents clear every automated gate but fail human review. formatting misses conventions. test coverage exists but skips cases that matter. documentation explains the wrong things. code works but you wouldn't ship it.

the gap grows systematically with duration. not noise. pattern.

usual interpretation: benchmarks overestimate capability. true. but incomplete. why does the gap grow? what's the underlying dynamic?

the answer lives in what verification actually costs.


context has weight

a one-line change takes seconds to verify. read the diff, confirm it matches the commit message, run a test. context fits in working memory.

a function addition takes minutes. understand what it should do, trace the logic, check edge cases, verify integration. context expands to the module's purpose and caller expectations.

a multi-file refactor takes an hour or more. architectural context. does the abstraction make sense. are dependencies appropriate. did undocumented invariants survive. you might need someone who knows the system's history.

a day-long feature takes several hours to verify. understanding purpose, checking implementation against intent, testing realistic conditions, reviewing security and performance. focused attention from someone qualified.

the ratio shifts as tasks lengthen. ten minutes of work: thirty seconds to verify. 5% overhead. ten hours of work: three hours to verify. 30% overhead.

verification requires context. longer tasks embed more context—decisions, assumptions, component interactions. extracting that context from outputs has its own cost, and that cost compounds.

doubling task length more than doubles verification cost.
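to make the superlinearity concrete, here's a toy model. the constant and exponent are mine, fit to the 5% and 30% overhead figures above; nothing here is METR data:

```python
def verification_hours(task_hours: float, k: float = 0.11, a: float = 1.44) -> float:
    """assumed verification cost: k * t**a, with a > 1 (superlinear).
    constants fit so ten minutes of work costs ~30 seconds to verify
    and ten hours costs ~3 hours, matching the ratios above."""
    return k * task_hours ** a

# the overhead ratio (verify / execute) climbs as tasks lengthen
for t in (1 / 6, 1.0, 10.0, 40.0):  # 10 min, 1 h, 10 h, a work week
    v = verification_hours(t)
    print(f"{t:6.2f}h task -> {v:5.2f}h verify ({v / t:.0%} overhead)")
```

under these constants, doubling t multiplies verification cost by 2**1.44, roughly 2.7x: doubling task length more than doubles verification cost.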


the crossing

METR's capability curve keeps climbing. late 2026: eight-hour tasks. 2027: multi-day workflows. 2028: week-long projects.

verification follows a steeper curve.

eight-hour task, two to four hours verification. substantial overhead, but net leverage.

three-day task, full day verification. ratio tightening. a third of execution time just confirming correctness.

week-long task, verification approaches execution time. the value proposition inverts. checking AI work costs as much as doing it yourself. you've shifted effort from creation to verification—and verification is harder. you're working backward from outputs instead of forward from intent.

the crossing point varies by domain. clear specs, comprehensive tests, observable outcomes push it further out. ambiguous goals, subjective quality, complex integration pull it closer.

every domain has a crossing. the curves intersect somewhere. that intersection is the ceiling on useful autonomous AI.
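a sketch of locating that intersection, under an assumed verification curve. the constants are illustrative, fit to the overhead ratios earlier in the post, and will differ by domain:

```python
def crossing_hours(k: float = 0.11, a: float = 1.44) -> float:
    """solve k * t**a == t for t: the task length at which verification
    costs as much as execution. k and a are assumptions, not data."""
    return (1 / k) ** (1 / (a - 1))

print(f"crossing at ~{crossing_hours():.0f} task-hours")
```

these made-up constants put the crossing near 150 task-hours, a few weeks of work. cheaper verification (smaller k) pushes it out; steeper growth (larger a) pulls it in. that's the clear-specs versus ambiguous-goals split in two parameters.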


what benchmarks can't see

benchmarks need tractable verification. otherwise they couldn't score thousands of attempts.

SWE-bench: run the test suite. seconds.

MMLU: string match. negligible.

HumanEval: execute code against tests. compute only.

math olympiad: check proofs. expensive, but bounded by proof length—competitions constrain this by design.

these benchmarks exist because verification is cheap. we evaluate at scale because each evaluation costs almost nothing.

we cannot benchmark expensive-verification tasks. "build a competitive analysis." "design system architecture." "write a research proposal." each attempt needs hours of expert review. no dataset. no leaderboard. no scaling laws.

the benchmarks we have sample from a biased distribution: tasks where automated verification works. that's not a flaw—it's a constraint on what benchmarking can measure.

when agents crush SWE-bench but struggle in production, the gap is verification cost. SWE-bench has cheap verification by construction. production often doesn't.


the demo illusion

2025's pattern: agents brilliant in demos, disappointing in deployment. every major company experienced this. explanations focus on capability—demos are cherry-picked, production has edge cases, users have ambiguous requirements.

the verification lens reveals something else.

demos are built by people who know what success looks like. they designed the workflow. tested iterations. present something that works. verification happened during development, amortized across attempts. by showtime, confidence is established.

production deploys to users with goals but no specifications. "research competitors and find opportunities." "summarize these documents and surface themes." "plan a project and break it into tasks." agent attempts something. user receives output. now verification falls to them.

that means understanding what the agent did. evaluating whether it matches unstated requirements. deciding if output is usable. for complex tasks, reconstructing reasoning, checking sources, assessing whether framing fits context.

"send meeting notes to my team" has cheap verification. read the email. confirm it captures the discussion. thirty seconds.

"research our market and recommend strategy" has expensive verification. was research thorough. did it include the right competitors. are recommendations viable given constraints the agent doesn't know. does framing match how leadership thinks. verifying might take as long as doing the work. maybe longer—you're reverse-engineering instead of building forward.

the tasks where agents have capability are often the tasks where users can't easily check results.


two curves, one ceiling

multi-step reliability decays exponentially. 95% per step yields 36% over twenty steps. well understood.

verification has its own compounding dynamic.

when step-by-step verification costs too much, you can only verify final outputs. but output verification for long chains requires understanding the whole chain. assumptions made early. context established. intermediate decisions shaping results.

alternatively, verify intermediate steps. step seven makes sense given one through six. step eight makes sense given one through seven. but step seven's verification cost carries accumulated complexity from prior steps. each check inherits context.

two curves compound: reliability decays geometrically with chain length, and per-step verification cost grows with the context each check inherits.

at some chain length, they cross destructively. reliability low enough that failures are common. verification expensive enough that checking efficiently is impossible. recovery means understanding where things went wrong—itself a verification problem.

the practical ceiling on autonomous systems lives here. not capability limits. the information-theoretic cost of confirming extended autonomous operation produced correct results.
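a sketch of the two curves over one chain. the 95%-per-step reliability is from above; linear growth in per-step verification cost is my assumption:

```python
def chain_stats(steps: int, p_step: float = 0.95, unit_cost: float = 1.0):
    """reliability decays geometrically with chain length; verifying
    step i inherits context from steps 1..i-1, so per-step cost grows
    (modeled as linear here) and total cost grows quadratically."""
    reliability = p_step ** steps
    verify_cost = sum(unit_cost * i for i in range(1, steps + 1))
    return reliability, verify_cost

for n in (5, 20, 50):
    r, c = chain_stats(n)
    print(f"{n:2d} steps: {r:5.1%} chance all steps hold, {c:4.0f} cost units to check")
```

at twenty steps you're at roughly a one-in-three success rate while paying over two hundred cost units just to confirm it. both directions get worse together.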


why physics wins

lecun leaving meta to build world models at a $5 billion valuation usually reads as an architecture story. LLMs predict text; world models predict physics. different training signal, different capabilities.

the verification economics differ too.

text is hard to verify at scale. grammar, factual accuracy, coherence—you can check those. "does this text reflect understanding" or "would this strategy work" have no cheap tests. text verification needs human judgment, which is expensive and doesn't parallelize.

physical prediction verifies by default. ball falls where physics says. robot grasps cup or drops it. drone navigates or crashes. reality provides ground truth. verification cost for physical prediction is often just execution cost—incurred anyway.

world models don't eliminate verification. knowing whether a robot's action was correct doesn't tell you whether its goal was correct. planning and intent stay expensive.

but they push the crossing further out for physical execution. the gap between "passes automated checks" and "actually works" shrinks when physics arbitrates.

fei-fei li's World Labs and embodied AI broadly make sense here. physical AI gets cheaper verification structurally. an advantage the crossing can't erode as fast.


bits of trust

interpretability is usually justified on safety. understand what models do to catch misalignment early.

there's a verification argument.

if you can trace how a system reached an output, verification gets cheaper. instead of checking the output against all possible failures, you check reasoning steps. each step is a narrower target.

"the model concluded X" forces direct verification of X. expensive if X is complex.

"the model concluded X because it identified Y and inferred Z from Y" lets you verify Y and Y→Z separately. wrong on either, you know where failure is. right on both, higher confidence than verifying X directly.

circuits-style interpretability decomposes behavior into legible components. each component becomes a verification checkpoint. decomposition reduces bits needed to confirm or reject.

the crossing is information-theoretic. how many bits to confirm correctness? interpretability compresses that quantity. readable traces are cheaper to verify than opaque outputs.
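a crude illustration of that compression. the failure-mode counts below are invented; the point is only the shape of the arithmetic:

```python
import math

opaque_modes = 10**6      # assumed: ways an unexplained output could be wrong
components = 4            # assumed: legible steps in a reasoning trace
modes_per_step = 8        # assumed: ways each narrow step could be wrong

bits_opaque = math.log2(opaque_modes)                  # ~19.9 bits to rule out
bits_traced = components * math.log2(modes_per_step)   # 12 bits, checked stepwise
print(f"opaque: {bits_opaque:.1f} bits, traced: {bits_traced:.1f} bits")
```

decomposition only wins when components really are narrower targets; four steps with a million failure modes each would be worse, not better.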

this reframes interpretability from safety nicety to deployment necessity. beyond some complexity, uninterpretable systems can't deploy. not because they're dangerous, but because verification costs make them uneconomical.


building for the constraint

if verification cost determines the ceiling, product implications follow.

decomposition over monoliths. an agent completing a four-hour task as one unit might be less useful than one completing it as eight thirty-minute subtasks with checkpoints. subtasks have cheaper verification. checkpoints catch failures early. total verification cost is lower even if capability is identical.
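the arithmetic, under an assumed superlinear verification cost. the constants are illustrative; any exponent above 1 gives the same direction:

```python
def verify_cost(task_hours: float, k: float = 0.11, a: float = 1.44) -> float:
    """assumed superlinear verification cost: k * t**a."""
    return k * task_hours ** a

monolith = verify_cost(4.0)           # one four-hour unit, checked at the end
checkpointed = 8 * verify_cost(0.5)   # eight thirty-minute subtasks
print(f"monolith: {monolith:.2f}h to verify, checkpointed: {checkpointed:.2f}h")
```

under these numbers the checkpointed version costs less than half as much to verify, before counting the failures caught early.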

structure as infrastructure. free-form output is harder to verify than structured templates with explicit sections, sources, confidence levels. structure compresses what needs checking. formatting preferences are actually verification economics.

domain specificity wins. narrow domains have cheaper verification. code that compiles and passes tests is easier to verify than prose that must be insightful. medical diagnosis with clear imaging criteria is easier than legal analysis with judgment calls. the crossing varies by domain. winning means picking domains where it's far out.

humans as verification arbitrage. humans are expensive but efficient at certain verification types AI can't automate. optimal deployment interleaves AI execution with human verification at points where human checking is cheapest relative to value delivered.

evaluation is upstream. you can't improve what you can't measure. you can't deploy what you can't verify. organizations building cheap scalable verification for their domains unlock deployment others can't reach.


the frontier

three problems define where verification research needs to go.

compositional verification. twenty steps currently means something like twenty checks, each carrying context forward. can we build systems where verifying the composition is cheaper than verifying every component? that requires systems producing their own verification artifacts: structured proofs, compressed summaries, checkpoints that isolate what needs checking. systems that package up exactly what about their operation requires verification.

verification markets. if verification has a cost, it can have a price. staking mechanisms where agents commit value against outcomes, lost if the work is verified wrong. reputation systems where verified outputs carry forward, reducing future burden. specialized verifiers competing to check efficiently. verification is a credence good: you can't judge its quality even after you've paid for it. markets for credence goods are hard but not impossible.

learned verification. humans verify through pattern recognition. senior engineers don't trace every line—they recognize correctness patterns and bug anti-patterns. can models learn this? verifiers smaller, cheaper, differently architected than generators? verification can be easier than generation—easier to recognize solutions than produce them. circular only if verifiers need generator capabilities. they might not.

none solved. directions where solutions might exist.


what's coming

agents will proliferate. every company deploying them. every workflow incorporating them. capability will be there.

tasks under an hour: they'll deliver. verification cheap enough. failures recoverable. net positive.

longer tasks: impressive demo, disappointing deployment, quiet rollback. not capability failures. verification failures. organizations can't confirm correctness at acceptable cost. no confirmation, no trust. no trust, no sustained deployment.

winning organizations won't have most capable agents. they'll have cheapest verification. domain-specific suites catching real failures. structured intermediates compressing auditing. reasoning traces making behavior legible. human checkpoints where verification is most efficient.

research that matters won't primarily scale models larger. bottleneck is moving. research that matters makes outputs checkable—interpretability, structured generation, verification-aware architectures, evaluation that scales.


the constraint nobody prices

capability is scaling on schedule. verification is not.

every forecast extrapolating METR's time horizons assumes verification keeps pace. METR's own data says otherwise—algorithmic-holistic gap grows as tasks lengthen. what we can check automatically diverges from what we need to check.

that divergence is the crossing. capability improvements stop translating to deployable value because confirming correctness costs too much.

2026 is when collision starts. systems that can do remarkable work but can't be used because knowing they did it right costs more than the work is worth.

organizations solving verification first capture value. technology to do work exists. bottleneck is knowing work was done.