breaking things to understand them
[mechanistic-interpretability field notes – part 3 / 3]
in the last two posts i showed you how to find features in language models. clean, interpretable units hidden in superposition. but finding features is like identifying organs in an alien. you can label the heart, lungs, liver. but do you really understand how they work?
only one way to find out. start cutting.
the first cut
i started simple. took gpt-2 small, found the "python code" feature in layer 4. it fires strongly whenever the model sees function definitions, imports, typical python syntax. clean signal.
what happens if i just... turn it off?
i fed the model "def calculate_average(numbers):" and watched. normally it continues with proper python. indent, maybe "total = 0", start a loop. standard stuff.
with the python feature suppressed: chaos. the model completes with "The history of medieval France began in..."
wait. what?
tried again. "import numpy as np" normally continues with more imports or code. with python feature off: "In conclusion, the patient presents with..."
the model didn't just lose python knowledge. it lost the entire context. like removing one card from a house of cards.
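for the curious, here's roughly what "turn it off" means in code. this is a minimal sketch, not my actual harness: it assumes a transformer_lens HookedTransformer for gpt-2 small plus a sparse autoencoder on the layer-4 residual stream. the SAE class is a stand-in with random weights (you'd load trained ones in practice) and the feature index is made up.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt-2 small

class SparseAutoencoder(torch.nn.Module):
    # minimal SAE: features = relu((x - b_dec) @ W_enc + b_enc), reconstruction = f @ W_dec + b_dec
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = torch.nn.Parameter(0.01 * torch.randn(d_model, n_features))
        self.b_enc = torch.nn.Parameter(torch.zeros(n_features))
        self.W_dec = torch.nn.Parameter(0.01 * torch.randn(n_features, d_model))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))

    def encode(self, x):
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec

# in practice you'd load trained weights for an SAE on blocks.4.hook_resid_post;
# random init just keeps the sketch self-contained
sae = SparseAutoencoder(d_model=model.cfg.d_model, n_features=24576).to(model.cfg.device)
PYTHON_FEATURE_IDX = 1234  # hypothetical index of the "python code" feature

def suppress_feature(resid, hook, feature_idx=PYTHON_FEATURE_IDX):
    # measure how strongly the feature fires at each position, then subtract its
    # decoder direction from the residual stream so later layers never see it
    acts = sae.encode(resid)[..., feature_idx]                  # [batch, pos]
    return resid - acts.unsqueeze(-1) * sae.W_dec[feature_idx]  # returned value replaces the activation

prompt = "def calculate_average(numbers):"
logits = model.run_with_hooks(
    model.to_tokens(prompt),
    fwd_hooks=[("blocks.4.hook_resid_post", suppress_feature)],
)
print(model.tokenizer.decode(logits[0, -1].argmax().item()))  # next token with the feature off
```

to watch full completions instead of a single next token, keep the same hook alive with model.hooks around model.generate - that's the pattern the later sketches use.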
why suppression breaks everything
here's what i slowly realized. features aren't independent modules you can just unplug. they're more like notes in a chord. remove one note and the whole harmony changes.
when the python feature activates, it doesn't work alone. it influences dozens of other features:
- "technical writing" features
- "structured text" features
- "next token is likely lowercase" features
- "suppress narrative text" features
suppress python and all these downstream features get confused. they're still firing but now without context. the model falls back to generic patterns. usually wikipedia-style text since that dominates training data.
feature amplification
if suppression breaks things, what about amplification? i tried boosting the "formal academic writing" feature to 10x its normal strength.
fed it "the cat sat on the". normal completion: "mat" or "couch" or "windowsill".
with formal feature amplified: "aforementioned feline specimen positioned itself upon the designated resting surface whereby..."
hilarious. but also revealing. the model has different registers of language implemented as competing features. boost one and it dominates. like turning up the bass until you can't hear the melody.
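mechanically, amplification is the same hook with the sign flipped: instead of subtracting the feature's decoder direction, add extra copies of it wherever it already fires. same placeholder model and sae as the sketch above, made-up index, and i'm pretending the formal-writing feature also lives in layer 4 to keep things short.

```python
FORMAL_FEATURE_IDX = 5678  # hypothetical index of the "formal academic writing" feature

def amplify_feature(resid, hook, feature_idx=FORMAL_FEATURE_IDX, factor=10.0):
    # add (factor - 1) extra copies of the feature's decoder direction wherever it fires,
    # boosting its effective activation to roughly factor times its normal strength
    acts = sae.encode(resid)[..., feature_idx]
    return resid + (factor - 1.0) * acts.unsqueeze(-1) * sae.W_dec[feature_idx]

# model.hooks keeps the hook active for the whole generation loop
with model.hooks(fwd_hooks=[("blocks.4.hook_resid_post", amplify_feature)]):
    print(model.generate("the cat sat on the", max_new_tokens=20))
```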
the composition problem
but here's where it gets really weird. features don't just add. they compose in non-linear ways.
example: found three features:
- "writing about france"
- "historical text"
- "talking about food"
activate just france: "France is a country in Europe..."

activate just history: "In the year 1789..."

activate just food: "The delicious taste of..."
activate france + history: "The French Revolution began in 1789..." makes sense. features combine naturally.
but activate all three: "Marie Antoinette's infamous quote about cake, though historically disputed, reflects the complex relationship between French culinary culture and revolutionary politics..."
whoa. the model didn't just add the topics. it found the intersection. the semantic space where all three concepts naturally meet. this isn't addition. it's conceptual triangulation.
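the composition experiments are just the injection version of the earlier hook: rather than scaling what the prompt already activated, write several feature directions into the residual stream at a fixed strength and see where the model lands. sketch below, same placeholder setup, indices and strength made up.

```python
FRANCE_IDX, HISTORY_IDX, FOOD_IDX = 111, 222, 333  # hypothetical feature indices

def inject_features(resid, hook, feature_ids=(FRANCE_IDX, HISTORY_IDX, FOOD_IDX), strength=8.0):
    # add each feature's decoder direction at every position, whether or not
    # the prompt activated it, so all three concepts are "on" at once
    for idx in feature_ids:
        resid = resid + strength * sae.W_dec[idx]
    return resid

with model.hooks(fwd_hooks=[("blocks.4.hook_resid_post", inject_features)]):
    print(model.generate("The", max_new_tokens=40))
```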
discovering control vectors
this gave me an idea. if features compose semantically, can we build control vectors? directions in feature space that reliably change behavior?
i collected dozens of examples of formal vs casual writing. found the average feature activation difference. this gave me a "formality vector" - about 50 features that distinguish academic from conversational text.
apply this vector to any text and it shifts register. "hey what's up" becomes "Greetings, I inquire as to your current status."
but the cool part: it preserves meaning while changing style. the model understands these are different ways of saying the same thing.
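building the vector is embarrassingly simple: average the feature activations over formal examples, average over casual examples, take the difference. a sketch with the same placeholder model and sae, and two tiny prompt lists standing in for the dozens of real examples.

```python
def mean_feature_acts(prompts, layer=4):
    # average SAE feature activations at the last token of each prompt
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(p)
        resid = cache[f"blocks.{layer}.hook_resid_post"][0, -1]
        acts.append(sae.encode(resid))
    return torch.stack(acts).mean(dim=0)

formal = ["We hereby request your attendance at the aforementioned meeting.",
          "The committee has concluded its review of the proposal."]
casual = ["hey what's up", "lol no way, that's wild"]

# which features fire more for formal text than for casual text
formality_vec = mean_feature_acts(formal) - mean_feature_acts(casual)
# optionally keep only the ~50 largest-magnitude entries so the vector stays sparse

def apply_formality(resid, hook, scale=4.0):
    # push every position along the decoded formality direction
    return resid + scale * (formality_vec @ sae.W_dec)

with model.hooks(fwd_hooks=[("blocks.4.hook_resid_post", apply_formality)]):
    print(model.generate("hey what's up", max_new_tokens=20))
```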
i built vectors for:
- happy vs sad sentiment
- confident vs uncertain
- technical vs simple explanations
- past vs present tense
each vector involves 20-100 features working together. suppress some, amplify others. the changes are systematic and predictable.
the causal graph revelation
but i was still thinking too simply. features don't just combine. they form circuits. information flows from feature to feature across layers.
this is where attribution graphs changed everything. instead of just seeing which features are active, you can trace how they influence each other.
take "The capital of Texas is Austin". here's the actual circuit:
- "capital" features in layer 2 activate
- these trigger "seeking location" features in layer 4
- which activate "Texas" features in layer 6
- which combine with "location query" to activate "say Austin" features in layer 10
it's not just features. it's a computation graph. each feature processes information from earlier features and passes it forward.
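the real attribution-graph tooling traces these edges with gradients and linearized attributions. but you can get a crude estimate of a single edge by hand: ablate the upstream feature and measure how much the downstream feature drops. sketch below, reusing the stand-in SAE class with one placeholder per layer and made-up feature indices.

```python
# placeholder SAEs for two layers (trained weights in practice)
sae_by_layer = {L: SparseAutoencoder(model.cfg.d_model, 24576).to(model.cfg.device) for L in (2, 6)}
CAPITAL_FEAT_L2, TEXAS_FEAT_L6 = 42, 77  # hypothetical feature indices

def feature_act(prompt, layer, feature_idx, extra_hooks=()):
    # how strongly one feature fires at the last token, optionally under an intervention
    grabbed = {}
    def grab(resid, hook):
        grabbed["resid"] = resid[0, -1].detach()  # returning None leaves the activation unchanged
    model.run_with_hooks(
        model.to_tokens(prompt),
        fwd_hooks=list(extra_hooks) + [(f"blocks.{layer}.hook_resid_post", grab)],
    )
    return sae_by_layer[layer].encode(grabbed["resid"])[feature_idx].item()

def ablate_capital(resid, hook, feature_idx=CAPITAL_FEAT_L2):
    # knock out the upstream feature at layer 2
    acts = sae_by_layer[2].encode(resid)[..., feature_idx]
    return resid - acts.unsqueeze(-1) * sae_by_layer[2].W_dec[feature_idx]

prompt = "The capital of Texas is"
baseline = feature_act(prompt, 6, TEXAS_FEAT_L6)
ablated = feature_act(prompt, 6, TEXAS_FEAT_L6,
                      extra_hooks=[("blocks.2.hook_resid_post", ablate_capital)])
print("edge strength, capital@L2 -> texas@L6:", baseline - ablated)
```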
surgical interventions
with attribution graphs, i could do precise surgery. not just "suppress the python feature" but "suppress the connection from python to structured-text."
the results were fascinating. suppressing feature-to-feature connections often had cleaner effects than suppressing features themselves.
example: model writes python with comments. there's a circuit from "python syntax" to "add explanatory comment". suppress that connection and the model still writes perfect python. just no comments.
it's like finding the specific wire that carries specific information. cut it and you remove one behavior while preserving others.
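here's the shape of that surgery in its crudest direct-path form: estimate how much of the downstream feature is owed to the upstream one (upstream activation times the dot product between the upstream decoder direction and the downstream encoder direction), then subtract just that much of the downstream direction from the residual stream. it ignores whatever the edge routes through attention and mlps in between, and the layers and indices are made up, but it's the basic move.

```python
# placeholder SAEs for the two layers (trained weights in practice)
sae4 = SparseAutoencoder(model.cfg.d_model, 24576).to(model.cfg.device)
sae6 = SparseAutoencoder(model.cfg.d_model, 24576).to(model.cfg.device)
PY_IDX, COMMENT_IDX = 1234, 4321  # hypothetical: "python syntax" (L4) -> "add explanatory comment" (L6)

# direct-path weight of the edge: how much one unit of the upstream feature,
# written into the residual stream, moves the downstream feature's pre-activation
edge_weight = sae4.W_dec[PY_IDX] @ sae6.W_enc[:, COMMENT_IDX]

stash = {}

def record_python(resid, hook):
    # remember how strongly the upstream feature fires (no modification)
    stash["py_acts"] = sae4.encode(resid)[..., PY_IDX]

def cut_edge(resid, hook):
    # remove only the part of the downstream feature that arrived via this edge,
    # leaving the rest of the residual stream (and the python feature itself) intact
    delta = stash["py_acts"] * edge_weight  # [batch, pos]
    return resid - delta.unsqueeze(-1) * sae6.W_dec[COMMENT_IDX]

with model.hooks(fwd_hooks=[("blocks.4.hook_resid_post", record_python),
                            ("blocks.6.hook_resid_post", cut_edge)]):
    print(model.generate("def calculate_average(numbers):", max_new_tokens=40))
```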
finding hidden algorithms
the wildest discovery came by accident. i was tracing how the model solves simple arithmetic. "what is 7 + 8?"
expected to find a circuit that just recalls "15" from training. instead found something bizarre. the model was actually computing. it learned to break addition into easier sub-problems, an algorithm nobody taught it. it just discovered that decomposition helps predict the right answer:
- "this is addition" features activate
- trigger "decompose into parts" features
- which activate "7 is 5+2" and "8 is 5+3"
- then "5+5 is 10" and "2+3 is 5"
- finally "10+5 is 15"
but here's the kicker: suppress any step and it fails. the model can't fall back to memorization. it only knows the algorithmic path.
the steering problem
armed with all these tools - suppression, amplification, vectors, circuit editing - i tried the obvious thing. can we steer the model to be more helpful? more truthful? less biased?
harder than it sounds.
truthfulness isn't a single feature. it's a complex property emerging from hundreds of features interacting. there's no "be truthful" neuron to amplify.
i found features correlated with truthfulness:
- "expressing uncertainty"
- "citing sources"
- "acknowledging counterarguments"
amplify all of them and you get text that sounds truthful. lots of hedging, citations, nuance. but is it actually more truthful? or just performing truthfulness?
the model learned these features predict truthful text in training. but that's correlation, not causation. like amplifying "wearing a lab coat" to make someone a better scientist.
compound interventions
the real insights came from complex interventions. not just changing one thing but coordinated changes across the network.
example: making the model explain its reasoning. found a circuit from "answer" features to "output answer" features. inserted a detour through "explanation" features.
now the model naturally explains before answering. "The capital of Texas is... let me think, Texas is a large state in the southern United States, major cities include Houston, Dallas, and Austin which serves as the capital... Austin."
but notice: i didn't make it explain. i rerouted the information flow so explanation becomes part of the path to answering. the behavior emerges from the structure.
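a minimal sketch of one way to wire a detour like that, under the same placeholder setup: gate an injection of the explanation feature on the answer feature, so explaining sits on the path to answering. indices made up as usual.

```python
ANSWER_IDX, EXPLAIN_IDX = 900, 901  # hypothetical feature indices

def detour_through_explanation(resid, hook, scale=3.0):
    # gate the injection on the answer feature: the harder the model is trying
    # to answer, the harder we push it toward explaining first
    answering = sae.encode(resid)[..., ANSWER_IDX]
    return resid + scale * answering.unsqueeze(-1) * sae.W_dec[EXPLAIN_IDX]

with model.hooks(fwd_hooks=[("blocks.4.hook_resid_post", detour_through_explanation)]):
    print(model.generate("The capital of Texas is", max_new_tokens=60))
```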
the humbling truth
after months of cutting, amplifying, rerouting, and steering, here's what i learned:
features are not modules. you can't just plug and unplug them. they're more like instruments in an orchestra. remove the violin and the whole symphony changes.
composition is semantic. features don't add linearly. they find meaningful intersections. the model understands concepts, not just patterns.
circuits are computations. information flows through features like data through a program. the structure of the circuit determines the computation.
behavior is emergent. complex properties like truthfulness aren't localized. they emerge from entire networks of features interacting.
control is possible but weird. we can steer models but not like driving a car. more like conducting an orchestra. indirect, emergent, sometimes surprising.
why this matters
we're at a weird point in history. we have these incredibly capable systems. we can kind of understand pieces of how they work. we can kind of control them. but it's all still held together with duct tape and good intentions.
the techniques i've shown - feature suppression, amplification, circuit editing - they work. you can change model behavior in predictable ways. but we're like mechanics who learned to fix cars by trial and error. we know what works but not always why.
the attribution graphs reveal real circuits. the interventions change real behaviors. but between our tools and true understanding remains a vast gulf.
what i actually do now
when i use language models now, i can't help but see the features firing. when claude helps debug my code, i imagine the python features activating, routing through debugging circuits, composing with explanation features.
when i prompt engineer, i'm really doing feature engineering. trying to activate the right features in the right order. it's not about tricking the model. it's about speaking its internal language.
and when models behave weirdly - hallucinations, contradictions, failures - i see it differently. not as bugs but as windows. glimpses of the alien computational process underneath.
we built these things to predict text. but in doing so, we accidentally built something that constructs thoughts from features, routes information through circuits, and composes concepts in semantic space.
we wanted a text predictor. we got a thinking machine. now we're slowly learning how it thinks.
the path forward
the tools exist. sparse autoencoders find features. attribution graphs reveal circuits. interventions test understanding. but we're still in the early days.
what we need:
- better feature detection that works at scale
- circuit discovery that's automated, not manual
- intervention techniques that compose reliably
- theory that explains why any of this works
but mostly we need time. time to map the territory. time to build better tools. time to understand what we've built.
because right now we're performing surgery with stone tools. we can see we're cutting something important. sometimes the patient even gets better. but we don't really know what we're doing.
the humbling truth: we've learned enough to know how much we don't know. the models work. we can kind of see how. we can kind of control them.
but between "kind of" and "actually" lies the future of ai safety, ai capabilities, and maybe ai consciousness.
we have work to do.