the shape of thoughts
[mechanistic‐interpretability field notes – part 2 / 3]
last post i showed you neurons firing on texas. i left out the weird part. that same neuron also fires on python functions, water descriptions, and academic conclusions. one neuron. four concepts. zero apparent connection.
this broke my mental model completely. neurons were supposed to be simple. one neuron, one concept. like cells in the brain. instead i found chaos.
the polysemantic disaster
here's what sent me spiraling. i systematically checked what activated neuron 1429 in layer 4 of gpt-2 small. took the top 100 strongest activations across thousands of texts.
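if you want to reproduce the scan, it's roughly this loop - a minimal sketch assuming TransformerLens and some iterable of strings called `texts` (that name is mine, not a library thing):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt-2 small
LAYER, NEURON = 4, 1429
hook_name = f"blocks.{LAYER}.mlp.hook_post"  # post-nonlinearity mlp activations

records = []
for text in texts:  # `texts`: whatever corpus sample you're scanning
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens, names_filter=hook_name)
    acts = cache[hook_name][0, :, NEURON]   # this neuron's activation at every position
    records.append((acts.max().item(), text))

# the 100 snippets where the neuron fires hardest
top_100 = sorted(records, key=lambda r: r[0], reverse=True)[:100]
```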
the results made no sense:
- "Austin is the capital of Texas"
- "def calculate_average(numbers):"
- "swimming in the shallow water"
- "Therefore, we conclude that"
at first i thought my code was wrong. checked it five times. nope. the neuron really does fire on all of these. strongly.
i spent a week staring at this. tried to find connections. maybe capitals and functions are both... structured? water and conclusions are both... flowing? nothing. just four random concepts sharing a neuron.
then i found more. neuron 892 fires on:
- the word "because"
- descriptions of france
- python list comprehensions
- text about breakfast foods
neuron 2341 activates on:
- quotation marks
- military terminology
- words ending in "-ness"
- discussions of climate
every neuron i checked was like this. a random grab bag of concepts. polysemantic - many meanings, one neuron.
the superposition breakthrough
the insight came while i was making coffee. the model has more concepts than neurons. way more.
think about it. gpt-2 small has 3,072 mlp neurons per layer. but how many concepts exist in language?
- every person, place, object
- every syntactic pattern
- every semantic relationship
- every domain-specific pattern
tens of thousands at minimum. probably hundreds of thousands.
3,000 neurons. 50,000+ concepts. the math doesn't work.
unless... unless the model is doing compression. packing multiple concepts into each neuron. that's what i was seeing. not confused neurons - compressed neurons.
seeing it geometrically
imagine you have a 2d space and need to store three arrows. if you only use the x and y axes, you can store two with zero interference. but if you use three directions 120 degrees apart, you can store three. they interfere a bit (each pair has a dot product of -0.5), but if they're rarely active together, it works.
scale this to 3,000 dimensions. you can pack in far more than 3,000 vectors if you're willing to accept some interference. the key insight: concepts are sparse. "python function" and "swimming" rarely occur in the same context. so sharing a neuron is fine.
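you can sanity-check the geometry numerically. nothing model-specific here, just numpy; the dimensions are picked to mirror the numbers above:

```python
import numpy as np

# 2d: three unit vectors 120 degrees apart, each pair interferes equally
angles = np.deg2rad([0, 120, 240])
v = np.stack([np.cos(angles), np.sin(angles)], axis=1)
print(np.round(v @ v.T, 2))  # off-diagonal entries are all -0.5

# 3000d: far more than 3000 near-orthogonal directions fit, and interference stays tiny
rng = np.random.default_rng(0)
d, n = 3000, 10_000
w = rng.standard_normal((n, d), dtype=np.float32)
w /= np.linalg.norm(w, axis=1, keepdims=True)
block = w[:200] @ w[:200].T                     # a small block of the gram matrix
off_diag = block[~np.eye(200, dtype=bool)]
print(off_diag.std())  # about 1/sqrt(3000) ~= 0.018 - livable if features are rarely co-active
```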
this is superposition. the model learned to compress.
finding the real units
if neurons aren't the units of thought, what are? features - the actual concepts, which might be spread across multiple neurons.
think of it like jpeg compression. the pixels (neurons) look noisy and meaningless. but there's a clean image (features) encoded in them. you just need to decompress.
enter sparse autoencoders. the idea is simple: train a network to decompress neuron activations into sparse features.
i trained one on layer 4 activations. set it to find 10,000 features from 3,000 neurons. the training forces sparsity - most features should be off most of the time.
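the core of an sae fits in a few lines. here's a bare-bones pytorch sketch - relu encoder, L1 penalty - not the exact architecture or hyperparameters i used, and `activation_batches` stands in for whatever loader feeds you cached layer-4 activations:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in=3072, d_features=10_000):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_features)
        self.decoder = nn.Linear(d_features, d_in)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative feature activations
        recon = self.decoder(features)             # reconstruction of the neuron activations
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity pressure - tune until most features are off most of the time

for batch in activation_batches:  # batches of cached [batch, 3072] mlp activations
    recon, features = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```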
the results were beautiful. that messy neuron 1429? the autoencoder decomposed it cleanly:
- feature 3847: purely state capitals
- feature 892: python function definitions
- feature 5234: water depth
- feature 7123: conclusion phrases
each feature had a single, interpretable meaning. no more chaos.
tracing thoughts
with clean features, i could finally trace how the model thinks. take "Austin is the capital of Texas":
- "Austin" activates city features
- "capital" activates capital-city features
- these combine to activate query-for-location features
- which activate texas features
- which finally promote saying "Texas"
it's not magic. it's routing. information flows from feature to feature through the network.
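here's roughly what reading that trace looks like in code, reusing `model` and the trained `sae` from the earlier sketches (the feature ids only mean anything after you've labeled them by hand):

```python
import torch

prompt = "Austin is the capital of"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens, names_filter="blocks.4.mlp.hook_post")

acts = cache["blocks.4.mlp.hook_post"][0]   # [seq_len, 3072] neuron activations
_, features = sae(acts)                     # [seq_len, 10_000] feature activations

# the top features at each token position - this is the "trace"
for pos, tok in enumerate(model.to_str_tokens(prompt)):
    top = torch.topk(features[pos], k=5)
    print(f"{tok!r}: features {top.indices.tolist()}")
```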
but here's what really blew my mind: abstraction emerges from compression.
the same circuit that completes "Paris is the capital of" with "France" also completes:
- "The Eiffel Tower is in" → "Paris"
- "You can find the Louvre in" → "Paris"
- "这个博物馆在" → "北京" ("this museum is in" → "Beijing")
the model didn't learn specific facts. it learned the abstract pattern "complete location queries". compression forced generalization.
cross-layer thinking
saes work great within a layer. but thoughts span layers. a feature in layer 2 needs to influence layer 10.
enter cross-layer transcoders. instead of decompressing within a layer, they trace features across layers. a feature reads from layer 2 and can write to layers 2, 3, 4... all the way to 12.
this matches how thinking works. early layers identify "this is python code". later layers use that context to "suggest python syntax". the influence is direct, not mediated through layers 3-9.
with cross-layer features, circuits become even clearer. i can see "python context" in layer 2 directly connecting to "suggest def keyword" in layer 10. skipping the middle entirely.
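for the shape of the idea, here's a toy cross-layer transcoder sketch: one encoder reads at an early layer, and the features get a separate decoder into every later layer. the dimensions and the layer-2 read point are illustrative, not a faithful copy of the published architectures:

```python
import torch
import torch.nn as nn

class CrossLayerTranscoder(nn.Module):
    def __init__(self, d_model=768, d_features=10_000, read_layer=2, n_layers=12):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        # one decoder per layer the features are allowed to write into (read_layer .. n_layers-1)
        self.decoders = nn.ModuleList(
            nn.Linear(d_features, d_model) for _ in range(read_layer, n_layers)
        )

    def forward(self, resid_at_read_layer):
        features = torch.relu(self.encoder(resid_at_read_layer))
        # outputs[i] is this feature set's direct contribution at layer read_layer + i
        outputs = [decoder(features) for decoder in self.decoders]
        return outputs, features
```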
the replacement model
here's where it gets wild. with enough features, you can run the model using features instead of neurons. same inputs, same outputs, but interpretable intermediate steps.
i tried this on simple prompts. about 60% of the time, the feature-based model gives the exact same answer as the original. the other 40%, there's some error - features don't capture everything yet.
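the single-layer version of this is easy to try with TransformerLens hooks. a sketch, again reusing `model` and `sae` from above (the real replacement model does this at every layer, not just layer 4):

```python
def replace_with_sae_reconstruction(acts, hook):
    # acts: [batch, pos, 3072] mlp activations - hand back the sae's reconstruction instead
    recon, _ = sae(acts)
    return recon

logits = model.run_with_hooks(
    model.to_tokens("Austin is the capital of"),
    fwd_hooks=[("blocks.4.mlp.hook_post", replace_with_sae_reconstruction)],
)
next_token = logits[0, -1].argmax().item()
print(model.tokenizer.decode(next_token))  # hopefully still " Texas"
```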
but when it works, you can read the model's thoughts:
- "capital" feature activates
- "seeking location" feature activates
- "texas" feature activates
- "output texas" feature activates
it's like having subtitles for cognition.
what breaks my brain
the craziest part: features form naturally. nobody designed them. nobody told the model to create a "state capital" feature. it emerged from next-token prediction.
the model was just trying to predict text. to do that well, it needed to track concepts. to track more concepts than it had neurons, it learned compression. compression created superposition. superposition required sparse features. sparse features enabled abstraction.
next-token prediction created thought.
practical insights
after months of this work, here's what i actually learned:
neurons are implementation details. asking "what does this neuron do" is like asking "what does memory address 0x7fff5694a8c0 do". wrong level of abstraction.
features are sparse by necessity. not by design - by necessity. dense features would interfere too much. sparsity enables superposition.
abstraction is compression. when you have to squeeze 50k concepts into 3k neurons, you learn patterns. "output a location" is more compact than separate circuits for every location.
scale changes everything. my sae pulled ~10k features out of one layer of gpt-2 small; gpt-3-scale models probably hold millions. the principles are the same but the complexity explodes.
we're still barely scratching the surface. even with perfect feature identification, we don't know why features form where they do, how they compose into behaviors, or what determines their geometry.
tools that actually work
if you want to explore this yourself, here's the minimal setup:
TransformerLens for model surgery. lets you cache and manipulate any activation.
sparse autoencoders for finding features. train them on cached activations from lots of text.
circuitsvis for visualization. seeing attention patterns and feature interactions.
patience. lots of patience. most of what you'll find is confusing noise. the signal is rare but worth it.
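to make the circuitsvis line concrete: one way to pull up attention patterns for a prompt, assuming a notebook that renders the returned html object and the `model` from the earlier sketches:

```python
import circuitsvis as cv

prompt = "Austin is the capital of Texas"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens, names_filter="blocks.4.attn.hook_pattern")

cv.attention.attention_patterns(
    tokens=model.to_str_tokens(prompt),
    attention=cache["blocks.4.attn.hook_pattern"][0],  # [n_heads, seq, seq]
)
```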
the honest truth
i understand pieces. i can show you individual features, trace specific circuits, demonstrate superposition. but i can't explain the whole.
it's progress. real progress. two years ago we were staring at neuron soup. now we have features, circuits, mechanisms. but gpt-2 small is to gpt-4 what a bicycle is to a spaceship. same principles, vastly different scale.
the good news: every bit of understanding compounds. features explain neurons. circuits explain behaviors. patterns repeat across scales.
the frustrating news: we're archaeologists with shovels facing a buried city. we can dig up pottery shards (features) and trace walls (circuits). but the civilization that built it remains alien.
still. when i use claude or gpt-4 now, i have a mental model. features activating, routing information, composing into thoughts. it's incomplete but it's not magic.
next post: what happens when you intervene. if you suppress the "python" feature, what breaks? if you amplify the "formal writing" feature, what changes? the answers reveal how shallow our understanding still is.
but also how deep it could become.