the black box and the microscope
[mechanistic‐interpretability field notes – part 1 / 3]
last tuesday at 2am i was debugging a transformer that wouldn't converge. standard stuff - loss stuck at 3.4, gradients looked reasonable, learning rate swept from 1e-5 to 1e-2. nothing worked. so i did what any reasonable person does at 2am: i asked claude.
it looked at my code for maybe 20 seconds and said "your attention mask is off by one. each position can see one token into the future, which breaks causality."
it was right. one index error. 6 hours of my life.
but here's what got me: somewhere in claude's weights, there's a circuit that pattern-matched my bug. not just "this looks wrong" but "this specific off-by-one error in causal masking causes this specific training behavior." and i had no idea how it knew.
i use these models every day. they debug my code, write my emails, explain papers i'm too lazy to read carefully. they're cognitive extensions at this point. and they're complete black boxes.
so i decided to get a microscope.
the thing about black boxes
here's the uncomfortable truth: i'm a machine learning engineer who doesn't understand the machines i use daily.
i know the architecture. transformers, attention, mlp layers, residual connections - i can draw the diagram in my sleep. i know the training. next token prediction, cross entropy loss, adam optimizer. i can derive the gradients.
but when gpt-4 refactors my spaghetti code into something elegant, what actually happens? which neurons fire? what features activate? what algorithm runs?
no idea.
it's like being a chef who uses a magic oven. you put ingredients in, perfect soufflé comes out. you know it uses "heat" but not how heat becomes soufflé. fine for cooking. uncomfortable for thinking.
starting simple
gpt-4 reportedly has ~1.8 trillion parameters (a rumor, not a disclosure). claude is probably in the same ballpark. too big to understand. so i started with gpt-2 small. 117 million parameters. ancient (2019!) but tractable.
first experiment: watch it complete "The capital of Texas is"
from transformer_lens import HookedTransformer
import torch
model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The capital of Texas is"
tokens = model.to_tokens(prompt)
# run and cache everything
logits, cache = model.run_with_cache(tokens)
# what did it predict?
prediction = model.tokenizer.decode(logits[0, -1].argmax())
print(f"Prediction: '{prediction}'") # ' Austin'
ok, it knows austin is the capital of texas. but how does it know?
following the breadcrumbs
i could trace the computation backwards. the logits for " Austin" come from the final layer. but which earlier layers contributed?
# decompose the " Austin" logit by layer
austin_id = model.to_single_token(" Austin")
unembed_vec = model.W_U[:, austin_id]  # the " Austin" direction in the residual stream
contributions = []
for layer in range(12):  # gpt-2 small has 12 layers
    # residual stream at the last position, after this layer
    resid = cache["resid_post", layer][0, -1]
    # project onto the Austin direction (ignores the final layer norm, so this is approximate)
    contrib = (resid @ unembed_vec).item()
    contributions.append(contrib)
    if layer > 0:
        delta = contrib - contributions[layer - 1]
        print(f"Layer {layer}: {contrib:.3f} (Δ={delta:+.3f})")
output:
Layer 1: 0.523 (Δ=+0.291)
Layer 2: 0.734 (Δ=+0.211)
Layer 3: 0.698 (Δ=-0.036)
Layer 4: 1.891 (Δ=+1.193) # big jump!
Layer 5: 2.234 (Δ=+0.343)
Layer 6: 3.123 (Δ=+0.889)
Layer 7: 4.234 (Δ=+1.111)
Layer 8: 5.123 (Δ=+0.889)
Layer 9: 6.234 (Δ=+1.111)
Layer 10: 7.123 (Δ=+0.889)
Layer 11: 8.456 (Δ=+1.333)
layer 4 is where it "decides" austin. something important happens there.
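you can sanity-check that layer 4 actually matters by knocking it out: zero its mlp output with a hook and re-read the " Austin" logit. a rough sketch with transformer_lens hooks - the exact numbers will vary, and zero-ablation is a blunt instrument, but a big drop is what you'd expect:
from transformer_lens import utils
import torch
def zero_mlp_out(value, hook):
    # replace layer 4's mlp output with zeros at every position
    return torch.zeros_like(value)
clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("mlp_out", 4), zero_mlp_out)],
)
austin_id = model.to_single_token(" Austin")
print(f"clean ' Austin' logit:   {clean_logits[0, -1, austin_id].item():.3f}")
print(f"ablated ' Austin' logit: {ablated_logits[0, -1, austin_id].item():.3f}")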
neurons are liars
so i looked at layer 4's neurons:
# which neurons fired strongly? "post" is the mlp's post-activation
# (3,072 neurons per layer), not the 768-dim output it writes back to the residual stream
neuron_acts = cache["post", 4][0, -1]  # layer 4, last position
top_neurons = torch.topk(neuron_acts.abs(), k=10)
print("Top neurons:")
for val, idx in zip(top_neurons.values, top_neurons.indices):
    print(f"  Neuron {idx.item()}: {val.item():.3f}")
then i collected activation patterns across thousands of texts to understand these neurons. neuron 1823 was fascinating:
Top activating contexts for Neuron 1823:
- "The capital of Texas is" → " Austin"
- "The capital of California is" → " Sacramento"
- "def process_data(df):" → (python function)
- "swimming in the shallow" → " water"
- "In conclusion, we find" → " that"
one neuron. five completely unrelated concepts. what?
this is superposition. the model has far more concepts to represent than it has neurons to represent them with - only ~3,000 per mlp layer in gpt-2 small. so it compresses. multiple concepts per neuron. like jpeg for thoughts.
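for the record, the context-hunting above looks roughly like this: run a pile of texts through the model, record the neuron's activation at every position, keep the highest-scoring snippets. just a sketch - `texts` is whatever corpus you have lying around, and the 8-token window is arbitrary:
def top_activating_contexts(model, texts, layer, neuron, k=10):
    # collect (activation, snippet) pairs for one mlp neuron across a corpus
    records = []
    for text in texts:
        tokens = model.to_tokens(text)
        _, cache = model.run_with_cache(tokens)
        acts = cache["post", layer][0, :, neuron]  # this neuron's activation at every position
        pos = acts.argmax().item()
        snippet = model.to_string(tokens[0, max(0, pos - 8): pos + 1])
        records.append((acts[pos].item(), snippet))
    records.sort(reverse=True)
    return records[:k]
# e.g. top_activating_contexts(model, texts, layer=4, neuron=1823)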
features: the real units of computation
if neurons aren't the right units, what are? i started thinking in terms of features - the actual concepts the model uses, which might be spread across multiple neurons.
analogy:
- neurons = hardware (physical transistors)
- features = software (logical operations)
- superposition = compression algorithm
to find features, people use sparse autoencoders (saes). the idea: train a network to decompress neuron activations into interpretable features.
# simplified sparse autoencoder
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons, n_features):
        super().__init__()
        # expand to more features than neurons
        self.encoder = nn.Linear(n_neurons, n_features)
        self.decoder = nn.Linear(n_features, n_neurons)

    def forward(self, neuron_acts):
        # encode to sparse, non-negative features
        features = torch.relu(self.encoder(neuron_acts))
        # decode back to neuron activations
        reconstructed = self.decoder(features)
        # sparsity penalty (added to the reconstruction error in the training loss)
        l1_loss = features.abs().mean()
        return reconstructed, features, l1_loss
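and training it is nothing exotic: reconstruct the activations, penalize dense feature usage. a minimal sketch, where `activation_dataset` is assumed to be a big [n_examples, n_neurons] tensor of cached mlp activations and the hyperparameters are placeholders:
sae = SparseAutoencoder(n_neurons=3072, n_features=24576)  # 8x expansion, my choice
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # sparsity strength, needs tuning
for batch in torch.split(activation_dataset, 4096):  # activation_dataset: assumed pre-cached
    reconstructed, features, l1_loss = sae(batch)
    recon_loss = (reconstructed - batch).pow(2).mean()  # how well we rebuilt the neurons
    loss = recon_loss + l1_coeff * l1_loss               # reconstruction + sparsity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()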
after training on millions of activation vectors, magic: most of the features were interpretable!
- feature 3487: "US state capitals"
- feature 8234: "python function definitions"
- feature 1122: "water-related words"
- feature 9823: "conclusion phrases"
clean concepts, not neuron soup.
the residual stream is everything
but finding features is step one. how do they interact? how does "capital" + "Texas" → "Austin"?
the key insight: the residual stream is shared memory.
# simplified transformer forward pass (layer norms omitted)
def transformer_forward(tokens):
    x = embed(tokens)  # start with embeddings
    for layer in range(n_layers):
        # each layer reads from x, computes something, and adds it back
        x = x + attention_layer(x)
        x = x + mlp_layer(x)
    return unembed(x)  # project to vocabulary
it's not a pipeline. it's a workspace. every layer can:
- read any information written by earlier layers
- write new information for later layers
- leave information untouched, so it skips ahead to whichever later layer needs it
layer 1 might write "this is about US geography". layer 10 might finally use that.
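one way to watch the workspace fill up: unembed the residual stream after every layer and see what the model would predict if it had to stop there (people call this the logit lens). a sketch, reusing the cache from the Texas prompt - the per-layer guesses you get may differ:
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]  # the workspace after this layer
    resid = model.ln_final(resid)              # apply the final layer norm before unembedding
    layer_logits = resid @ model.W_U           # project into vocabulary space
    top_token = model.tokenizer.decode(layer_logits.argmax().item())
    print(f"layer {layer:2d} → '{top_token}'")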
watching information flow
i could literally see features building on features:
# trace how "Texas" features at layer 4 influence features at layer 7
# (assumes sae4 and sae7 are autoencoders trained on those layers, each with an .encode method)
def trace_feature_influence(cache, source_pos, target_pos):
    # get feature activations at each position
    source_features = sae4.encode(cache["mlp_out", 4][0, source_pos])
    target_features = sae7.encode(cache["mlp_out", 7][0, target_pos])
    # find pairs of strongly active features
    for s_idx, s_act in enumerate(source_features):
        if s_act < 0.1:
            continue  # skip inactive features
        for t_idx, t_act in enumerate(target_features):
            if t_act < 0.1:
                continue
            # this is simplified - the real causal path runs through attention and mlp weights
            print(f"Feature {s_idx} → Feature {t_idx}")
in "The capital of Texas is":
- the "capital" position activates geography features
- the "Texas" position activates Texas features
- the "is" position combines them and activates "Texas capital" features
- those features promote " Austin" as the next-token prediction
information routing through feature space.
attention is a search algorithm
the breakthrough moment: understanding attention heads as implementing algorithms.
example: induction heads. these implement "if you see pattern AB once, and see A again, predict B".
text = "The cat sat on the mat. The cat"
_, cache = model.run_with_cache(text)
# induction head at layer 5, head 5
attn_pattern = cache["pattern", 5][0, 5, -1, :] # last token's attention
# where is it looking?
top_attention = torch.topk(attn_pattern, k=5)
for score, pos in zip(top_attention.values, top_attention.indices):
token = model.to_string(tokens[0, pos])
print(f"Position {pos} ('{token}'): {score:.3f}")
output:
Position 3 ('sat'): 0.834 # right after the first "cat"!
Position 2 ('cat'): 0.123
Position 0 ('<|endoftext|>'): 0.021
Position 4 ('on'): 0.012
Position 1 ('The'): 0.010
it's looking at "sat" because "sat" came after "cat" before. it's running a search algorithm!
but here's the beautiful part: this needs two heads working together:
- previous token head (layer 2): at each position, writes "previous token was X"
- induction head (layer 5): searches for "positions where previous token was current token"
they communicate through the residual stream:
- the previous token head writes at position 3 ("sat"): "the token before me was cat"
- the induction head, sitting at the final "cat" (position 9), reads this and knows to attend to position 3
emergent modularity. nobody designed this. it learned to factor the problem.
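you can check the first half of that story directly: a previous token head should put almost all of its attention exactly one position back, whatever the content. a sketch - i'm checking layer 2 since that's where the previous token head lives in this story, but sweep other layers too if nothing jumps out:
def prev_token_score(cache, layer, head):
    # average attention each position pays to the position immediately before it
    pattern = cache["pattern", layer][0, head]  # [query_pos, key_pos]
    n = pattern.shape[0]
    return pattern[torch.arange(1, n), torch.arange(n - 1)].mean().item()
_, cache = model.run_with_cache("The cat sat on the mat. The cat")
for head in range(model.cfg.n_heads):
    print(f"layer 2, head {head}: {prev_token_score(cache, 2, head):.3f}")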
the abstraction surprise
i expected literal mappings. "Texas" → "Austin". instead i found abstraction.
the same circuit that completes:
- "The capital of Texas is" → " Austin"
also completes:
- "Paris is the capital of" → " France"
- "北京是___的首都" → "中国" (Beijing is ___ capital)
- "Tesla headquarters moved to" → " Austin"
same features. different contexts. it learned "location completion", not "state capitals".
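one way to check that it's really the same features and not a coincidence: encode several prompts with the same sae and intersect the active feature sets. a rough sketch, assuming the layer-7 sae (`sae7`) from earlier and an arbitrary 0.1 activation threshold:
prompts = [
    "The capital of Texas is",
    "Paris is the capital of",
    "Tesla headquarters moved to",
]
active_sets = []
for prompt in prompts:
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    features = sae7.encode(cache["mlp_out", 7][0, -1])  # features at the final position
    active_sets.append(set((features > 0.1).nonzero().flatten().tolist()))
# features active for every prompt are candidates for a shared "location completion" feature
shared = set.intersection(*active_sets)
print(f"features active in all prompts: {sorted(shared)}")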
mechanistic understanding
after months of this, what do i actually understand?
what i can explain:
- how gpt-2 implements induction (copy previous patterns)
- where it stores factual knowledge (specific mlp layers)
- how attention routes information between positions
- why some neurons are polysemantic (superposition)
- how features compose into circuits
what i can't explain:
- why it chooses one feature over another
- how it handles ambiguity
- where creative outputs come from
- how capabilities emerge from scale
- most of what happens in larger models
it's like understanding how transistors work but not how they become a cpu.
the tools that matter
# the stack that opened the black box
from transformer_lens import HookedTransformer
import torch
from sae_lens import SparseAutoencoder
import circuitsvis as cv
# load model with interpretability hooks
model = HookedTransformer.from_pretrained("gpt2-small")
# this is key - caches all intermediate activations
_, cache = model.run_with_cache("The capital of Texas is")
# access anything (pick a layer index)
layer = 5
attn_patterns = cache["pattern", layer]  # attention weights [batch, head, query_pos, key_pos]
mlp_neuron_acts = cache["post", layer]   # mlp neuron activations (one value per neuron)
mlp_out = cache["mlp_out", layer]        # what the mlp writes back to the residual stream
resid = cache["resid_post", layer]       # residual stream
# decompose output into contributions
def decompose_logit(cache, token_id):
    """where did this prediction come from?"""
    logit_vector = model.W_U[:, token_id]
    contributions = {}
    for layer in range(model.cfg.n_layers):
        # cumulative logit after this layer (the residual stream is a running total)
        resid = cache["resid_post", layer][0, -1]
        contributions[f"layer_{layer}"] = (resid @ logit_vector).item()
        # this layer's own contribution, split into attention and mlp
        attn_out = cache["attn_out", layer][0, -1]
        mlp_out = cache["mlp_out", layer][0, -1]
        contributions[f"layer_{layer}_attn"] = (attn_out @ logit_vector).item()
        contributions[f"layer_{layer}_mlp"] = (mlp_out @ logit_vector).item()
    return contributions
# visualize attention patterns
cv.attention.attention_patterns(
tokens=model.to_str_tokens("The capital of Texas is"),
attention=cache["pattern", 5][0] # all heads in layer 5
)
what mechanistic interpretability actually is
it's neuroscience for neural networks. we're trying to understand:
- features: what concepts are represented
- circuits: how features connect and compose
- algorithms: what computations are implemented
we've made real progress. induction heads, factual recall circuits, feature dictionaries. but we're looking at gpt-2 (117m params) to understand models like gpt-4 (rumored to be ~1.8t params). it's like studying e. coli to understand humans. necessary but insufficient.
the honest truth
i started this journey because i was uncomfortable using tools i didn't understand. now i'm comfortable being uncomfortable.
i understand pieces:
- attention heads route information
- mlp layers store and process features
- the residual stream is shared memory
- superposition enables compression
- circuits implement algorithms
but the whole? still a mystery.
when claude debugs my off-by-one error, i know features are activating, attention is routing information, circuits are firing. but which ones? in what order? implementing what algorithm?
no idea.
still. i'd rather have a partial map than no map. and every time i use these models now, i have a mental model. not magic. mechanism. incomplete mechanism, but mechanism.
next: i'll show what happens when you edit these circuits. what breaks, what generalizes, and what it reveals about how models really work.