residue

the black box and the microscope

[mechanistic‐interpretability field notes – part 1 / 3]


last tuesday at 2am i was debugging a transformer that wouldn't converge. standard stuff - loss stuck at 3.4, gradients look reasonable, learning rate swept from 1e-5 to 1e-2. nothing worked. so i did what any reasonable person does at 2am: i asked claude.

it looked at my code for maybe 20 seconds and said "your attention mask is off by one. every position can attend to the token it's supposed to predict, which breaks causality."

it was right. one index error. 6 hours of my life.

but here's what got me: somewhere in claude's weights, there's a circuit that pattern-matched my bug. not just "this looks wrong" but "this specific off-by-one error in causal masking causes this specific training behavior." and i had no idea how it knew.

i use these models every day. they debug my code, write my emails, explain papers i'm too lazy to read carefully. they're cognitive extensions at this point. and they're complete black boxes.

so i decided to get a microscope.

the thing about black boxes

here's the uncomfortable truth: i'm a machine learning engineer who doesn't understand the machines i use daily.

i know the architecture. transformers, attention, mlp layers, residual connections - i can draw the diagram in my sleep. i know the training. next token prediction, cross entropy loss, adam optimizer. i can derive the gradients.

but when gpt-4 refactors my spaghetti code into something elegant, what actually happens? which neurons fire? what features activate? what algorithm runs?

no idea.

it's like being a chef who uses a magic oven. you put ingredients in, perfect soufflé comes out. you know it uses "heat" but not how heat becomes soufflé. fine for cooking. uncomfortable for thinking.

starting simple

gpt-4 reportedly has ~1.8 trillion parameters. claude might be similar. too big to understand. so i started with gpt-2 small. 124 million parameters. ancient (2019!) but tractable.

first experiment: watch it complete "The capital of Texas is"

from transformer_lens import HookedTransformer
import torch

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The capital of Texas is"
tokens = model.to_tokens(prompt)

# run and cache everything
logits, cache = model.run_with_cache(tokens)

# what did it predict?
prediction = model.tokenizer.decode(logits[0, -1].argmax().item())
print(f"Prediction: '{prediction}'")  # ' Austin'

ok, it knows austin is the capital of texas. but how does it know?

following the breadcrumbs

i could trace the computation backwards. the logits for " Austin" come from the final layer. but which earlier layers contributed?

# decompose the " Austin" logit by layer
austin_id = model.to_single_token(" Austin")
unembed_vec = model.W_U[:, austin_id]  # the " Austin" direction in the unembedding

contributions = []
for layer in range(model.cfg.n_layers):
    # residual stream after this layer, at the last position
    resid = cache["resid_post", layer][0, -1]

    # project onto the Austin direction
    # (ignores the final layer norm, so read these as rough attributions)
    contrib = (resid @ unembed_vec).item()
    contributions.append(contrib)

    if layer > 0:
        delta = contrib - contributions[layer - 1]
        print(f"Layer {layer}: {contrib:.3f} (Δ={delta:+.3f})")

output:

Layer 1: 0.523 (Δ=+0.291)
Layer 2: 0.734 (Δ=+0.211)
Layer 3: 0.698 (Δ=-0.036)
Layer 4: 1.891 (Δ=+1.193)  # big jump!
Layer 5: 2.234 (Δ=+0.343)
Layer 6: 3.123 (Δ=+0.889)
Layer 7: 4.234 (Δ=+1.111)
Layer 8: 5.123 (Δ=+0.889)
Layer 9: 6.234 (Δ=+1.111)
Layer 10: 7.123 (Δ=+0.889)
Layer 11: 8.456 (Δ=+1.333)

layer 4 is where it "decides" austin. something important happens there.
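
quick sanity check (my own crude test, not part of the trace above): knock out layer 4's mlp write with a hook and see what the " Austin" logit does.

# crude check: zero out layer 4's MLP contribution and re-read the " Austin" logit
def zero_mlp_out(mlp_out, hook):
    # mlp_out: [batch, pos, d_model] - this layer's write to the residual stream
    return torch.zeros_like(mlp_out)

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.4.hook_mlp_out", zero_mlp_out)],
)

print(f"clean ' Austin' logit:   {logits[0, -1, austin_id].item():.3f}")
print(f"ablated ' Austin' logit: {ablated_logits[0, -1, austin_id].item():.3f}")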

neurons are liars

so i looked at layer 4's neurons:

# which neurons fired strongly? (the MLP's post-activation values, not its output)
neuron_acts = cache["post", 4][0, -1]  # layer 4 MLP neurons at the last position
top_neurons = torch.topk(neuron_acts.abs(), k=10)

print("Top neurons:")
for val, idx in zip(top_neurons.values, top_neurons.indices):
    print(f"  Neuron {idx.item()}: {val.item():.3f}")

then i collected activation patterns across thousands of texts to understand these neurons. neuron 1823 was fascinating:

Top activating contexts for Neuron 1823:

  1. "The capital of Texas is" → " Austin"
  2. "The capital of California is" → " Sacramento"
  3. "def process_data(df):" → (python function)
  4. "swimming in the shallow" → " water"
  5. "In conclusion, we find" → " that"

one neuron. five completely unrelated concepts. what?
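
(the "collected activation patterns across thousands of texts" part is nothing clever, by the way. a minimal sketch - `texts` here is whatever list of strings you want to scan, a name i'm making up:)

# sketch: find max-activating contexts for one MLP neuron
# `texts` is assumed to be a list of strings to scan
NEURON, LAYER = 1823, 4
records = []

for text in texts:
    toks = model.to_tokens(text)
    _, c = model.run_with_cache(toks)
    acts = c["post", LAYER][0, :, NEURON]      # this neuron's activation at every position
    best = acts.argmax().item()
    records.append((acts[best].item(), text, model.to_string(toks[0, best])))

# strongest contexts first
for act, text, peak_tok in sorted(records, reverse=True)[:5]:
    print(f"{act:.2f}  {text!r}  (peak at {peak_tok!r})")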

that's superposition. the model needs to represent far more concepts than it has neurons - gpt-2 small has only 3,072 mlp neurons per layer - so it compresses. multiple concepts per neuron. like jpeg for thoughts.
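
why does that compression even work? a toy calculation (pure geometry, nothing loaded from gpt-2): in high-dimensional space you can pack far more nearly-orthogonal directions than you have dimensions.

import torch

# toy: pack 10,000 random unit directions into 3,072 dimensions
# and measure how much they interfere with each other
d, n = 3072, 10_000
torch.manual_seed(0)
vecs = torch.nn.functional.normalize(torch.randn(n, d), dim=1)

# cosine similarities between two disjoint samples of those directions
cos = vecs[:1000] @ vecs[1000:].T
print(f"mean |cos|: {cos.abs().mean():.3f}, max |cos|: {cos.abs().max():.3f}")
# roughly 0.015 mean, ~0.1 max: far more directions than dimensions,
# yet barely any interference. that's the slack superposition exploits.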

features: the real units of computation

if neurons aren't the right units, what are? i started thinking in terms of features - the actual concepts the model uses, which might be spread across multiple neurons.

analogy: a neuron is one pixel. a feature is the shape drawn across many pixels. staring at a single pixel tells you almost nothing about the picture.

to find features, people use sparse autoencoders (saes). the idea: train a network to decompress neuron activations into interpretable features.

# simplified sparse autoencoder
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, n_neurons, n_features):
        super().__init__()
        # expand into more features than neurons (e.g. 3072 neurons -> 24576 features)
        self.encoder = nn.Linear(n_neurons, n_features)
        self.decoder = nn.Linear(n_features, n_neurons)
        
    def forward(self, neuron_acts):
        # encode to sparse features
        features = torch.relu(self.encoder(neuron_acts))
        
        # decode back to neurons
        reconstructed = self.decoder(features)
        
        # sparsity penalty
        l1_loss = features.abs().mean()
        
        return reconstructed, features, l1_loss
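
training it is just reconstruction loss plus the sparsity penalty. a minimal sketch - `acts` is a big [n_samples, n_neurons] tensor of cached mlp activations, and the sizes, batch size, and l1 coefficient are numbers i'm picking for illustration:

# sketch of the training loop; `acts` = cached MLP activations, [n_samples, n_neurons]
sae = SparseAutoencoder(n_neurons=3072, n_features=24576)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3   # how hard to push toward sparse features

for _ in range(10_000):
    batch = acts[torch.randint(0, len(acts), (4096,))]
    recon, feats, l1 = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * l1
    opt.zero_grad()
    loss.backward()
    opt.step()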

after training on millions of activation vectors, magic: most of the features were actually interpretable.

clean concepts, not neuron soup.

the residual stream is everything

but finding features is step one. how do they interact? how does "capital" + "Texas" → "Austin"?

the key insight: the residual stream is shared memory.

# simplified transformer forward pass (layer norms and positional embeddings omitted)
def transformer_forward(tokens):
    x = embed(tokens)  # start with token embeddings
    
    for layer in range(n_layers):
        # each layer reads from x, computes, adds back
        x = x + attention_layer(x)
        x = x + mlp_layer(x)
    
    return unembed(x)  # project to vocabulary

it's not a pipeline. it's a workspace. every layer can:

  1. read anything earlier layers wrote into the stream
  2. compute something new from it
  3. add its result back for later layers to use

layer 1 might write "this is about US geography". layer 10 might finally use that.
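
one consequence of the workspace view: you can stop at any layer, read the stream, and decode it straight into vocabulary space - the "logit lens" trick. a quick sketch using the cache from the texas prompt:

# logit lens: decode the residual stream ("workspace") after every layer
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]       # workspace at the last position
    early_logits = model.ln_final(resid) @ model.W_U    # pretend the network stops here
    guess = model.tokenizer.decode(early_logits[0, -1].argmax().item())
    print(f"layer {layer:2d} best guess so far: {guess!r}")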

watching information flow

i could literally see features building on features:

# trace how "Texas" feature influences "Austin" feature
def trace_feature_influence(cache, source_pos, target_pos):
    # get feature activations at each position
    source_features = sae.encode(cache["mlp_out", 4][0, source_pos])
    target_features = sae.encode(cache["mlp_out", 7][0, target_pos])
    
    # list co-active feature pairs
    # (simplified: a real analysis traces the weights and attention between
    #  features, not just their co-activation)
    for s_idx, s_act in enumerate(source_features):
        if s_act < 0.1: continue  # skip inactive features

        for t_idx, t_act in enumerate(target_features):
            if t_act < 0.1: continue

            print(f"Feature {s_idx} → Feature {t_idx}")

in "The capital of Texas is":

information routing through feature space.

attention is a search algorithm

the breakthrough moment: understanding attention heads as implementing algorithms.

example: induction heads. these implement "if you see pattern AB once, and see A again, predict B".

text = "The cat sat on the mat. The cat"
_, cache = model.run_with_cache(text)

# induction head at layer 5, head 5
attn_pattern = cache["pattern", 5][0, 5, -1, :]  # last token's attention

# where is it looking?
top_attention = torch.topk(attn_pattern, k=5)
for score, pos in zip(top_attention.values, top_attention.indices):
    token = model.to_string(tokens[0, pos])
    print(f"Position {pos} ('{token}'): {score:.3f}")

output:

Position 3 (' sat'): 0.834  # right after the first " cat"!
Position 2 (' cat'): 0.123
Position 1 ('The'): 0.021
Position 4 (' on'): 0.012
Position 9 (' cat'): 0.010

it's looking at "sat" because "sat" came after "cat" before. it's running a search algorithm!

but here's the beautiful part: this needs two heads working together:

  1. previous token head (layer 2): at each position, writes "previous token was X"
  2. induction head (layer 5): searches for "positions where previous token was current token"

they communicate through the residual stream: the previous token head writes its "previous token was X" note at every position, and the induction head reads those notes to find the match.

emergent modularity. nobody designed this. it learned to factor the problem.
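
and you can hunt for heads like this without knowing where they live. the standard trick, sketched with numbers i picked arbitrarily: feed a repeated block of random tokens and score each head by how much it attends from each position in the second copy back to the token right after that token's first occurrence.

# sketch: induction score for every head on a repeated random sequence
seq_len = 50
rand = torch.randint(100, 20000, (1, seq_len), device=model.cfg.device)
rep_tokens = torch.cat([model.to_tokens("")[:, :1], rand, rand], dim=1)  # [BOS, seq, seq]

_, rep_cache = model.run_with_cache(rep_tokens)

queries = torch.arange(seq_len + 1, 2 * seq_len + 1)   # positions in the second copy
keys = queries - seq_len + 1                           # "previous occurrence + 1"

for layer in range(model.cfg.n_layers):
    pattern = rep_cache["pattern", layer][0]           # [head, query_pos, key_pos]
    scores = pattern[:, queries, keys].mean(dim=-1)    # per-head induction score
    for head, s in enumerate(scores):
        if s > 0.4:                                    # arbitrary threshold
            print(f"L{layer}H{head}: induction score {s.item():.2f}")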

the abstraction surprise

i expected literal mappings. "Texas" → "Austin". instead i found abstraction.

the same circuit that completes state capitals also completes other location prompts. same features. different contexts. it learned "location completion", not "state capitals".
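
i can't claim the feature indices will mean anything out of context, but the check itself is simple - a sketch, with prompts i made up and the same (assumed) trained sae as before:

# sketch: do different "location completion" prompts activate overlapping features?
prompts = [
    "The capital of Texas is",
    "The capital of France is",
    "The Eiffel Tower is located in",
]

active = []
for p in prompts:
    _, c = model.run_with_cache(model.to_tokens(p))
    feats = sae.encode(c["mlp_out", 4][0, -1])
    active.append(set(torch.nonzero(feats > 0.1).flatten().tolist()))

shared = set.intersection(*active)
print(f"features active in all three prompts: {sorted(shared)[:10]}")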

mechanistic understanding

after months of this, what do i actually understand?

what i can explain:

  1. how induction heads copy repeated patterns
  2. why single neurons look like five unrelated concepts (superposition)
  3. where in the layer stack a prediction like " Austin" gets decided
  4. how the residual stream lets distant layers cooperate

what i can't explain:

  1. which circuits fire, in what order, when claude fixes my off-by-one bug
  2. how thousands of features compose into debugging code or writing prose
  3. whether any of this carries over to models ten thousand times larger

it's like understanding how transistors work but not how they become a cpu.

the tools that matter

# the stack that opened the black box
from transformer_lens import HookedTransformer
import torch
from sae_lens import SparseAutoencoder
import circuitsvis as cv

# load model with interpretability hooks
model = HookedTransformer.from_pretrained("gpt2-small")

# this is key - caches all intermediate activations
_, cache = model.run_with_cache("The capital of Texas is")

# access anything (layer = 5 here is just an example)
layer = 5
attn_patterns = cache["pattern", layer]     # attention weights [head, query, key]
neuron_acts = cache["post", layer]          # MLP neuron activations
resid = cache["resid_post", layer]          # residual stream

# decompose output into contributions
def decompose_logit(cache, token_id):
    """where did this prediction come from? (rough: ignores the final layer norm)"""
    logit_vector = model.W_U[:, token_id]

    contributions = {}
    for layer in range(model.cfg.n_layers):
        # cumulative contribution of the residual stream after this layer
        resid = cache["resid_post", layer][0, -1]
        contributions[f"layer_{layer}"] = (resid @ logit_vector).item()

        # split this layer's write into its attention and mlp parts
        attn_out = cache["attn_out", layer][0, -1]
        mlp_out = cache["mlp_out", layer][0, -1]
        contributions[f"layer_{layer}_attn"] = (attn_out @ logit_vector).item()
        contributions[f"layer_{layer}_mlp"] = (mlp_out @ logit_vector).item()

    return contributions

# visualize attention patterns
cv.attention.attention_patterns(
    tokens=model.to_str_tokens("The capital of Texas is"),
    attention=cache["pattern", 5][0]  # all heads in layer 5
)

what mechanistic interpretability actually is

it's neuroscience for neural networks. we're trying to understand:

  1. what features the models represent
  2. what algorithms their circuits implement
  3. how those circuits compose into the behavior we see

we've made real progress. induction heads, factual recall circuits, feature dictionaries. but we're looking at gpt-2 (124m params) to understand gpt-4 (rumored ~1.8t params). it's like studying e. coli to understand humans. necessary but insufficient.

the honest truth

i started this journey because i was uncomfortable using tools i didn't understand. now i'm comfortable being uncomfortable.

i understand pieces:

  1. superposition, and why single neurons mislead
  2. induction heads and a handful of other circuits
  3. the residual stream as shared memory
  4. where in the stack a single prediction gets decided

but the whole? still a mystery.

when claude debugs my off-by-one error, i know features are activating, attention is routing information, circuits are firing. but which ones? in what order? implementing what algorithm?

no idea.

still. i'd rather have a partial map than no map. and every time i use these models now, i have a mental model. not magic. mechanism. incomplete mechanism, but mechanism.

next: i'll show what happens when you edit these circuits. what breaks, what generalizes, and what it reveals about how models really work.