Can we decode what individual circuits inside transformers are doing?

This note explores interpretability: whether we can look inside a transformer and read off what its internal components are computing, and what the corpus says about how legible those internals actually are.

Put plainly: can we open the box and name what specific parts of a transformer are doing? The corpus says yes, partially and with some surprising tools, but it also warns that the inside is messier and more fluid than the word "circuit" suggests.

The most direct affirmative comes from work that trains a second model to translate activations into plain language. "Can we decode what LLM activations really represent in language?" describes LatentQA, a decoder trained to answer natural-language questions about what a model's internal states encode, which can then be used to steer those states by gradient descent. Decoding isn't just observation here; it doubles as a control knob. A cheaper, older trick points the same way: the "logit lens" in "Do transformers hide reasoning before producing filler tokens?" reads partial predictions out of intermediate layers, and reveals models that quietly compute the correct answer in layers 1–3, then suppress it in the final layers and emit filler tokens to satisfy a format. The reasoning was recoverable from lower-ranked predictions the whole time: the internals leaked the truth even when the output hid it.
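To make the logit lens concrete, here is a minimal sketch in plain Python. Everything in it is a toy assumption: the four-token vocabulary, the unembedding matrix `W_U`, and the per-layer hidden states `HIDDEN_BY_LAYER` are invented for illustration and do not come from any real model or from the note. The sketch also skips the final layer normalization that practical logit-lens code usually applies before unembedding. It only shows the mechanics: project each layer's hidden state through the output matrix and read off the token ranking.

```python
import math

# Toy vocabulary (hypothetical): "7" plays the correct answer, "." the filler.
VOCAB = ["7", "3", ".", "the"]

# Hypothetical unembedding matrix: one weight row per vocabulary token.
W_U = [
    [1.0, 0.0, 0.0],     # "7"
    [0.0, 1.0, 0.0],     # "3"
    [0.0, 0.0, 1.0],     # "."
    [-1.0, -1.0, -1.0],  # "the"
]

# Hypothetical hidden states at the last position, one per layer.
HIDDEN_BY_LAYER = [
    [2.0, 0.5, 0.0],  # layer 1: answer direction dominates
    [2.5, 0.5, 1.0],  # layer 2
    [2.0, 0.5, 1.5],  # layer 3
    [1.5, 0.0, 5.0],  # layer 4 (final): filler direction dominates
]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed=W_U, vocab=VOCAB):
    """Project one hidden state through the unembedding and rank the tokens."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    probs = softmax(logits)
    return sorted(zip(vocab, probs), key=lambda tp: tp[1], reverse=True)


def ranked_tokens_by_layer(hidden_by_layer=HIDDEN_BY_LAYER):
    """Token rankings (best first) read out of every layer."""
    return [[tok for tok, _ in logit_lens(h)] for h in hidden_by_layer]


if __name__ == "__main__":
    for layer, ranking in enumerate(ranked_tokens_by_layer(), start=1):
        print(f"layer {layer}: {ranking}")
```

In this toy setup the answer token ranks first in the early layers, while the final layer ranks the filler token first and keeps the answer in second place, which mirrors the "recoverable from lower-ranked predictions" pattern described above.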

Where it gets harder is that the same behavior can sit on top of completely different machinery. "What actually happens inside a language model?" finds that identical outputs can be produced by radically different internal structures, and that pushing one property (accuracy) reliably degrades others (faithfulness, calibration). So "what is this circuit doing?" may not have one stable answer across two models that behave the same. Circuit analysis in "Do foundation models learn world models or task-specific shortcuts?" makes this concrete and a little deflating: models that look like they've learned arithmetic or orbital mechanics turn out, under the microscope, to be running range-matching heuristics and slice-dependent shortcuts, not the clean algorithms we'd hoped to find. You can decode the circuit; it's just doing something dumber than the behavior implied.
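A toy illustration of non-uniqueness, under loudly stated assumptions: the two tiny ReLU networks below are hypothetical and hand-built, not taken from the cited notes. They differ only by swapping and rescaling hidden units, which is the mildest possible version of "same output, different internals" and far weaker than the radically different mechanisms the notes report. Still, it shows why reading a single hidden unit does not pin down what the model as a whole computes.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0


def hidden_activations(x, w_in):
    """Hidden-layer activations for a scalar input, with no biases."""
    return [relu(w * x) for w in w_in]


def two_layer(x, w_in, w_out):
    """Scalar-in, scalar-out network: weighted sum of ReLU hidden units."""
    return sum(o * h for o, h in zip(w_out, hidden_activations(x, w_in)))


# Hypothetical model A: relu(x) + relu(-x), which equals |x|.
A_IN, A_OUT = [1.0, -1.0], [1.0, 1.0]

# Hypothetical model B: the same function with hidden units swapped and
# rescaled (powers of two keep the floating-point arithmetic exact).
B_IN, B_OUT = [-4.0, 0.5], [0.25, 2.0]

if __name__ == "__main__":
    for x in [-3.0, -0.5, 0.0, 2.0]:
        print(
            x,
            two_layer(x, A_IN, A_OUT),
            two_layer(x, B_IN, B_OUT),
            hidden_activations(x, A_IN),
            hidden_activations(x, B_IN),
        )
```

Both networks return the absolute value of the input, yet their hidden activations disagree on every nonzero input, so a per-unit description of one model does not transfer to the other.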

There's also a deeper reason decoding is slippery. "Do transformer models store knowledge or generate it continuously?" argues that transformers don't store knowledge in fixed, retrievable slots; knowledge exists as flowing activations, closer to oral performance than to a database. If a "circuit" is a momentary pattern of flow rather than a wired-in component, then interpretability is less like reading a schematic and more like catching a current. Even when structure does appear, it's developmental: "How do transformers learn to reason across multiple steps?" finds a measurable signature, cosine clustering of entity representations, that emerges in stages as the model learns to reason, meaning the thing you're trying to decode is itself a moving target during training.
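One way such a clustering signature could be measured is sketched below. This is an assumption-laden illustration, not the cited work's procedure: the grouping of entity vectors by a shared intermediate ("bridge") entity, the `clustering_gap` metric, and the two-dimensional vectors in `EARLY` and `LATE` are all hypothetical. The idea is simply to compare the average cosine similarity within a group against the average across groups at different training checkpoints.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs in one group."""
    pairs = [
        (i, j)
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def clustering_gap(groups):
    """Mean within-group cosine minus mean cross-group cosine.

    A larger gap means representations in the same group point in more
    similar directions than representations in different groups.
    """
    within = [mean_pairwise_cosine(vs) for vs in groups.values() if len(vs) > 1]
    names = list(groups)
    cross = [
        cosine(u, v)
        for a in range(len(names))
        for b in range(a + 1, len(names))
        for u in groups[names[a]]
        for v in groups[names[b]]
    ]
    return sum(within) / len(within) - sum(cross) / len(cross)


# Hypothetical entity representations at two training checkpoints.
EARLY = {
    "bridge_A": [[1.0, 0.0], [0.0, 1.0]],
    "bridge_B": [[1.0, 0.2], [0.1, 1.0]],
}
LATE = {
    "bridge_A": [[1.0, 0.1], [0.9, 0.0]],
    "bridge_B": [[0.0, 1.0], [0.1, 0.9]],
}

if __name__ == "__main__":
    print("early gap:", round(clustering_gap(EARLY), 3))
    print("late gap:", round(clustering_gap(LATE), 3))
```

Tracking a number like this across checkpoints is what makes the "moving target" point tangible: in the toy data the gap is negative early on and clearly positive later, so the same probe gives different answers depending on when in training it is applied.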

So the honest answer: we have real, working methods to decode internal computation — activation decoders, logit lenses, circuit probes, representational signatures — and they sometimes reveal hidden reasoning the output conceals. But they also keep revealing that the internals are heuristic, flow-like, and non-unique. The frontier isn't whether we can decode circuits; it's whether "circuit" is even the right unit for something this fluid.

Sources (6 notes)

Can we decode what LLM activations really represent in language?

LatentQA trains a decoder to answer natural language questions about LLM activations, enabling both interpretability (understanding what activations encode) and controllability (steering them via gradient descent). Critical design choices—activation masking, diverse training data, and faithful completions—proved essential for generalization.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Do foundation models learn world models or task-specific shortcuts?

Inductive bias probes show transformers trained on orbital mechanics and games learn predictive patterns, not unified world structure. Fine-tuning reveals nonsensical, slice-dependent laws; circuit analysis shows arithmetic relies on range-matching heuristics, not algorithms.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

How do transformers learn to reason across multiple steps?

Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
