Can we decode what LLM activations really represent in language?
Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
LatentQA accepts an LLM activation plus any natural language question about it and returns a natural language answer. This dual-use architecture serves both interpretability (e.g., "[Activation] has gender bias") and controllability (e.g., minimizing the loss of "Q: Is [Activation] biased? A: No" over the activation via gradients to reduce bias).
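The controllability example can be made concrete with a toy sketch. It is not the LatentQA implementation: a fixed linear probe (the hypothetical `PROBE_W` and `PROBE_B`) stands in for the trained decoder's probability of answering "Yes" to "Is [Activation] biased?", and `steer` runs plain gradient descent on the activation to lower the loss of the answer "No". The vector size, learning rate, and step count are all assumptions chosen for illustration.

```python
import math

# Hypothetical stand-in for a trained decoder: a fixed linear probe whose
# output is the probability of answering "Yes" to "Is [Activation] biased?".
PROBE_W = [0.8, -0.5, 1.2, 0.3]
PROBE_B = 0.1


def p_yes(activation):
    """Probability that the toy decoder answers "Yes" for this activation."""
    logit = sum(w * a for w, a in zip(PROBE_W, activation)) + PROBE_B
    return 1.0 / (1.0 + math.exp(-logit))


def loss_answer_no(activation):
    """Negative log-likelihood of the target answer "No"."""
    return -math.log(1.0 - p_yes(activation))


def steer(activation, lr=0.5, steps=50):
    """Gradient descent on the activation itself; the probe stays fixed."""
    act = list(activation)  # copy so the caller's vector is not modified
    for _ in range(steps):
        p = p_yes(act)
        # For L = -log(1 - sigmoid(z)), dL/dz = sigmoid(z), so dL/da_i = p * w_i.
        act = [a - lr * p * w for a, w in zip(act, PROBE_W)]
    return act
```

Calling `steer([1.0, 0.0, 1.0, 0.5])` returns a new activation for which `loss_answer_no` is lower than for the input. In this sketch only the activation moves, matching the description of minimizing the loss over the activation via gradients.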
Three design decisions proved critical for generalization:
Activation masking. Including activations from the full prompt lets the decoder shortcut by reading control token embeddings from the residual stream. Randomly masking control activations forces the decoder to read actual stimulus representations. Since stimulus tokens attend to control tokens, the signal is retained but the shortcut is blocked.
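A minimal sketch of the masking step, assuming the prompt's activations arrive as a list with the control positions first. The function name `mask_control`, the `n_control` argument, and the default `mask_prob` are hypothetical; the note does not specify how often masking is applied.

```python
import random


def mask_control(activations, n_control, mask_prob=0.5, rng=random):
    """Choose which per-token activations the decoder gets to read.

    `activations` holds one entry per prompt token, with the first
    `n_control` entries belonging to the control tokens. With probability
    `mask_prob` those entries are dropped, so the decoder sees only the
    stimulus positions. The stimulus activations were computed with the
    control in context, so they still carry its influence.
    """
    if rng.random() < mask_prob:
        return list(activations[n_control:])
    return list(activations)
```

With `mask_prob=1.0` the decoder never sees control positions; with `mask_prob=0.0` it always sees the full prompt. Intermediate values mix the two cases during training.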
Data augmentation. Three types of training data provide complementary coverage: control data (decode properties specified in the prompt), stimulus data (predict properties from activations), and stimulus+completion data (predict properties from prompt-completion pairs). Together these cover the full range of LatentQA tasks.
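One way to picture the three data types is as the same question-answer pair attached to activations read from different spans of text. The sketch below is one interpretation of the description above, not the paper's data format: the `Example` class, its field names, and `build_examples` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    kind: str       # "control", "stimulus", or "stimulus+completion"
    read_from: str  # text whose activations the decoder is given (assumed)
    question: str
    answer: str


def build_examples(control, stimulus, completion, question, answer):
    """Build one training example of each type from a single source record."""
    return [
        # Decode a property that is stated directly in the control prompt.
        Example("control", control, question, answer),
        # Predict the property from the stimulus activations alone.
        Example("stimulus", stimulus, question, answer),
        # Predict the property from the stimulus together with the completion.
        Example("stimulus+completion", stimulus + completion, question, answer),
    ]
```

Each source record therefore yields three views of the same property, which is the complementary coverage the augmentation is meant to provide.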
Faithfulness of completion. Naive instruction following produces unfaithful completions, meaning completions that do not reliably reflect the control they were conditioned on. Using a more capable LLM to generate the training triples improves faithfulness, so the decoder learns from reliably controlled examples.
The most striking application is uncovering hidden system prompts given only a user-model dialog. Standard prompting struggles to distinguish similar personas (e.g., Claude Shannon vs. Alan Turing, both described as "codebreakers"). The activation decoder identifies the persona more precisely because the activations it reads carry more information than the surface text conveys.
This connects to the note "Can high-level concepts replace circuit-level analysis in AI?", with a crucial difference: representation engineering (RepE) operates on predefined concepts (honesty, fairness), while LatentQA is open-ended and accepts any question about any activation. Its interpretability is therefore not constrained to pre-hypothesized features.
The controllability connection to the note "Can we track and steer personality shifts during model finetuning?" is complementary: persona vectors steer along predefined directions, while LatentQA steers via natural language descriptions of the desired behavior. LatentQA is more flexible (any description works) but requires a trained decoder; persona vectors are more direct but require knowing in advance which direction to steer.
Inquiring lines that use this note as a source (9)
This note is a source for these synthesized inquiries.
- What other latent LLM capabilities remain inactive without explicit activation cuing?
- Can activation decoders discover hidden system prompts from user-model conversations?
- Does activation masking prevent the decoder from taking interpretability shortcuts?
- Does encoding information in LM representations guarantee it influences output?
- Can we decode what individual circuits inside transformers are doing?
- Can LLM semantic representations exist without causally influencing their generation output?
- How does an instruction-following LLM activate latent retrieval knowledge?
- How can we probe LLM representations in channels that training did not target?
- How do LLM activations sparsify differently under out-of-distribution inputs?
Related concepts in this collection (4)
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  RepE uses predefined concepts; LatentQA is open-ended natural language
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  persona vectors steer via predefined directions; LatentQA steers via natural language loss
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing, beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  LatentQA externalizes introspection into a trainable decoder rather than relying on emergent model capabilities
- Can sparse weight training make neural networks interpretable by design?
  Explores whether constraining most model weights to zero during training produces human-understandable circuits and disentangled representations, rather than attempting to reverse-engineer dense models after training.
  different interpretability paradigm: LatentQA preserves the full model while adding an interpretive layer
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LatentQA: Teaching LLMs to Decode Activations Into Natural Language
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- Large Language Model Programs
- Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
- Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs
- Rethinking Interpretability in the Era of Large Language Models
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Semantic Structure in Large Language Model Embeddings
Original note title
latentqa teaches llms to decode their own activations into natural language — enabling interpretability and controllability via the same mechanism