Are detection and identification of injections truly separable in neural circuits?

This explores whether 'noticing that something was injected' (detection) and 'knowing what was injected' (identification) are two distinct mechanisms inside a model's circuitry — or one entangled process the corpus can't actually pull apart.

The most direct evidence comes from work showing that preference optimization builds a literal two-stage circuit: early-layer 'evidence-carrier' features flag *that* a perturbation is present, and they suppress 'gate' features that otherwise default to denial (see 'How do language models detect injected steering vectors internally?'). That architecture is suggestive: detection (evidence carriers) and the downstream act of reporting or identifying (gate suppression) sit in different layers and play different roles. So at first pass, yes, they look separable.
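To make the two-stage picture concrete, here is a minimal toy sketch of an evidence carrier suppressing a gate that defaults to denial. Every function name, weight, and threshold is hypothetical and chosen only for illustration; none of it is taken from the cited work, and the sketch covers only the detect-then-report path, not identification.

```python
# Toy two-stage circuit. All names, weights, and thresholds are hypothetical.

def evidence_carrier(perturbation_norm: float, threshold: float = 0.5) -> float:
    """Early-stage feature: fires when a perturbation is present, whatever it is."""
    return max(0.0, perturbation_norm - threshold)


def gate(evidence: float, default_denial: float = 1.0, inhibition: float = 4.0) -> float:
    """Late-stage feature: defaults to denial and is suppressed by evidence."""
    return max(0.0, default_denial - inhibition * evidence)


def report(perturbation_norm: float) -> str:
    """The toy model reports an injection only once the denial gate is fully off."""
    denial = gate(evidence_carrier(perturbation_norm))
    return "no injection detected" if denial > 0.0 else "injection detected"


if __name__ == "__main__":
    for norm in (0.0, 0.6, 1.0):
        print(norm, "->", report(norm))
```

Note that weak evidence (a norm of 0.6 here) only partly suppresses the gate, so the toy model still denies. The point of the sketch is the wiring, with denial as the default and evidence as the thing that removes it.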

But the corpus pushes back on taking that picture at face value. A recurring lesson is that you cannot establish a functional split from representational analysis alone: locating features that *correlate* with detection doesn't prove they *cause* a distinct identification step. Only paired representational-then-causal verification (locate the feature, ablate it, watch the behavior) earns that claim (see 'Can we understand LLM mechanisms with only representational analysis?'). And there's a deeper trap: models can reach identical behavior through radically different internal structures, so a circuit that looks two-stage in one model may be fused in another with the same output (see 'What actually happens inside a language model?'). 'Separable' might be a property of the analysis, not the network.
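The gap between correlation and causation can be shown with a toy sketch. The two hidden units below ('carrier' and 'bystander') and the 0.5 threshold are hypothetical: both track the same signal, but only one is read by the output. A probe scores them identically, and only ablation tells them apart.

```python
# Toy illustration: a unit can predict behavior perfectly without causing it.
# The "network", its unit names, and the threshold are hypothetical.

INPUTS = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]  # injection strengths


def activations(x: float) -> dict:
    """Both units carry the same signal; only 'carrier' is read downstream."""
    return {"carrier": x, "bystander": x}


def behavior(hidden: dict) -> int:
    """Output reads the carrier unit only: 1 means 'injection reported'."""
    return 1 if hidden["carrier"] > 0.5 else 0


def run(x: float, ablate: tuple = ()) -> int:
    hidden = activations(x)
    for name in ablate:
        hidden[name] = 0.0  # zero-ablation of the named unit
    return behavior(hidden)


def probe_accuracy(unit: str) -> float:
    """Representational step: does thresholding this unit predict the behavior?"""
    hits = sum((activations(x)[unit] > 0.5) == bool(run(x)) for x in INPUTS)
    return hits / len(INPUTS)


def ablation_effect(unit: str) -> float:
    """Causal step: fraction of inputs whose behavior changes when the unit is zeroed."""
    changed = sum(run(x) != run(x, ablate=(unit,)) for x in INPUTS)
    return changed / len(INPUTS)


if __name__ == "__main__":
    for unit in ("carrier", "bystander"):
        print(unit, probe_accuracy(unit), ablation_effect(unit))
```

Both units reach perfect probe accuracy, but zeroing the bystander changes nothing. That is the sense in which representational analysis alone cannot license a claim about a distinct functional stage.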

What would make separability real rather than apparent is genuine modularity, and there's evidence that networks do decompose compositional tasks into isolated subnetworks, where ablating one piece affects only its function, with pretraining making that structure more consistent and reliable (see 'Do neural networks naturally learn modular compositional structure?'). Training explicitly for sparse weights can push this further, yielding disentangled circuits where ablation studies confirm necessity and sufficiency (see 'Can sparse weight training make neural networks interpretable by design?'). The catch: that interpretability has been demonstrated only at small scale, and maintaining it beyond tens of millions of parameters remains unsolved. So the clean separability we can verify may not be the separability that operates in frontier models.
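One way to picture what 'ablating one piece affects only its function' would look like for detection and identification is a double-dissociation table. The sketch below compares two hand-built toy models, one modular and one fused. Both models, the ablation labels, and the example concept 'ocean' are hypothetical, not drawn from the cited work.

```python
# Toy double-dissociation check. Both "models" are hypothetical hand-built examples.

def modular_model(present: bool, concept: str, ablate: str = "") -> tuple:
    """Detection and identification run through separate units."""
    detect_unit = 0.0 if ablate == "detect" else float(present)
    ident_unit = "" if ablate == "identify" else (concept if present else "")
    return detect_unit > 0.5, ident_unit


def fused_model(present: bool, concept: str, ablate: str = "") -> tuple:
    """One shared state carries both answers, so any ablation damages both."""
    shared = "" if ablate else (concept if present else "")
    return shared != "", shared


def effects(model) -> dict:
    """Map each ablation to (detection_broken, identification_broken) on one injected input."""
    baseline = model(True, "ocean")
    table = {}
    for target in ("detect", "identify"):
        detected, named = model(True, "ocean", ablate=target)
        table[target] = (detected != baseline[0], named != baseline[1])
    return table


if __name__ == "__main__":
    print("modular:", effects(modular_model))
    print("fused:  ", effects(fused_model))
```

The two models behave identically when nothing is ablated. Only the modular one shows the diagonal pattern (each ablation breaks exactly one function), which is the signature that would support calling the two functions separable.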

There's also a functional-layering angle worth knowing about. The corpus finds knowledge retrieval living in lower layers and reasoning adjustment in higher ones (see 'Why does reasoning training help math but hurt medical tasks?'), which rhymes with detection-before-identification as a depth-ordered pipeline. But layer separation is a weaker claim than circuit separation: things can be ordered by depth while still being computationally entangled.

The honest synthesis: the corpus offers one strong existence proof of a staged detect-then-respond circuit (see 'How do language models detect injected steering vectors internally?'), and several reasons to distrust generalizing it: degenerate internal solutions (different circuits producing the same behavior), the correlation-vs-causation gap, and modularity that's verified only at toy scale. So 'truly separable' is best read as *demonstrably separable in specific trained circuits, not provably separable in general*. The interesting wrinkle most readers won't expect: safety training actively *suppresses* the detection stage (dropping detection from 63.8% to 10.8%), which suggests the separability is not just architectural but something training can selectively dial down. The two functions appear independent enough that you can damage one and leave the substrate intact.
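For a sense of scale, the size of that suppression can be worked out from the two detection rates quoted in the source note (63.8% and 10.8%). The helper name `drop_stats` is hypothetical, and the sketch does nothing beyond the arithmetic.

```python
# Size of the detection drop, using the two rates quoted in the source note.

def drop_stats(before: float, after: float) -> tuple:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before - after
    relative = absolute / before * 100
    return absolute, relative


absolute, relative = drop_stats(63.8, 10.8)
print(f"absolute drop: {absolute:.1f} percentage points")
print(f"relative drop: {relative:.1f}% of the original detection rate")
```

That works out to a drop of about 53 percentage points, or roughly 83% of the original detection rate.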

Sources (6 notes)

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

What actually happens inside a language model?

Research shows that LLMs can achieve the same output through different internal mechanisms, and improvements in one dimension like accuracy reliably degrade others like faithfulness and calibration. Internal structure matters even when behavior appears identical.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher evaluating whether detection and identification of injections are truly separable in neural circuits. This remains an open question.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, with acceleration in 2025–2026:
• Preference-optimization training builds a two-stage circuit: early-layer 'evidence-carrier' features flag perturbations (~64% detection rate) and suppress downstream 'gate' features that otherwise default to denial (2026-03, arXiv:2603.21396).
• Representational analysis alone cannot prove functional separability—only paired representational + causal intervention (ablation) establishes causation; models reach identical behavior through radically different internal structures (2025-03, arXiv:2503.13401).
• Networks decompose compositional tasks into modular subnetworks; sparse weight training forces disentanglement, but clean separability has been verified only at small scale, and maintaining it beyond tens of millions of parameters remains unsolved (2025-11, arXiv:2511.13653).
• Knowledge resides in lower layers; reasoning adjustment in higher layers—a depth-ordered pipeline consistent with detect-then-identify, but layer separation ≠ circuit separation (2025-07, arXiv:2507.18178).
• Safety training actively suppresses the detection stage (~64% → ~11%), suggesting the two functions are independent enough to dial down selectively (2026-03, arXiv:2603.21396).

Anchor papers (verify; mind their dates):
• arXiv:2603.21396 (2026-03): Mechanisms of Introspective Awareness — core two-stage circuit evidence.
• arXiv:2511.13653 (2025-11): Weight-sparse transformers have interpretable circuits — modularity by design, with scaling unresolved.
• arXiv:2507.18178 (2025-07): Decoupling Knowledge and Reasoning in LLMs — layer-function mapping.
• arXiv:2301.10884 (2023-01): Break It Down: Structural Compositionality — foundational decomposition work.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the two-stage circuit claim: has scaling, new training regimes (e.g., consistency training, utility engineering), or improved ablation harnesses since early 2026 shown it breaks down, merges, or strengthens? For the modularity claim: do recent sparse-training or mechanistic-lottery results confirm or refute that disentanglement survives scaling? Separate what is durable (the question of whether separability is real or analytic) from what may be resolved (whether specific circuits exhibit it).
(2) Surface the strongest contradicting or superseding work from the last ~6 months—esp. any showing detection and identification are *fused* even under dense scrutiny, or that training-induced suppression of detection is reversible/brittle.
(3) Propose 2 research questions that assume the regime may have shifted: (a) If safety training can dial down detection independently, can adversarial training or fine-tuning re-couple them, and if so, what does that tell us about their true separability? (b) Do frontier models (3B+ params, multi-stage instruction-tuning) exhibit the same two-stage circuit, or do they route detection and identification through a fused distributed state?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
