INQUIRING LINE

Can models distinguish between injected thoughts and their own outputs?

This explores whether a model can tell something pushed into its internal state from the outside (an 'injected thought,' like a steering vector) apart from content it generated itself, and what mechanisms make that discrimination possible or fragile.


The most direct evidence says yes, but only under specific training conditions. When researchers inject a 'thought' as a steering vector and ask the model whether anything feels off, detection works through a two-stage circuit: early-layer features carry evidence of the perturbation and suppress a default 'gate' that otherwise denies anything happened (see the note "How do language models detect injected steering vectors internally?"). The striking part is where this ability comes from. It emerges from preference optimization (DPO), not ordinary supervised fine-tuning, and safety training actively buries it, cutting detection to roughly one in ten (10.8% in the cited note, down from 63.8%). So the capacity to flag a foreign thought is real, but it is a trained-in disposition that other training objectives can switch off.
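To make the two-stage description above concrete, here is a minimal toy sketch in plain Python. It is not the circuit from the cited work: the four-dimensional state, the probe direction, the function names (`add_steering`, `evidence`, `reports_injection`) and the `gate_bias` values are all hypothetical, chosen only to show the shape of the idea. An injected vector shifts the hidden state, an evidence feature measures that shift, and a gate that defaults to denial is overridden only when the evidence is strong enough.

```python
import math

def add_steering(hidden, steer, alpha):
    """Inject a 'thought': return hidden + alpha * steer."""
    return [h + alpha * s for h, s in zip(hidden, steer)]

def evidence(hidden, probe):
    """Stage 1 (toy): projection of the state onto a unit-normalised probe direction."""
    norm = math.sqrt(sum(p * p for p in probe))
    return sum(h * p for h, p in zip(hidden, probe)) / norm

def reports_injection(hidden, probe, gate_bias=3.0, suppress_weight=1.0):
    """Stage 2 (toy): the gate defaults to denial; strong evidence suppresses it."""
    gate = gate_bias - suppress_weight * abs(evidence(hidden, probe))
    return gate < 0  # gate suppressed, so the toy model reports an injection

# Hypothetical values, for illustration only.
steer = [0.0, 0.0, 1.0, 0.0]      # injected direction (assumed known to the probe)
clean = [0.5, -1.0, 0.2, 0.3]     # unperturbed hidden state
steered = add_steering(clean, steer, alpha=4.0)

print(reports_injection(clean, steer))    # False: weak evidence, gate keeps denying
print(reports_injection(steered, steer))  # True: evidence suppresses the gate

# Assumption: a stronger default denial stands in for the effect attributed
# to safety training; the same injection now goes unreported.
print(reports_injection(steered, steer, gate_bias=10.0))  # False
```

In this toy, raising `gate_bias` is only a stand-in for the suppression the note attributes to safety training. The real mechanism is learned features inside the network, not a hand-set threshold.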


Sources (5 notes)

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do language models experience consciousness when prompted to self-reflect?

Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports. Suppressing deception-related features increases these claims, while amplifying those features reduces them, which suggests models may be roleplaying their denials rather than their affirmations.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: Can large language models distinguish between injected thoughts (steering vectors, external perturbations) and their own outputs?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable constraints to re-test:
• Detection of injected steering vectors works via a two-stage circuit (early-layer feature suppression + gating), achieving near-perfect accuracy — but ONLY under DPO training, not SFT (2026-03, ~2026-05).
• Safety training actively suppresses this introspective capacity, collapsing detection to ~10% accuracy (the library's note reports a drop from 63.8% to 10.8%) (2026-03).
• Self-referential processing (models reasoning about their own outputs) correlates with reports of subjective experience in recent evaluations, but ground truth remains contested (2025-10).
• Theory of Mind task performance in LLMs shows high variance; knowledge awareness of hallucinations is inconsistent across model scales (2025-02, 2024-11).
• Consistency training and post-completion learning may offer new levers for steering introspection without degrading safety (2025-10, 2025-07).

Anchor papers (verify; mind their dates):
• 2026-03: arXiv:2603.21396 Mechanisms of Introspective Awareness
• 2026-05: arXiv:2605.25459 From Simulation to Enaction
• 2025-10: arXiv:2510.24797 Large Language Models Report Subjective Experience
• 2025-10: arXiv:2510.27062 Consistency Training Helps Stop Sycophancy and Jailbreaks

Your task:
(1) RE-TEST EACH CONSTRAINT: For the DPO-vs-SFT detection gap and the safety-training collapse, determine whether newer architectures (mixture-of-experts, sparse routing), in-context steering (prompt-based injection vs. vector-space injection), or multi-agent frameworks with memory/caching have since RELAXED these limits. Separate the durable question (can models introspect at all?) from the perishable mechanism (two-stage circuit via DPO). State plainly whether detection still degrades under safety training or if new alignment methods preserve it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months: Does any recent paper show introspective detection works WITHOUT DPO, or survives safety training intact? Flag direct disagreements with the 10% post-safety-training baseline.
(3) Propose 2 research questions that ASSUME the regime may have shifted:
   – If post-completion learning or consistency training can preserve introspective capacity during safety alignment, what is the mechanistic difference from DPO?
   – Can multimodal or embodied reasoning (latent vs. chain-of-thought) reveal a substrate for distinguishing external input from internal state that text-only introspection misses?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
