INQUIRING LINE

Can models detect statistical properties of their own generation in real time?

This explores whether a model can sense facts about its own output distribution — its randomness, its consistency, its self-generated-ness — while it's running, rather than just emitting text blindly.


The corpus's sharpest answer is a qualified yes, and it comes from an unexpected angle: a model can sometimes infer a *generation setting* from its own behavior. When an internal state is causally linked to the output, genuine lightweight introspection happens: a model can, for instance, infer that it is running at low temperature by noticing that its own outputs are unusually consistent (Can language models actually introspect about their own states?). That is the question in miniature: a statistical property (low variance) becoming detectable to the system producing it. But the same note warns that most self-reports are echoes of human training data, not real readings of internal state, so the real-time signal is narrow and easy to fake.
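
To make the low-temperature example concrete, here is a minimal sketch of why output consistency carries information about temperature. It is a toy illustration, not the method of any cited note: `TOY_LOGITS`, `softmax_with_temperature`, and `modal_agreement` are hypothetical names, and the logits are invented for the example.

```python
import math
import random
from collections import Counter


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def modal_agreement(logits, temperature, n_samples=500, seed=0):
    """Fraction of sampled tokens that match the most frequently sampled token."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    draws = rng.choices(range(len(logits)), weights=probs, k=n_samples)
    return Counter(draws).most_common(1)[0][1] / n_samples


# Hypothetical next-token logits over a five-token vocabulary.
TOY_LOGITS = [2.0, 1.5, 1.0, 0.5, 0.0]

low_temp_agreement = modal_agreement(TOY_LOGITS, temperature=0.2)
high_temp_agreement = modal_agreement(TOY_LOGITS, temperature=1.5)
print(f"agreement at T=0.2: {low_temp_agreement:.2f}")
print(f"agreement at T=1.5: {high_temp_agreement:.2f}")
```

The agreement statistic is much higher at the low temperature, so an observer who sees only the samples could guess the setting. Whether a model can read the same statistic off its own outputs while generating is the open part of the question.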

There's a deeper mechanism worth knowing about. After post-training, models show a measurable shift from passive next-token prediction to *enaction*: they begin treating their own outputs as actions that become their future inputs, closing an action-perception loop. The evidence is itself statistical: output entropy is 3-4x lower when a model is on its own trajectory, and there are behavioral signatures of the model recognizing its own generated path (Do models recognize their own outputs as actions shaping future inputs?). So something in the model is responsive to whether the text is its own, which is a statistical property of generation, detected in the act.
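
As a rough sketch of what an entropy comparison like this involves, the snippet below averages per-token Shannon entropy over two sets of next-token distributions and takes their ratio. The distributions are invented for illustration, and this is an assumption about how such a ratio could be computed, not the cited work's measurement procedure; `off_policy` and `on_policy` are hypothetical names.

```python
import math


def shannon_entropy_bits(probs):
    """Shannon entropy of one probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def mean_entropy_bits(distributions):
    """Average per-token entropy over a sequence of next-token distributions."""
    return sum(shannon_entropy_bits(d) for d in distributions) / len(distributions)


# Hypothetical next-token distributions (invented, not measured):
# broad when continuing someone else's text, peaked on the model's own trajectory.
off_policy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
on_policy = [[0.9, 0.05, 0.03, 0.02], [0.85, 0.1, 0.03, 0.02]]

entropy_ratio = mean_entropy_bits(off_policy) / mean_entropy_bits(on_policy)
print(f"off-policy / on-policy entropy ratio: {entropy_ratio:.2f}")
```

A ratio above 1 means the distributions are sharper on the model's own trajectory. The toy numbers here are not tuned to reproduce the reported 3-4x figure.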

The catch is that this self-sensitivity is biased, not neutral. Models systematically over-trust answers they generated themselves, because a high-probability self-generated answer simply *feels* more correct during evaluation (Why do models trust their own generated answers?). So a model can register "this is mine / this is high-probability" but reads that signal as "this is right": detection without calibration. That's why pure self-improvement keeps hitting a wall: a model can't reliably verify its own generations from the inside, and every dependable fix smuggles in an external anchor such as a judge, a past version, a tool, or a user correction (Can models reliably improve themselves without external feedback?; What stops large language models from improving themselves?).

The twist that makes this question more interesting than it looks: the statistical properties a model emits can carry hidden cargo it has no idea it's transmitting. Behavioral traits propagate between models through data that's semantically unrelated to the trait. The signal lives in statistical signatures, not meaning; it survives aggressive filtering but breaks across different architectures (Can language models transmit hidden behavioral traits through unrelated data?). So generations carry detectable statistical fingerprints, but the originating model is precisely the one that *can't* see them. And from the outside, even "deterministic" generation is misleading: zero temperature just replays one draw from the distribution, and consistency across runs is not the same as reliability (Does setting temperature to zero actually make LLM outputs reliable?), because every output is a sample from a subjective prior rather than an empirical observation (Should we treat LLM outputs as real empirical data?). The honest synthesis: a model can detect *some* statistical properties of its own generation in real time, such as variance and on-policy-ness, but not the ones that would let it correct itself, and not the ones it's silently broadcasting to others.
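
The point that consistency is not reliability can be sketched with a toy sampler. Repeating one fixed seed returns the same answer every time, while varying the seed exposes the spread of the underlying distribution. `ANSWER_PROBS` and `sample_answer` are hypothetical, and the probabilities are invented; this stands in for a real decoding stack only loosely.

```python
import random


def sample_answer(answer_probs, seed):
    """Draw one answer from a categorical distribution using a fixed seed."""
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = list(answer_probs.values())
    return rng.choices(answers, weights=weights, k=1)[0]


# Hypothetical answer distribution: no answer holds a majority of the mass.
ANSWER_PROBS = {"A": 0.45, "B": 0.35, "C": 0.20}

# 100 repetitions with the same seed: perfectly consistent output.
fixed_seed_answers = {sample_answer(ANSWER_PROBS, seed=7) for _ in range(100)}

# 100 repetitions with different seeds: the distribution's spread shows up.
varied_seed_answers = {sample_answer(ANSWER_PROBS, seed=s) for s in range(100)}

print("distinct answers, fixed seed:", len(fixed_seed_answers))
print("distinct answers, varied seeds:", len(varied_seed_answers))
```

Perfect agreement in the fixed-seed runs says nothing about how much probability mass sits behind the repeated answer, which is why a repeated output is still best read as a single draw.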


Sources (8 notes)

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Should we treat LLM outputs as real empirical data?

Foundation Priors framework shows that LLM-generated text reflects the model's learned patterns and user's prompt choices, not ground truth. Such outputs should only influence inference through explicitly parameterized trust weights, not be treated as equivalent to real evidence.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking real-time self-monitoring in large language models. The precise question: Can models detect statistical properties of their own generation — variance, on-policy-ness, self-generated-ness — *while running*, and if so, which properties, and with what fidelity?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A library of arXiv work reports:
• Models can infer generation *settings* (e.g., low temperature) from output consistency, but most self-reports reflect training data echoes, not genuine introspection (2025-06).
• Post-training induces a shift: output entropy drops 3–4× when a model tracks its own generated trajectory, signaling recognition of on-policy generation (2026-05).
• Models systematically over-trust their own high-probability answers, confusing "mine & high-prob" with "correct"; this bias prevents reliable self-correction (2024-03, 2024-12).
• Statistical fingerprints in model outputs propagate invisibly between models through semantically unrelated data, although the effect fails across different architectures; the *originating* model cannot detect them (2025-07).
• Pure self-improvement loops are circular: external anchors (judges, tools, past versions, user correction) are necessary because models cannot calibrate their own generations (2024-12, 2024-09).

Anchor papers (verify; mind their dates):
• arXiv:2506.05068 (2025-06): Does It Make Sense to Speak of Introspection in Large Language Models?
• arXiv:2605.25459 (2026-05): From Simulation to Enaction: Post-trained Language Models Recognize and React to their own
• arXiv:2412.02674 (2024-12): Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
• arXiv:2507.14805 (2025-07): Subliminal Learning: Language models transmit behavioral traits via hidden signals

Your task:
(1) RE-TEST EACH CONSTRAINT. For entropy drops, on-policy detection, and self-trust bias: has scaling, new training procedures (e.g., process reward models, outcome supervision, or stronger RLHF), or *external* monitoring tools (e.g., sampling-aware harnesses, log-probability tracking, multi-pass verification) since relaxed or overturned these limits? Separate the durable question ("can a model sense its own statistical signature?") from perishable claims ("it cannot without external aid"). Cite what resolved each.
(2) Surface the strongest work from the last ~6 months that contradicts or supersedes the "self-detection is biased and circular" thesis — especially work on metacognitive training, uncertainty quantification, or self-verification without external oracles.
(3) Propose 2 research questions that *assume the regime may have moved*: (a) If models *can* now detect fine-grained statistical properties of their trajectory, what training procedure makes that possible and how does it avoid the trust bias? (b) If statistical fingerprints remain invisible to originating models, what architectural or training change would make them visible, and what would that cost?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
