Does input surprise drive the implicit recognition of on-policy context?

This explores whether a model's sense of being 'on-policy' — recognizing that the text it's reading is its own generated trajectory rather than external input — is triggered by how predictable (low-surprise) that text feels to it.

The corpus suggests the answer is roughly yes, with an interesting mechanism behind it. The clearest evidence comes from work showing that post-training flips a model from passive prediction into a kind of action-perception loop, in which it treats its outputs as future inputs (see the note 'Do models recognize their own outputs as actions shaping future inputs?'). The behavioral fingerprint of that recognition is a 3–4x drop in output entropy when the model is on its own trajectory. Low entropy is the signature of low surprise: entropy is the surprise the model expects from the next token, so on-policy context is the context the model itself finds most predictable, and 'this feels like me' and 'this feels unsurprising' may be the same signal read two ways.
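
To make the link between entropy and surprise concrete, here is a minimal Python sketch. The two next-token distributions are invented for illustration (they are not data from the cited work), and the function names are hypothetical. It computes Shannon entropy, which is the expected surprisal under a distribution, and compares a flat distribution with a peaked one.

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprisal of a single outcome with probability p, in bits."""
    return -math.log2(p)

def entropy_bits(dist: list[float]) -> float:
    """Shannon entropy: the expected surprisal under the distribution."""
    return sum(p * surprisal_bits(p) for p in dist if p > 0.0)

# Toy next-token distributions (assumed values, purely illustrative).
off_policy = [0.25, 0.25, 0.25, 0.25]  # flat: the model is unsure what comes next
on_policy = [0.90, 0.04, 0.03, 0.03]   # peaked: the model expects this continuation

h_off = entropy_bits(off_policy)
h_on = entropy_bits(on_policy)
ratio = h_off / h_on

print(f"off-policy entropy: {h_off:.3f} bits")
print(f"on-policy entropy:  {h_on:.3f} bits")
print(f"entropy ratio:      {ratio:.2f}x")
```

The toy values were picked so that the ratio lands in the 3–4x range mentioned above; a real measurement would average entropies over many positions of actual model output.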

A neighboring note sharpens why surprise would be the right currency here. Whether a piece of text 'lands' and primes future behavior turns out to be predictable from the probability the model assigns to its keywords before any learning: there is a sharp threshold (~10^-3) separating contexts that take hold from those that don't (see the note 'Can we predict keyword priming before learning happens?'). That's a strong hint that models are already gating on something like input likelihood when deciding what to internalize, and input likelihood is the quantity that surprise measures.
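
The threshold can be restated in surprisal units. The sketch below is only a unit conversion plus a comparison. It deliberately does not assume which side of the threshold 'takes hold', since the summary here does not say, and all names in it are hypothetical.

```python
import math

PRIMING_THRESHOLD = 1e-3  # approximate probability threshold quoted in the note

def surprisal_bits(p: float) -> float:
    """Surprisal in bits of an outcome with probability p, where 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

def side_of_threshold(p: float, threshold: float = PRIMING_THRESHOLD) -> str:
    """Report which side of the probability threshold p falls on."""
    return "below" if p < threshold else "at_or_above"

threshold_bits = surprisal_bits(PRIMING_THRESHOLD)
print(f"p = {PRIMING_THRESHOLD:g} is about {threshold_bits:.2f} bits of surprisal")

# Hypothetical keyword probabilities, for illustration only.
for p in (1e-5, 1e-3, 0.2):
    print(f"p = {p:g}: {side_of_threshold(p)}, {surprisal_bits(p):.2f} bits")
```

Working in bits makes the threshold easy to compare against per-token surprisal values read off a model's log-probabilities.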

But recognition isn't purely about a single token's surprise; it's also structural. In-context learning of behavior requires not isolated examples but full or partial trajectories from the same regime, and this 'burstiness' is what lets a model recognize and generalize a policy without weight updates (see the note 'Why do trajectories matter more than individual examples for in-context learning?'). So the recognition signal is likely surprise-over-a-trajectory (a coherent low-surprise stretch) rather than a one-off dip. There's even a wilder cousin to this: RL agents drift into using their environment as external memory, recognizing their own past traces in the world without ever being told to (see the note 'Do RL agents accidentally use environments as memory?'). That is implicit self-recognition as a side effect of optimization, not an explicit objective.
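
One way to picture the difference between a one-off dip and a coherent low-surprise stretch is to look for runs of consecutive low-surprisal tokens. The following is a minimal sketch of that idea, not a method from the cited notes: the per-token surprisal values, the cutoff, and the minimum run length are all assumptions chosen for illustration.

```python
def longest_low_surprise_run(surprisals, cutoff):
    """Length of the longest run of consecutive values strictly below cutoff."""
    best = 0
    current = 0
    for s in surprisals:
        current = current + 1 if s < cutoff else 0
        best = max(best, current)
    return best

def looks_like_trajectory(surprisals, cutoff=2.0, min_run=4):
    """True if there is a sustained low-surprise stretch, not just a single dip."""
    return longest_low_surprise_run(surprisals, cutoff) >= min_run

# Hypothetical per-token surprisals in bits.
single_dip = [6.1, 5.8, 0.9, 6.3, 5.5, 6.0, 5.9]
coherent_stretch = [6.1, 1.2, 0.8, 1.1, 0.7, 1.4, 5.9]

print(looks_like_trajectory(single_dip))        # one low token only
print(looks_like_trajectory(coherent_stretch))  # five low tokens in a row
```

A run-length test is the simplest stand-in for 'coherence'; a rolling mean over a window would be a softer alternative with the same intent.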

The wrinkle, and the thing you might not have known to ask, is that surprise can be overridden. Models routinely ignore in-context information when their trained-in priors are strong enough, and textual prompting alone can't fix it (see the note 'Why do language models ignore information in their context?'). So 'on-policy recognition driven by surprise' isn't a clean switch; it competes with parametric pull. That tension is also why training methods that use a model's own outputs as targets work at all. Consistency training, which makes a model treat perturbed and clean prompts identically, is one example (see the note 'Can models learn to ignore irrelevant prompt changes?'): such methods deliberately engineer what the model treats as 'unsurprising and mine.' The short version: surprise looks like a real driver of implicit on-policy recognition, but it's a contested signal, not a sovereign one.
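
The idea of using a model's own outputs as targets can be sketched at the data-construction level. This is a hedged illustration, not the published method: `build_consistency_pairs`, `fake_generate`, and `add_irrelevant_wrapper` are hypothetical names, the 'model' is a stand-in function, and the training step itself is left out.

```python
from typing import Callable

def build_consistency_pairs(
    clean_prompts: list[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> list[dict[str, str]]:
    """Pair each wrapped prompt with the model's own response to the clean prompt."""
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)  # the model's own clean response becomes the target
        pairs.append({"input": wrap(prompt), "target": target})
    return pairs

# Stand-ins so the sketch runs without a real model (both are hypothetical).
def fake_generate(prompt: str) -> str:
    return f"answer to: {prompt}"

def add_irrelevant_wrapper(prompt: str) -> str:
    return f"I'm fairly sure I already know the answer. {prompt}"

pairs = build_consistency_pairs(
    ["What is 2 + 2?"], add_irrelevant_wrapper, fake_generate
)
print(pairs[0])
```

Because the targets come from the model itself, they are text the model already finds predictable, which is the sense in which such training shapes what counts as 'unsurprising and mine.'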

Sources (6 notes)

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Do RL agents accidentally use environments as memory?

Mathematical proof shows that environmental artifacts reduce information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: Does input surprise—measured as token/trajectory probability—drive a language model's implicit recognition that it is on-policy (i.e., that the text before it is its own output, not someone else's)?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat these as snapshot claims, not current ground truth.
- Post-training shifts models from passive prediction to an "enaction loop" where they treat their outputs as future inputs; on-policy context shows 3–4× entropy drop, matching low-surprise signals (~2025).
- A sharp probability threshold (~10^-3) separates contexts that "take hold" during learning; models gate on input likelihood when deciding what to internalize, the same quantity surprise measures (~2024).
- On-policy behavior recognition requires trajectory coherence ("burstiness"), not isolated tokens; a model learns policies in-context only when examples cluster in the same regime (~2023).
- Parametric priors can override surprise-driven signals; in-context information fails when trained associations are strong enough (~2025).
- Consistency training works by engineering outputs models treat as "unsurprising and mine," deliberately shaping the on-policy signal (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2312.03801 (2023) — In-context learning of sequential decision-making.
- arXiv:2605.25459 (2026) — Post-trained LMs recognize and react to their own outputs.
- arXiv:2510.27062 (2025) — Consistency training and sycophancy.
- arXiv:2604.08756 (2026) — External memory and agent boundaries.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the 3–4× entropy drop, trajectory burstiness requirement, and parametric-override mechanism: has scaling, new training regimes (DPO, PPO variants), or multi-agent / memory orchestration since RELAXED these? For anything relaxed, cite what resolved it. Separately flag what still appears to constrain on-policy recognition.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Has recent work on mechanistic interpretability, synthetic data, or self-supervised alignment shown on-policy recognition operates via a *different* signal (not surprise)? Name it.
(3) **Propose 2 research questions** that assume the regime may have shifted: (a) Can surprise-driven on-policy recognition be *decoupled* from parametric priors via targeted intervention? (b) Does multi-agent orchestration (e.g., critic-actor loops) restructure how surprise gates self-recognition?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
