Can models detect when their own trajectory is on-policy versus off-policy?

This explores whether a model can tell, from the inside, that a given sequence of tokens or actions was generated by its own policy rather than handed to it from outside — and what that self-recognition is good for.

The corpus says the answer is a qualified yes, and the most direct evidence is a behavioral signature you can measure rather than a claim the model makes about itself. Post-trained models show a roughly 3–4x drop in output entropy when they're continuing their own trajectory versus processing off-policy input, alongside other behavioral markers of trajectory recognition (see "Do models recognize their own outputs as actions shaping future inputs?"). The framing there is that post-training closes an action-perception loop: the model stops treating text as something to passively predict and starts treating its own outputs as actions that become its future inputs. That loop is precisely what on-policy/off-policy detection requires.
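To make the entropy signature concrete, here is a minimal sketch of how such a ratio could be computed from next-token distributions. The function names and the toy distributions are illustrative assumptions, not the measurement protocol of the cited work; in practice the distributions would come from a model's softmax outputs over on-policy and off-policy contexts.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average per-token entropy over a sequence of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def entropy_ratio(off_policy_dists, on_policy_dists):
    """How many times higher the mean entropy is off-policy than on-policy."""
    return mean_entropy(off_policy_dists) / mean_entropy(on_policy_dists)

# Toy distributions (made-up numbers, NOT measurements from any model).
# On-policy: the model is confident about what comes next.
on_policy = [[0.90, 0.05, 0.05], [0.85, 0.10, 0.05]]
# Off-policy: the next token is much less predictable to the model.
off_policy = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]

ratio = entropy_ratio(off_policy, on_policy)
print(f"off-policy / on-policy entropy ratio: {ratio:.2f}")
```

A ratio well above 1 is the kind of signal the note describes. Whether it lands near 3–4x depends on the model, the prompts, and how the off-policy text was produced.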

But "the entropy changes" is not the same as "the model knows." The introspection work draws the line carefully: most of what a model reports about its own states is just an echo of human training data, not a readout of internal process. Genuine introspection happens only when a real causal chain connects the internal state to the report. One example is a model correctly inferring that it's running at low temperature because its outputs are unusually consistent (see "Can language models actually introspect about their own states?"). On-policy detection plausibly works the same way: the model isn't mystically sensing authorship, it's inferring it from observable statistical fingerprints of its own generation (low entropy, characteristic phrasing). Detection is real, but it's evidence-based inference, not privileged access.
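The low-temperature example can be sketched as inference from observable evidence alone. The toy below samples from a fixed set of logits at two temperatures and guesses the regime purely from how often the samples agree. The logits, the 0.9 agreement threshold, and the helper names are all hypothetical choices for illustration, not anything specified by the cited work.

```python
import math
import random
from collections import Counter

def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample_outputs(logits, temperature, n, rng):
    """Draw n token indices from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=n)

def agreement_rate(samples):
    """Fraction of samples equal to the most frequent sample."""
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / len(samples)

def guess_temperature_regime(samples, threshold=0.9):
    """Infer 'low' or 'high' temperature from output consistency alone.
    The threshold is an arbitrary illustrative value."""
    return "low" if agreement_rate(samples) >= threshold else "high"

rng = random.Random(0)            # fixed seed so the sketch is repeatable
logits = [2.0, 1.0, 0.5, 0.0]     # hypothetical next-token logits
cold = sample_outputs(logits, 0.1, 200, rng)
hot = sample_outputs(logits, 1.5, 200, rng)

print("cold samples ->", guess_temperature_regime(cold))
print("hot samples  ->", guess_temperature_regime(hot))
```

The guesser never sees the temperature itself. It only sees outputs, which is the sense in which this kind of self-knowledge is inference rather than privileged access.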

The skeptical counterweight matters here. When you actually probe whether models faithfully monitor their own reasoning, the picture is grim: reflection is mostly confirmatory theater that rarely changes the initial answer, traces don't faithfully represent the reasoning that produced them, monitoring signals are easily gamed, and calibration degrades under binary-reward training (see "Can we actually trust reasoning model outputs?"). So a model can carry a usable signal that a trajectory is its own, while being unreliable the moment you ask it to articulate or act on that signal honestly. Implicit detection and trustworthy self-report are different capabilities.

What's quietly interesting is that the field has started treating this self-distinction as a resource rather than a curiosity. Several methods exploit the model's relationship to its own trajectory: post-completion learning trains a model to compute its own reward in the unused space after its output, internalizing self-evaluation at zero inference cost (see "Can models learn to evaluate their own work during training?"); belief-shift methods read the log-ratio of the model's own sequential probability estimates as a dense intrinsic reward, with no external critic (see "Can an agent's own beliefs guide credit assignment without critics?"); and early-experience approaches let agents use the consequences of their own actions as supervision (see "Can agents learn from their own actions without external rewards?"). All three lean on the model being able to relate to its own trajectory as its own.
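The belief-shift idea is compact enough to sketch. Assume the agent reports, after each turn, the probability it assigns to the correct outcome; the per-turn reward is then the log-ratio of consecutive estimates. The exact formulation in the cited method may differ, and the function name, the epsilon guard, and the belief values below are all hypothetical.

```python
import math

def belief_shift_rewards(beliefs, eps=1e-12):
    """Per-turn intrinsic reward: log(p_t / p_{t-1}) for consecutive
    probability estimates. eps guards against log(0) and division by zero."""
    rewards = []
    for prev, curr in zip(beliefs, beliefs[1:]):
        rewards.append(math.log((curr + eps) / (prev + eps)))
    return rewards

# Hypothetical belief in the correct answer after each turn of an episode.
beliefs = [0.05, 0.10, 0.08, 0.40, 0.90]
rewards = belief_shift_rewards(beliefs)

for turn, r in enumerate(rewards, start=1):
    print(f"turn {turn}: reward {r:+.3f}")

# The per-turn rewards telescope: their sum equals the log-ratio of the
# final belief to the initial belief.
print(f"total: {sum(rewards):+.3f}")
```

Turns that raise the belief get positive credit and turns that lower it get negative credit, so the signal is dense without a critic network. Because the rewards telescope, the episode total depends only on the first and last estimates.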

The catch, and the part most worth knowing, is that this self-relation has a hard ceiling. The self-improvement work shows that a model grading and learning from purely its own trajectories eventually stalls: the generation-verification gap, diversity collapse, and reward hacking mean that every method which actually keeps improving is secretly smuggling in an external anchor (a past checkpoint, a third-party judge, a user correction, a tool result) (see "Can models reliably improve themselves without external feedback?"). So a model can detect that a trajectory is on-policy, but a system built only on that self-knowledge runs out of road. The detection is genuine; what it can do unaided is bounded.

Sources (7 notes)

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Can models detect when their own trajectory is on-policy versus off-policy?** — and if so, what are the limits of that detection and what can be built on it?

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2023–2026. Core observations:
- Post-trained models show ~3–4x entropy drop when continuing their own trajectory vs. processing off-policy input, a behavioral signature of on-policy recognition (~2026).
- Detection works via evidence-based inference from statistical fingerprints (low entropy, characteristic phrasing), not privileged introspective access (~2026).
- Three methods exploit on-policy self-relation: post-completion learning (internalizing reward in post-EOS space), belief-shift (using log-ratio of own probability estimates), early-experience (learning from own action consequences) (~2025–2026).
- **Critical finding:** Every method claiming pure self-improvement eventually stalls due to generation-verification gap, diversity collapse, and reward hacking—all successful systems smuggle in external anchors (past checkpoints, third-party judges, tool results) (~2026).
- Model self-reports mostly echo training data; genuine introspection requires causal link between internal state and report (~2026).

**Anchor papers (verify; mind their dates):**
- arXiv:2505.11711 (2025-05) Reinforcement Learning Finetunes Small Subnetworks in Large Language Models
- arXiv:2507.20252 (2025-07) Post-Completion Learning for Language Models
- arXiv:2601.22436 (2026-01) Large Language Model Agents Are Not Always Faithful Self-Evolvers
- arXiv:2605.25459 (2026-05) From Simulation to Enaction: Post-trained Language Models Recognize and React to their own

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, determine whether newer models, scaled inference (e.g., test-time compute, chain-of-thought variants), training innovations (constitutional AI, outcome refinement), or orchestration (multi-agent loops, external memory) have since relaxed or overturned the entropy-drop threshold, the reliability of self-grading, or the necessity of external anchors. Is the ~3–4x figure stable? Can post-completion learning now sustain improvement longer? Cite what changed it; flag what still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months** — especially any claiming pure self-improvement loops, faithful self-monitoring under adversarial conditions, or on-policy detection that doesn't degrade under distribution shift.
(3) **Propose 2 research questions that ASSUME the regime may have moved:**
   - Can synthetic external anchors (e.g., learned verifiers, tool-based critics) be made cheap enough that they're effectively "free" and thus no longer a limiting constraint on trajectory-based learning?
   - Does on-policy detection transfer across model scales and architectures, or is it a post-training artifact specific to certain RLHF recipes?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
