What execution feedback signals drive context updates without supervision labels?
This explores how AI systems improve their context, prompts, or reasoning using signals generated from their own actions and outcomes — rather than human-written labels — and what those signals actually are.
This question is really asking: when nobody hands the model a labeled answer key, what does it learn from instead? The corpus points to a recurring move — turn the *structure of what the model already did* into a teaching signal. The clearest example is the ACE framework, which treats context as an evolving playbook updated through generation-reflection-curation loops: the system runs, reflects on what worked, and curates incremental edits, gaining +10.6% on agentic tasks and +8.6% on finance with no labeled supervision Can context playbooks prevent knowledge loss during iteration?. The feedback driving the update is the trajectory's own outcome, not an annotation.
A whole cluster of work shows the same principle inside reinforcement learning: outcome rewards get converted into dense, step-level signals purely from how a trajectory is shaped. Tree-search rollouts compare sibling subtrees to manufacture step-wise preferences from a single final reward Can tree structure alone convert outcome rewards into process supervision?, and the *depth* of random tree expansion automatically yields supervision at multiple granularities — coarse strategy signals from early branches, fine detail from late ones — without any granularity scheduling Does tree depth automatically produce supervision at multiple granularities?. More broadly, several methods exploit structural features — tree topology, expert-aligned actions, tool-call positions — to replace hand-annotated process reward models entirely Can trajectory structure replace hand-annotated process rewards?. The 'label' is latent in the shape of the execution itself.
The most interesting cross-domain framing is *why* execution signals work at all: post-training shifts a model from passive prediction to recognizing that its own outputs become its future inputs, closing an action-perception loop with measurable signatures (3–4x lower on-policy entropy) Do models recognize their own outputs as actions shaping future inputs?. Once a model treats its outputs as actions, its own trajectory becomes a usable feedback channel. Consistency training leans on exactly this — it uses the model's *own clean responses* as targets to teach invariance to prompt perturbations, sidestepping the staleness of fixed human labels Can models learn to ignore irrelevant prompt changes?.
There's a quieter lesson here worth knowing: not all unsupervised signal is the signal you think. Instruction tuning, it turns out, mostly transfers knowledge of the output *format* distribution rather than task understanding — models trained on deliberately wrong instructions match correct ones (43% vs 42.6%) Does instruction tuning teach task understanding or output format?. So when you update context from execution feedback, you have to ask what the feedback is actually carrying. And updating context isn't free: the real bottleneck in consolidating long context into internal state is compute, not memory — performance scales with how many consolidation passes you run Is long-context bottleneck really about memory or compute?.
The through-line: the field is steadily replacing human labels with signals harvested from the system's own behavior — trajectory structure, sibling comparisons, self-generated targets, the action-perception loop. The catch the corpus keeps surfacing is that these signals are mutable and ephemeral How does AI context differ from conventional software context?, so the discipline isn't just *getting* a label-free signal — it's curating which signal you trust and what it actually teaches.
Sources 9 notes
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO's random expansion strategy naturally produces supervision at varying granularities—early branches provide coarse strategy-level signals while late branches provide fine-grained detail supervision. This multi-resolution signal emerges from sampling structure alone, without annotation effort or granularity scheduling.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Models trained on semantically empty or deliberately incorrect instructions achieve comparable performance to those trained on full correct instructions, achieving 43% vs random baseline 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
AI interactions operate on a substrate of constantly shifting context—prompt, history, retrieved data, hidden state—that users cannot internalize like traditional UIs. This structural mutability demands a new design discipline centered on context engineering rather than interface design.