What specific qualities make some demonstrations more effective for agency training?
This explores what makes a demonstration *good* for teaching an agent to act: the qualities of the example, not the size of the dataset. The corpus has a surprisingly sharp answer, and it starts by overturning the assumption that more is better. The LIMI work shows that 78 carefully chosen multi-turn trajectories outperform models trained on 10,000+ samples by 53.7%, reaching 73.5% on AgencyBench (see: Can careful selection of 78 demos outperform massive training datasets?). The decisive quality there is *completeness*: each demonstration captures a whole interaction sequence, including the reasoning, the tool calls, and the back-and-forth, rather than an isolated input-output pair. Complete trajectories seem to activate agentic patterns already latent in the pretrained model, which is why a handful can do what thousands of fragments can't.
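The difference between a complete trajectory and an isolated input-output pair can be made concrete with a small data-structure sketch. Everything here is a hypothetical illustration, not LIMI's actual data format: the `Turn` and `Trajectory` types, the role and kind labels, and the `is_complete` heuristic are all assumptions chosen for clarity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    role: str     # hypothetical labels: "user", "assistant", or "tool"
    kind: str     # hypothetical labels: "message", "reasoning", "tool_call", "tool_result"
    content: str


@dataclass
class Trajectory:
    task: str
    turns: List[Turn] = field(default_factory=list)


def is_complete(traj: Trajectory) -> bool:
    """Assumed heuristic: a trajectory is 'complete' if it shows reasoning,
    at least one tool call with its result, and ends on an assistant message."""
    kinds = {t.kind for t in traj.turns}
    shows_whole_act = {"reasoning", "tool_call", "tool_result"} <= kinds
    ends_with_answer = (
        bool(traj.turns)
        and traj.turns[-1].role == "assistant"
        and traj.turns[-1].kind == "message"
    )
    return shows_whole_act and ends_with_answer


# An isolated input-output pair: only the request and the final answer.
pair_only = Trajectory(
    task="fix failing test",
    turns=[
        Turn("user", "message", "The test suite fails, please fix it."),
        Turn("assistant", "message", "Fixed."),
    ],
)

# A complete trajectory: reasoning, tool use, and the back-and-forth.
full = Trajectory(
    task="fix failing test",
    turns=[
        Turn("user", "message", "The test suite fails, please fix it."),
        Turn("assistant", "reasoning", "Run the tests first to see the error."),
        Turn("assistant", "tool_call", "run_tests()"),
        Turn("tool", "tool_result", "AssertionError in test_parse"),
        Turn("assistant", "message", "The parser mishandled empty input; fixed."),
    ],
)

print(is_complete(pair_only), is_complete(full))  # prints: False True
```

A filter like this only checks structure; it says nothing about whether the reasoning or tool use in a trajectory is actually good.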
But completeness has a ceiling. Even perfect expert trajectories bound the agent to whatever the curator imagined, because the agent never interacts with the environment while learning, so it cannot discover anything the demonstrator didn't think to show (see: Can agents learn beyond what their training data shows?). This points to a second quality: demonstrations are more effective when they include *failure*, not just polished success. ReasoningBank found that storing strategy-level lessons from both wins and losses beats success-only memory, because failures teach what to avoid and why (see: Can agents learn better from their failures than successes?). A demonstration that only ever shows the agent winning leaves it brittle the moment reality diverges from the script.
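As a rough illustration of what "strategy-level lessons from both wins and losses" could look like in storage, here is a minimal sketch. It is not ReasoningBank's implementation: the `Lesson` and `LessonMemory` names, the keyword-overlap retrieval, and the `keep_failures` switch are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass
class Lesson:
    strategy: str             # a strategy-level hint, not a raw trajectory
    outcome: str              # "success" or "failure"
    keywords: FrozenSet[str]  # hypothetical retrieval keys


class LessonMemory:
    def __init__(self, keep_failures: bool = True):
        # keep_failures=False mimics a success-only memory for comparison.
        self.keep_failures = keep_failures
        self.lessons: List[Lesson] = []

    def add(self, strategy: str, outcome: str, keywords: Iterable[str]) -> None:
        if outcome == "failure" and not self.keep_failures:
            return  # a success-only memory discards what went wrong
        self.lessons.append(Lesson(strategy, outcome, frozenset(keywords)))

    def retrieve(self, task_keywords: Iterable[str], k: int = 2) -> List[Lesson]:
        """Return up to k lessons ranked by keyword overlap (assumed scoring)."""
        query = set(task_keywords)
        scored = [(len(query & lesson.keywords), lesson) for lesson in self.lessons]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]


both = LessonMemory(keep_failures=True)
success_only = LessonMemory(keep_failures=False)
for memory in (both, success_only):
    memory.add("Check pagination before concluding a list is complete.",
               "failure", ["search", "list", "pagination"])
    memory.add("Use the site filter instead of scrolling.",
               "success", ["search", "filter"])

hints = both.retrieve(["search", "pagination"])
```

With the same two experiences, `both` can surface the pagination warning while `success_only` never stored it.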
A third quality is *fit to the learner*. A demonstration that is objectively higher quality can actually degrade performance if it sits beyond the student model's current learning frontier; students do better when they filter teacher refinements down to what is compatible with their own capabilities (see: Does teacher-refined data always improve student model performance?). The same logic shows up in training environments: moderately demanding, well-aligned setups produce better agents than maximally hard ones, because over-difficult demonstrations push the model outside the space it can actually explore (see: Do harder training environments always produce better empathetic AI agents?). Effectiveness is relational, not absolute: a great demo for one model is wasted on another.
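One way to picture student-side filtering is a rule that accepts a teacher's refinement only when the student itself finds it not much less plausible than the original. The sketch below assumes this particular rule; the `student_score` callable (for example, a mean token log-probability under the student), the `margin` parameter, and the toy scorer are hypothetical stand-ins, not the method from the cited note.

```python
from typing import Callable, List, Tuple


def filter_refinements(
    pairs: List[Tuple[str, str]],
    student_score: Callable[[str], float],
    margin: float = 0.5,
) -> List[str]:
    """For each (original, refined) pair, keep the refined text only if the
    student's score for it is within `margin` of its score for the original.
    Otherwise fall back to the original response."""
    selected = []
    for original, refined in pairs:
        if student_score(refined) >= student_score(original) - margin:
            selected.append(refined)   # compatible with the student
        else:
            selected.append(original)  # beyond the student's frontier
    return selected


def toy_score(text: str) -> float:
    # Toy stand-in for a real student model: longer text scores lower.
    return -len(text.split()) / 10


pairs = [
    ("a b", "a b c"),                      # small refinement
    ("a b", "a b c d e f g h i j k l"),    # refinement far from the original
]
kept = filter_refinements(pairs, toy_score)
```

In practice the scorer would come from the student model itself, and the margin would need tuning rather than being fixed at 0.5.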
Two cautions round this out. First, surface mimicry is a trap. Imitating a strong model's confident style fools human evaluators while closing no real capability gap (see: Can imitating ChatGPT fool evaluators into thinking models improved?), and demonstrations that teach quality only through labeled examples tend to capture surface patterns rather than principled criteria; explicit frameworks transfer where raw examples don't (see: Can models learn argument quality from labeled examples alone?). Second, even *where* you place a demonstration matters: moving an identical demo block from the start to the end of a prompt can swing accuracy by up to 20%, independent of its content (see: How much does demo position alone affect in-context learning accuracy?).
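To make the position effect testable, one can build two prompts that differ only in where the demo block sits and compare a model's accuracy on each. The helper below is a minimal sketch: the `build_prompt` name, the `Input:`/`Output:` demo format, and the two position labels are assumptions, and no model call is included.

```python
from typing import List, Tuple


def build_prompt(
    instruction: str,
    demos: List[Tuple[str, str]],
    query: str,
    demo_position: str = "start",
) -> str:
    """Assemble a prompt with the identical demo block at the start or the end."""
    demo_block = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    if demo_position == "start":
        parts = [demo_block, instruction, query]
    elif demo_position == "end":
        parts = [instruction, query, demo_block]
    else:
        raise ValueError(f"unknown demo_position: {demo_position!r}")
    return "\n\n".join(parts)


demos = [("2+2", "4"), ("3+3", "6")]
start_prompt = build_prompt("Answer the arithmetic question.", demos, "Input: 4+4", "start")
end_prompt = build_prompt("Answer the arithmetic question.", demos, "Input: 4+4", "end")
```

Because both prompts contain exactly the same text, any accuracy difference between them can be attributed to placement rather than content.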
The thread tying it together is that the most effective demonstrations are complete enough to show the whole act, honest enough to include failure, and matched to what the learner can actually absorb. And the deeper move in the corpus is to stop relying on static demonstrations at all: the agent's own deployment generates next-state signals (tool outputs, errors, user replies) that can train it on terrain no curator ever imagined (see: Can agent deployment itself generate training signals automatically?). The best demonstration, ultimately, may be the agent's own experience.
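The idea that every action yields a next-state signal can be sketched as a logging step that pairs each action with whatever followed it. This is only an assumed framing: the `Step` type, the signal labels, and the record layout are hypothetical, and the sketch deliberately says nothing about which learning objective would consume these records.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    state: str        # what the agent saw before acting
    action: str       # what the agent did
    next_state: str   # what the environment returned
    signal_type: str  # hypothetical labels: "tool_output", "error", "user_reply"


def to_training_records(log: List[Step]) -> List[Dict[str, str]]:
    """Turn a deployment log into (context, observed outcome) records."""
    return [
        {
            "context": f"{step.state}\n{step.action}",
            "outcome": step.next_state,
            "signal": step.signal_type,
        }
        for step in log
    ]


deployment_log = [
    Step("User asks for the file size.", "run: ls -l report.txt",
         "ls: report.txt: No such file or directory", "error"),
    Step("File not found.", "ask: Which directory is the report in?",
         "It is under docs/.", "user_reply"),
]
records = to_training_records(deployment_log)
```

Note that the first record captures a failure the agent ran into on its own, which is exactly the kind of example a static, success-only demonstration set would lack.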
Sources (9 notes)
LIMI achieves 73.5% on AgencyBench using only 78 curated multi-turn trajectories, outperforming models trained on 10,000+ samples by 53.7%. Complete interaction sequences capturing tool use and reasoning appear to activate latent agentic patterns already present in pretrained models.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
RLVER research shows moderately demanding, well-aligned training environments produce better empathetic agents than maximally challenging configurations. Overly difficult setups push models outside their explorable space, causing instability rather than growth.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
Repositioning an identical demo block from prompt start to end shifts accuracy by up to 20% and flips nearly half of predictions. This spatial effect operates independently of demo content and spans multiple task types.
Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.