Can in-context reinforcement learning match human sample efficiency on real problems?
This explores whether learning from experience held in the model's context window — rather than from weight updates — can reach the kind of fast, few-shot learning humans show on messy real-world tasks, and what the corpus says is still missing.
In-context reinforcement learning means a model improves by reading its own past attempts inside the prompt instead of retraining its weights; the question is whether that can rival how quickly humans learn from a handful of tries. The honest answer the corpus points to: the mechanism exists and is promising, but it leans on structural tricks that humans don't seem to need, and the corpus is more interested in *what makes it work* than in claiming parity.
The foundational requirement is surprisingly specific. Models don't learn in-context from scattered examples; they need what the note "Why do trajectories matter more than individual examples for in-context learning?" calls trajectory burstiness: whole or partial runs through the *same* environment, packed together in context. Given that, a model can generalize to vastly different tasks with no weight updates at all. That's the closest thing here to human-style sample efficiency: learn the dynamics from a few coherent episodes, then transfer. But the 'few' is doing heavy lifting, because the trajectories have to be the right kind, not just any examples.
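To make the burstiness requirement concrete, here is a minimal sketch of packing a context so that steps from the query's own environment arrive as one contiguous, time-ordered run rather than as scattered examples. The `Step` fields, the `bursty_context` helper, and the step budget are all hypothetical names introduced for illustration, and the sketch assumes one episode per environment; none of this is taken from the cited note.

```python
from collections import defaultdict
from typing import NamedTuple


class Step(NamedTuple):
    env_id: str   # hypothetical: which environment level produced this step
    t: int        # time index within the episode
    obs: str
    action: str
    reward: float


def bursty_context(steps: list[Step], query_env: str, budget: int) -> list[Step]:
    """Fill the context with contiguous runs, same-environment run first."""
    by_env: dict[str, list[Step]] = defaultdict(list)
    for s in steps:
        by_env[s.env_id].append(s)

    # The query environment comes first, then the others in a stable order.
    order = [query_env] + sorted(e for e in by_env if e != query_env)

    context: list[Step] = []
    for env in order:
        run = sorted(by_env.get(env, []), key=lambda s: s.t)
        # A partial run is allowed when the budget runs out mid-trajectory.
        context.extend(run[: budget - len(context)])
        if len(context) >= budget:
            break
    return context
```

The design choice worth noticing is that the budget is spent on coherent runs before anything else: a truncated trajectory from the right environment is preferred over a complete mix of unrelated steps.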
Where it gets more human is in *how* episodes get used. The note "Should successful and failed episodes be processed differently?" shows that treating successes as concrete demonstrations and failures as abstracted lessons, rather than dumping everything in uniformly, reaches state-of-the-art performance on complex tasks while using substantially less context. The paper (SkillRL) explicitly notes that this asymmetry mirrors how human experts reason. So sample efficiency isn't just about how many trajectories you see; it's about compressing them the way a person extracts a rule from a mistake. Relatedly, "Can models improve themselves using only majority voting?" shows that models can manufacture their own reward signal from majority-vote consensus on unlabeled problems, bootstrapping improvement at test time without any ground-truth labels. That is another move toward learning from raw experience the way humans do on problems nobody graded for them.
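The majority-vote idea can be sketched in a few lines. This is an illustrative sketch only: the function name `majority_vote_rewards` and the binary 1.0/0.0 reward for agreeing with the consensus are assumptions made here, not details confirmed by the cited note, which says only that rewards come from majority voting across repeated samples.

```python
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Return the consensus answer and a pseudo-reward for each sample.

    Assumption: a sample earns 1.0 if it matches the most common answer
    and 0.0 otherwise. No ground-truth label is consulted anywhere.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns the single most frequent answer; ties go to
    # the answer that was seen first.
    consensus, _count = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards
```

For five samples `["42", "41", "42", "7", "42"]` the consensus is `"42"` and the rewards are `[1.0, 0.0, 1.0, 0.0, 1.0]`. The signal is only as good as the consensus, which is why the approach depends on majority answers tending to be correct.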
The corpus also names the wall. The note "Can agents learn beyond what their training data shows?" argues that agents trained only on static expert data can't learn from their own failures or exceed what curators imagined, which is precisely the brittleness in-context RL tries to escape by letting the model interact and adapt live. And "Can natural language feedback overcome numerical reward plateaus?" adds a sharp diagnostic: numerical rewards plateau because they say *that* you failed but not *why*, while a chain-of-thought critique unsticks the model. Human sample efficiency may come partly from the fact that we often learn against rich verbal explanation, not a scalar score, which suggests in-context RL closes the gap fastest when fed language feedback, not numbers.
One unexpected doorway if you want the human comparison made literal: the note "Can language models learn to model human decision making?" finds that models finetuned on psychology-experiment data out-predict purpose-built cognitive theories of human decision-making and even capture individual differences. So language models can model human decision behavior well, but that's a different claim from matching human *efficiency* on novel real problems, and no note here closes that last gap. The takeaway: in-context RL is approaching human-like learning by borrowing human-like structure (coherent trajectories, asymmetric memory, verbal critique), not by being natively as efficient.
Sources (6 notes)
In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.