Can in-context reinforcement learning match human sample efficiency on real problems?

This explores whether learning from experience held in the model's context window — rather than from weight updates — can reach the kind of fast, few-shot learning humans show on messy real-world tasks, and what the corpus says is still missing.

In-context reinforcement learning has a model improve by reading its own past attempts inside the prompt instead of retraining its weights. Can it rival how quickly humans learn from a handful of tries? The honest answer the corpus points to: the mechanism exists and is promising, but it leans on structural tricks that humans don't seem to need, and the corpus is more interested in *what makes it work* than in claiming parity.

The foundational requirement is surprisingly specific. Models don't learn in-context from scattered examples; they need what "Why do trajectories matter more than individual examples for in-context learning?" calls trajectory burstiness: full or partial runs through the *same* environment level, packed together in context. Given that, a model can generalize to vastly different tasks with no weight updates at all. That's the closest thing here to human-style sample efficiency: learn the dynamics from a few coherent episodes, then transfer. But the word 'few' hides a condition: the trajectories have to be the right kind, not just any examples.
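To make the idea of packing same-environment runs together concrete, here is a minimal Python sketch of assembling such a context. Everything in it is an assumption made for illustration: the `Trajectory` type, the `build_bursty_context` helper, and the `burstiness` score (simply the share of context runs that come from the query's environment) are hypothetical and are not taken from the cited work, which may define burstiness differently.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    env_id: str   # which environment level produced this run (hypothetical field)
    steps: list   # e.g. (observation, action, reward) tuples


def build_bursty_context(pool, query_env, n_slots, rng):
    """Fill the context with runs from the same environment as the query.

    Runs from other environments are used only when the query
    environment has fewer than n_slots runs available.
    """
    by_env = defaultdict(list)
    for traj in pool:
        by_env[traj.env_id].append(traj)

    same = list(by_env[query_env])
    rng.shuffle(same)
    context = same[:n_slots]

    if len(context) < n_slots:
        others = [t for t in pool if t.env_id != query_env]
        rng.shuffle(others)
        context += others[: n_slots - len(context)]
    return context


def burstiness(context, query_env):
    """Illustrative score: fraction of context runs from the query's environment."""
    if not context:
        return 0.0
    return sum(t.env_id == query_env for t in context) / len(context)


if __name__ == "__main__":
    pool = [Trajectory("maze-A", []) for _ in range(3)]
    pool += [Trajectory("maze-B", []) for _ in range(5)]
    ctx = build_bursty_context(pool, "maze-A", 3, random.Random(0))
    print(burstiness(ctx, "maze-A"))
```

A uniformly sampled context of the same size would usually score lower on this measure, which is the contrast the paragraph above draws between coherent episodes and scattered examples.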

Where it gets more human is in *how* episodes get used. "Should successful and failed episodes be processed differently?" shows that treating successes as concrete demonstrations and failures as abstracted lessons, rather than consolidating everything uniformly, reaches state-of-the-art on complex tasks while using substantially less context. The note adds that this asymmetry mirrors how human experts reason. So sample efficiency isn't just about how many trajectories you see; it's about compressing them the way a person extracts a rule from a mistake. Relatedly, "Can models improve themselves using only majority voting?" shows models can manufacture their own reward signal from majority-vote consensus on unlabeled problems, bootstrapping improvement at test time without any ground-truth labels. That is another move toward learning from raw experience the way humans do on problems nobody graded for them.
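The success/failure asymmetry described above can be sketched as a small memory store. This is only an illustration under stated assumptions, not SkillRL's implementation: the `Episode` and `AsymmetricMemory` names are hypothetical, and `summarize_failure` is a placeholder hook that in practice would probably be a model call turning a failed run into a short rule.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Episode:
    task: str
    transcript: str   # full action/observation log of the attempt
    success: bool


@dataclass
class AsymmetricMemory:
    # Hypothetical hook: maps a failed episode to a one-line lesson.
    summarize_failure: Callable[[Episode], str]
    demos: list = field(default_factory=list)
    lessons: list = field(default_factory=list)

    def add(self, ep: Episode) -> None:
        if ep.success:
            # Successes are kept concrete: store the transcript verbatim.
            self.demos.append(ep.transcript)
        else:
            # Failures are kept abstract: store only the lesson, once.
            lesson = self.summarize_failure(ep)
            if lesson not in self.lessons:
                self.lessons.append(lesson)

    def render(self, max_demos: int = 2) -> str:
        """Build the text block that would be placed in the prompt."""
        start = max(0, len(self.demos) - max_demos)
        parts = ["Lessons from failed attempts:"]
        parts += [f"- {lesson}" for lesson in self.lessons]
        parts.append("Worked examples:")
        parts += self.demos[start:]   # most recent demonstrations only
        return "\n".join(parts)
```

The context saving comes from the failure branch: a long failed transcript collapses into one line, and repeated failures of the same kind add nothing further.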

The corpus also names the wall. "Can agents learn beyond what their training data shows?" argues that agents trained only on static expert data can't learn from their own failures or exceed what curators imagined, which is precisely the brittleness in-context RL tries to escape by letting the model interact and adapt live. And "Can natural language feedback overcome numerical reward plateaus?" adds a sharp diagnostic: numerical rewards plateau because they say *that* you failed but not *why*, while a chain-of-thought critique lets a stuck model produce correct solutions. Human sample efficiency may come partly from the fact that we often learn against rich verbal explanation, not a scalar score, which suggests in-context RL closes the gap fastest when fed language feedback, not numbers.

One unexpected doorway if you want the human comparison made literal: "Can language models learn to model human decision making?" finds that LLMs finetuned on psychology-experiment data out-predict theory-driven cognitive models of human decision-making and even capture individual differences. So models can predict human behavior well, but that's a different claim from matching human *efficiency* on novel real problems, and no note here closes that last gap. The takeaway: in-context RL is approaching human-like learning by borrowing human-like structure (coherent trajectories, asymmetric memory, verbal critique), not by being natively as efficient.

Sources (6 notes)

Why do trajectories matter more than individual examples for in-context learning?

In-context learning for sequential decision-making requires full or partial trajectories from the same environment level, not just isolated examples. This structural property—trajectory burstiness—allows models to generalize across vastly different tasks without weight updates.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
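The majority-vote reward in this note is simple enough to sketch directly. The function below is a minimal illustration, assuming answers have already been extracted from each sample and normalized to comparable strings; the name `majority_vote_rewards` is hypothetical, and the sketch leaves out the policy update that would consume these rewards.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a batch of sampled answers.

    The most common answer is treated as the label. Each sample gets
    reward 1.0 if it matches that label and 0.0 otherwise. Ties are
    broken in favor of the answer that appears first in the batch.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


if __name__ == "__main__":
    # Five samples for one unlabeled problem; no ground truth is consulted.
    label, rewards = majority_vote_rewards(["42", "41", "42", "7", "42"])
    print(label, rewards)   # 42 [1.0, 0.0, 1.0, 0.0, 1.0]
```

The signal is only as good as the consensus: when the most common answer is wrong, every sample that agrees with it is rewarded anyway, which is why the note's point that consensus answers tend to be correct matters.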

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether in-context reinforcement learning can match human sample efficiency on real problems. The question remains open; the constraints may have shifted.

What a curated library found — and when (findings span 2023–2025; these are dated claims):
• Trajectory burstiness — coherent episode sequences packed together — is *required* for in-context learning to generalize; scattered examples fail. Models then transfer across novel tasks without weight updates (2023–2024).
• Asymmetric trajectory processing (successes as concrete demos, failures as abstracted rules) mirrors human expert reasoning and cuts context use while matching state-of-the-art on complex tasks (2024–2025).
• Natural language feedback (chain-of-thought critique) unsticks performance plateaus that numerical rewards alone cannot overcome; human learning may leverage rich verbal explanation as a core advantage (2025).
• Test-time RL via majority-vote consensus can bootstrap improvement on unlabeled problems without ground-truth reward, approximating self-directed learning (2025).
• Expert-only training locks agents into curator imagination; live adaptation is essential to exceed static data (2024).

Anchor papers (verify; mind their dates):
• arXiv:2312.03801 (Dec 2023) — in-context generalization foundation
• arXiv:2504.16084 (Apr 2025) — test-time RL mechanisms
• arXiv:2506.03106 (Jun 2025) — critique-GRPO and natural language feedback
• arXiv:2510.08558 (Oct 2025) — agent learning from early experience

Your task:
(1) RE-TEST EACH CONSTRAINT. For trajectory burstiness, asymmetric processing, and language-vs-numerical feedback: has finetuning at scale (arXiv:2505.11711), verifiable meta-reasoning (arXiv:2507.22844), or checklist alignment (arXiv:2507.18624) *relaxed* any of these limits? Where does each constraint still hold? Cite the mechanism that relaxed or confirmed it.
(2) Surface work from Sep–Oct 2025 (Machine Bullshit, Agent Learning via Early Experience) that contradicts or reshapes the human-efficiency claim. Does emergent truth-disregard undermine in-context RL parity?
(3) Propose 2 questions that assume the regime has moved: (a) Can verifiable meta-reasoning + language feedback close the gap below ~3-shot human performance? (b) Does early-experience learning (arXiv:2510.08558) suggest trajectory burstiness is now optional?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
