How can language models extract more value from fewer demonstrations?

This explores how models squeeze more learning out of a small number of examples — the sample-efficiency problem — by looking at what kind of signal each demonstration carries and where the hard limits are.

The corpus suggests the answer is less about *more data* than about *richer signal per example*. The most direct lever is making each demonstration do double duty by pairing right and wrong answers. Small models fine-tuned with DPO on a teacher's correct-and-incorrect function-calling pairs beat the same models trained on correct examples alone, because the explicit negative shows the model exactly which format failures to avoid: a contrastive example teaches a boundary, not just a target (see "Can small models match large models on function calling?"). The same instinct shows up in a different guise: instead of importing labeled preferences, a model can mine signal it already produces. Using the model's own answer-span confidence to rank its reasoning traces creates synthetic preferences that sharpen step-by-step reasoning with zero human labels (see "Can model confidence work as a reward signal for reasoning?"). Likewise, 'post-completion learning' reuses the normally discarded sequence space after a model finishes answering to train it to grade its own work, adding learning signal at zero inference cost (see "Can models learn to evaluate their own work during training?").
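To make the contrastive idea concrete, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. It is not taken from the function-calling work itself: the function name `dpo_loss`, the `beta` value, and the example log-probabilities are illustrative assumptions, and a real training run would compute these log-probabilities from a policy model and a frozen reference model.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; tuned per setup in practice
) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response
    (for example, a function call) under the policy or reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) == softplus(-logits)
    return _softplus(-logits)


# Hypothetical numbers: the policy has shifted toward the well-formed call
# and away from the malformed one, relative to the reference.
loss_when_separating = dpo_loss(-3.0, -9.0, -5.0, -5.0)
# The policy is identical to the reference, so the pair is not yet separated.
loss_at_start = dpo_loss(-5.0, -5.0, -5.0, -5.0)
```

The loss only falls when the policy raises the chosen response *relative to* the rejected one, which is the sense in which the negative example defines a boundary rather than a target.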

The more ambitious version of the same idea is to stop needing demonstrations at all and manufacture the missing feedback. A three-role self-play loop (a Challenger that escalates difficulty, a Reasoner that attempts each task, and a neutral Judge that gives binary verdicts) co-evolves skills with no human supervision, effectively generating its own curriculum and reward (see "Can language models learn skills without human supervision?"). On the architecture side, latent-thought models add a scaling dimension that is neither parameters nor data: a fast-learning set of latent vectors gives strong few-shot reasoning with far better sample efficiency than scaling the model up (see "Can latent thought vectors scale language models beyond parameters?").
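The control flow of such a three-role loop can be sketched with a toy task. This is an illustrative assumption, not the Ctx2Skill implementation: the classes `Challenger` and `Reasoner`, the `judge` function, the list-summing task, and the "capacity" stand-in for a skill edit are all hypothetical, chosen only so the loop runs deterministically without a model.

```python
from dataclasses import dataclass


@dataclass
class Challenger:
    difficulty: int = 1

    def propose(self) -> list[int]:
        # Toy task: sum the integers 1..difficulty.
        return list(range(1, self.difficulty + 1))

    def update(self, reasoner_succeeded: bool) -> None:
        # Escalate after a success, ease off after a failure (the curriculum).
        if reasoner_succeeded:
            self.difficulty += 1
        else:
            self.difficulty = max(1, self.difficulty - 1)


@dataclass
class Reasoner:
    capacity: int = 2  # longest task it currently solves correctly

    def attempt(self, task: list[int]) -> int:
        # Beyond its capacity the Reasoner truncates the task and answers wrongly.
        return sum(task[: self.capacity])

    def update(self, succeeded: bool) -> None:
        # Stand-in for a skill edit triggered by a failed verdict.
        if not succeeded:
            self.capacity += 1


def judge(task: list[int], answer: int) -> bool:
    """Binary verdict only: no partial credit, no explanation."""
    return answer == sum(task)


def self_play(rounds: int) -> tuple[list[tuple[int, bool]], int]:
    challenger, reasoner = Challenger(), Reasoner()
    history = []
    for _ in range(rounds):
        task = challenger.propose()
        verdict = judge(task, reasoner.attempt(task))
        history.append((challenger.difficulty, verdict))
        reasoner.update(verdict)    # verdict acts as the reward signal
        challenger.update(verdict)  # verdict also drives the curriculum
    return history, reasoner.capacity


history, final_capacity = self_play(rounds=6)
```

Note that the toy `judge` here checks answers exactly, which is a luxury a learned Judge does not have; the sketch shows only how one binary verdict feeds both the reward and the curriculum.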

But the corpus also draws a sharp line around what few demonstrations can ever buy you. Prompting and prompt optimization operate entirely inside a model's existing training distribution: they reorganize and activate knowledge that is already there, but cannot inject knowledge the model never learned (see "Can prompt optimization teach models knowledge they lack?"). Worse, when in-context examples conflict with strong parametric priors, the model often ignores the demonstration entirely; textual prompting alone can't override what training baked in (see "Why do language models ignore information in their context?"). So 'more value from fewer demonstrations' has a ceiling: examples are excellent at *steering* latent capability and terrible at *adding* it.

The deepest version of that ceiling is structural. Self-improvement, meaning squeezing value out of the model's own outputs rather than fresh data, is formally bounded by the generation-verification gap: every reliable improvement needs something external that can verify and enforce it, so a model can't bootstrap past its limits through metacognition alone (see "What stops large language models from improving themselves?"). This is exactly why the confidence-as-reward and self-play tricks work: they smuggle in a *verifier* (a confidence signal, a binary judge) to stand in for the missing external check.
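As a rough illustration of a confidence signal standing in for a verifier, the sketch below ranks sampled reasoning traces by answer-span confidence and turns them into preference pairs. The scoring rule (mean token probability over the answer span), the `min_gap` threshold, and the names `answer_confidence` and `build_preference_pairs` are assumptions made for the example; the source notes do not specify how RLSF computes or thresholds confidence.

```python
import math
from itertools import combinations


def answer_confidence(answer_token_logprobs: list[float]) -> float:
    """Assumed confidence score: mean token probability over the answer span."""
    probs = [math.exp(lp) for lp in answer_token_logprobs]
    return sum(probs) / len(probs)


def build_preference_pairs(
    traces: list[tuple[str, list[float]]],
    min_gap: float = 0.1,  # hypothetical threshold to skip near-ties
) -> list[dict[str, str]]:
    """Turn (trace_text, answer_token_logprobs) samples into chosen/rejected pairs."""
    scored = sorted(
        ((answer_confidence(lps), text) for text, lps in traces),
        reverse=True,  # most confident first
    )
    pairs = []
    # combinations() preserves order, so the first item is always the more confident.
    for (hi_conf, hi_text), (lo_conf, lo_text) in combinations(scored, 2):
        if hi_conf - lo_conf >= min_gap:
            pairs.append({"chosen": hi_text, "rejected": lo_text})
    return pairs


# Hypothetical traces for one prompt, with made-up answer-span log-probabilities.
sampled = [
    ("trace A", [-0.05, -0.10]),
    ("trace B", [-1.00, -1.20]),
    ("trace C", [-0.06, -0.10]),
]
pairs = build_preference_pairs(sampled)
```

The resulting pairs could feed a preference-optimization objective such as DPO. The `min_gap` filter reflects one design choice: pairs whose confidences are nearly tied carry little contrast, so they are dropped rather than treated as preferences.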

The quiet payoff here is a reframe. The question that looks like 'how do I learn from less data' is really two questions — how rich is the signal in each example (negatives and self-generated preferences beat plain positives), and is there a verifier in the loop (without one, extra demonstrations just re-shuffle what the model already believes). Fewer demonstrations work when each one carries a contrast and a check, not when you simply hand the model more correct answers to imitate.

Sources (8 notes)

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question **How can language models extract more value from fewer demonstrations?** remains open despite recent progress. Treat the findings below as dated claims (2024–2026) to be re-tested against current models and methods.

**What a curated library found — and when (dated claims, not current truth):**
Findings span 2024–2026:
- Pairing correct and incorrect examples (DPO) beats correct-only training; explicit negatives teach boundaries, not just targets (~2024).
- Models can mine their own confidence-ranked reasoning traces as synthetic preferences, sharpening step-by-step reasoning without human labels (~2025).
- Post-completion learning reuses post-EOS token space to train self-grading at zero inference cost (~2025).
- Self-play loops (Challenger–Reasoner–Judge) co-evolve skills with no human supervision (~2024).
- Latent-thought vectors scale few-shot reasoning better than parameter/data scaling alone (~2025).
- **Ceiling:** Prompting activates existing knowledge but cannot inject new knowledge; in-context examples fail when parametric priors override them (~2024–2026).
- **Deeper ceiling:** Self-improvement is formally bounded by the generation-verification gap; without external verification, models cannot bootstrap past their limits (~2024).

**Anchor papers (verify; mind their dates):**
- arXiv:2410.18890 (2024) — Small-model function calling via DPO.
- arXiv:2507.20252 (2025) — Post-completion learning.
- arXiv:2502.01567 (2025) — Latent-thought scaling.
- arXiv:2412.02674 (2024) — Self-improvement and the verification gap.

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For DPO, synthetic preferences, post-completion learning, and self-play: do newer model scales, instruction-tuning variants, or RL orchestration (e.g., verifier-in-the-loop tooling, retrieval augmentation) now relax the knowledge-injection ceiling? Does the generation-verification gap still hold for modern models, or do emerging verifier architectures bypass it? Separate the durable question (signal quality and verification matter) from perishable limits (maybe solved by recent training recipes).

(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has anything since mid-2025 shown that in-context learning *can* inject novel knowledge under certain conditions? Any new verifier designs that weaken the self-improvement bound?

(3) **Propose 2 research questions that assume the regime may have moved:**
   - If latent-thought scaling and verifiers have matured, what is the *new* frontier — sample efficiency beyond few-shot, or reliability of self-generated feedback at scale?
   - Can orchestrated multi-agent loops (with external verifiers) now achieve genuine self-bootstrapping where single-model self-play cannot?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
