INQUIRING LINE

Does the 78-demonstration principle apply to other AI capabilities beyond agency?

This explores whether the idea that a tiny set of demonstrations ("78") can unlock agentic behavior reflects a broader pattern — that small amounts of training elicit latent capabilities rather than teaching new ones — across reasoning, initiative, and self-improvement.


This reads the "78-demonstration principle" as a specific instance of a more general claim: that a small, well-chosen training signal doesn't *create* a capability so much as *surface* one the model already latently holds. Read that way, the corpus has a lot to say — and the most striking material isn't about agency at all. The cleanest cross-capability evidence comes from reasoning: five independent methods (RL steering, critique fine-tuning, decoding tweaks, feature steering, RLVR) all elicit reasoning that already lives in base-model activations, suggesting post-training *selects* reasoning rather than building it (Do base models already contain hidden reasoning ability?). If reasoning works this way, the "few demonstrations unlock a lot" pattern isn't unique to agency — it's a property of how these models store competence.

Initiative looks the same. Proactive behavior — asking clarifying questions, pushing back — jumps from near-zero (0.15%) to dominant (73.98%) under modest RL, which implies the capacity was structurally suppressed rather than absent; standard next-turn reward optimization trains the initiative *out*, and a little targeted signal trains it back (Why do AI agents fail to take initiative?). That's the same shape as the 78-demonstration story: the bottleneck is elicitation, not acquisition. Even reasoning in *unverifiable* domains can be recovered from demonstrations alone, by inferring the implicit reward behind expert behavior rather than needing a task verifier (Can reasoning emerge from expert demonstrations alone?).

But the corpus also marks the principle's edges, and this is where it gets interesting. Demonstrations are a ceiling as much as a key: agents trained only on static expert data are locked into "the imagination of the curator" — they can't learn from their own failures or generalize past what was shown, because they never interacted with anything (Can agents learn beyond what their training data shows?). So a small demonstration set may *unlock* a capability while simultaneously *bounding* it. The contrast case is self-improvement, which seems to break the demonstration ceiling entirely: the Darwin Gödel Machine improves by empirical trial-and-error and an evolutionary archive rather than by imitating demonstrations at all (Can AI systems improve themselves through trial and error?).
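To make the contrast with imitation concrete, here is a minimal sketch of archive-based trial-and-error improvement. It is a toy illustration, not the Darwin Gödel Machine's actual implementation: the "agent" is just a list of floats, and `benchmark`, `mutate`, and `evolve` are hypothetical names invented for this example. Uniform sampling of parents is a further simplifying assumption.

```python
import random

def benchmark(agent):
    """Hypothetical stand-in for an empirical benchmark: higher is better."""
    # Toy objective: the score peaks at 0.0 when every parameter equals 1.0.
    return -sum((x - 1.0) ** 2 for x in agent)

def mutate(agent, rng, step=0.3):
    """Hypothetical self-modification: perturb one randomly chosen parameter."""
    child = list(agent)
    i = rng.randrange(len(child))
    child[i] += rng.uniform(-step, step)
    return child

def evolve(generations=200, seed=0):
    rng = random.Random(seed)
    initial = [0.0, 0.0, 0.0]
    # The archive keeps every variant tried, not only the current best,
    # so later generations can branch from older stepping stones.
    archive = [(benchmark(initial), initial)]
    for _ in range(generations):
        _, parent = rng.choice(archive)            # sample any archived variant
        child = mutate(parent, rng)
        archive.append((benchmark(child), child))  # score empirically, then keep
    return archive

archive = evolve()
best_score, best_agent = max(archive, key=lambda entry: entry[0])
print(f"start score: {archive[0][0]:.3f}, best score: {best_score:.3f}")
```

Nothing in the loop consults a demonstration: the only training signal is the measured score of each variant, which is the property the paragraph above attributes to trial-and-error self-improvement.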

The sharpest caution is about what "unlocked" even means. A model can pass every test while its internal representation is fractured and incoherent — identical outputs, radically different internal structure, invisible to benchmarks (Can AI pass every test while understanding nothing?). So a few demonstrations might elicit *behavior that scores well* without eliciting coherent underlying competence — meaning the principle's success metric may be measuring the wrong thing.
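A small sketch can show why output-only tests cannot see internal structure. It demonstrates only a weaker, well-known fact (hidden units in a ReLU network can be permuted and rescaled without changing the function), not the Fractured Entangled Representation hypothesis itself. The two networks and their weights are made up for illustration.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def forward(x, w_in, w_out):
    """One-hidden-layer ReLU net on a scalar input; returns (output, hidden)."""
    hidden = relu([w * x for w in w_in])
    output = sum(h * w for h, w in zip(hidden, w_out))
    return output, hidden

# Network A (hypothetical weights).
a_in, a_out = [1.0, -2.0], [0.5, 0.25]
# Network B: A's hidden units swapped, each scaled by 4, with the output
# weights divided by 4 to compensate. Same function, different internals.
b_in, b_out = [-8.0, 4.0], [0.0625, 0.125]

inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
results_a = [forward(x, a_in, a_out) for x in inputs]
results_b = [forward(x, b_in, b_out) for x in inputs]

for x, (out_a, hid_a), (out_b, hid_b) in zip(inputs, results_a, results_b):
    print(f"x={x:+.1f}  out A={out_a}  out B={out_b}  hidden A={hid_a}  hidden B={hid_b}")
```

Any benchmark that scores only the outputs would rate the two networks identically, even though their hidden activations disagree on every nonzero input.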

The thing you didn't know you wanted to know: the real generalization isn't "78 demonstrations unlock agency." It's that for capabilities the model already holds latently (reasoning, initiative, implicit reward inference), minimal training reliably elicits them, *but* demonstration-based elicitation caps generalization at the demonstrator's imagination and can't distinguish genuine competence from a coherent-looking shell. The capabilities that escape both limits (open-ended self-improvement) are exactly the ones that abandon demonstrations for interaction.


Sources (6 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why do AI agents fail to take initiative?

Research shows next-turn reward optimization structurally removes initiative from models, but proactive behaviors like critical thinking and clarification-seeking are trainable (0.15% to 73.98% with RL). The core challenge is balancing proactivity with civility to avoid intrusion.

Can reasoning emerge from expert demonstrations alone?

RARO recovers implicit reward functions from expert demonstrations through adversarial co-training between a reasoning policy and relativistic critic. This approach matches verifier-based RL performance on reasoning tasks while extending to domains lacking automated verification.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether the '78-demonstration principle'—that minimal training elicits latent rather than builds new capability—holds across AI competencies beyond agency. The question remains open: what are the real boundaries of elicitation vs. acquisition?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable anchors.
- Reasoning, initiative, and implicit reward inference all show the same pattern: modest training (RL, critique, feature steering) lifts competence from near-zero visible output to high performance (0.15% → 73.98% for proactive behavior), implying latent capacity was structurally *suppressed*, not absent (2025–2026).
- Demonstration-based elicitation is simultaneously a ceiling: agents trained only on expert data cannot generalize past curator imagination or learn from own failures, because they lack interaction with a live environment (~2025).
- Self-improvement via trial-and-error (Darwin Gödel Machine) breaks the demonstration ceiling entirely, achieving open-ended improvement without imitating static data (~2025).
- A model can pass all benchmarks while holding internally fractured, incoherent representations invisible to external tests (~2025)—so "elicited" behavior may score well without grounding genuine competence.

Anchor papers (verify; mind their dates):
- arXiv:2511.21667 (2025-11): Escaping the Verifier: Learning to Reason via Demonstrations
- arXiv:2505.22954 (2025-05): Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
- arXiv:2505.11581 (2025-05): Questioning Representational Optimism in Deep Learning
- arXiv:2506.12115 (2025-06): Eliciting Reasoning in Language Models with Cognitive Tools

Your task:
(1) RE-TEST EACH CONSTRAINT. For reasoning, initiative, and reward inference, do newer post-training methods (DPO, GRPO) and newer model families (vision transformers, multimodal models) from ~late 2025–2026 still show the elicitation-not-creation pattern, or have methods now forced genuine capability *expansion*? Probe whether the 0.15% → 73.98% jump holds for systems trained after June 2025. Separately: has the "demonstration ceiling" been broken in practice by hybrid RL + interaction methods, or does it still bind production agents?
(2) Surface the strongest *contradicting* work from the last ~6 months: any paper claiming demonstrations *do* build new capability, or showing that post-training creates competence rather than surfacing it. Flag disagreement about what "latent" even means.
(3) Propose two research questions that assume the regime may have moved: (a) If elicitation holds for language reasoning but breaks for embodied / multimodal reasoning, what structural difference explains it? (b) Can a hybrid method—demonstration + minimal interaction—break the ceiling while staying sample-efficient?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
