SYNTHESIS NOTE
Agentic Systems and Tool Use

Can careful selection of 78 demos outperform massive training datasets?

Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.

Synthesis note · 2026-02-23 · sourced from Agents

The LIMI paper challenges the core assumption that agentic capability scales with training data volume. Using only 78 carefully designed training samples, each capturing a complete multi-turn interaction sequence (tool use, reasoning, and environmental feedback), LIMI achieves 73.5% on AgencyBench, well ahead of Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI shows a 53.7% improvement over models trained on 10,000 samples.
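To make the comparison concrete, the arithmetic behind these figures can be sketched in a few lines. This only subtracts and divides the numbers quoted in this note; the function names are illustrative, and the 53.7% figure is not recomputed because the note does not give the score of the 10,000-sample models.

```python
# AgencyBench scores (percent) as quoted in this note.
LIMI_SCORE = 73.5
BASELINES = {
    "Kimi-K2-Instruct": 24.1,
    "DeepSeek-V3.1": 11.9,
    "Qwen3-235B-A22B-Instruct": 27.5,
    "GLM-4.5": 45.1,
}


def point_gaps(limi, baselines):
    """Percentage-point lead of LIMI over each baseline."""
    return {name: round(limi - score, 1) for name, score in baselines.items()}


def sample_ratio(large, small):
    """How many times larger one training set is than another."""
    return large / small


gaps = point_gaps(LIMI_SCORE, BASELINES)
ratio = sample_ratio(10_000, 78)

for name, gap in gaps.items():
    print(f"{name}: LIMI leads by {gap} points")
print(f"10,000 samples vs 78 samples: about {ratio:.0f}x fewer")
```

The smallest lead is over GLM-4.5 (28.4 points), and 78 samples is roughly 128 times fewer than 10,000.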

Three innovations drive this:

  1. Agentic query synthesis — human-AI collaborative collection from real-world scenarios plus systematic GitHub PR-based synthesis, ensuring ecological validity
  2. Complete trajectory collection — full multi-turn sequences from task understanding through tool utilization to successful completion, not isolated demonstrations
  3. The Agency Efficiency Principle — machine autonomy emerges from strategic curation, not data accumulation
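As a rough illustration of what "complete trajectory collection" could mean in practice, the sketch below models a trajectory as an ordered list of turns and keeps only those that contain reasoning, tool use, and environmental feedback and that end in success. The schema, the field names, and the `is_complete` rule are all hypothetical; the note does not describe LIMI's actual data format or filtering criteria.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    # Hypothetical labels: kind is one of
    # "reasoning", "tool_call", "feedback", "final".
    kind: str
    content: str


@dataclass
class Trajectory:
    query: str                      # the agentic task being solved
    turns: List[Turn] = field(default_factory=list)
    succeeded: bool = False         # did the task reach successful completion?


# Assumed minimum for a "complete" trajectory: it must show the model
# reasoning, calling a tool, and receiving feedback from the environment.
REQUIRED_KINDS = {"reasoning", "tool_call", "feedback"}


def is_complete(traj: Trajectory) -> bool:
    """True if the trajectory succeeded and covers every required turn kind."""
    kinds = {turn.kind for turn in traj.turns}
    return traj.succeeded and REQUIRED_KINDS <= kinds


def curate(trajectories: List[Trajectory]) -> List[Trajectory]:
    """Keep only complete trajectories; isolated demonstrations are dropped."""
    return [t for t in trajectories if is_complete(t)]


full = Trajectory(
    query="Fix the failing unit test",
    turns=[
        Turn("reasoning", "The test likely fails because of an off-by-one."),
        Turn("tool_call", "run_tests()"),
        Turn("feedback", "1 failed, 12 passed"),
        Turn("final", "Patched the loop bound; all tests pass."),
    ],
    succeeded=True,
)
isolated = Trajectory(
    query="Fix the failing unit test",
    turns=[Turn("final", "Change the loop bound.")],
    succeeded=True,
)

kept = curate([full, isolated])
```

Here `kept` contains only `full`: the second example reaches an answer but shows none of the intermediate interaction, which is the kind of isolated demonstration the second point contrasts against.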

This extends a pattern now documented across three capability domains: reasoning (LIMO reached complex mathematical reasoning with 817 samples), instruction-following (LIMA achieved alignment with 1,000 examples), and now agency. In line with the related note "Do base models already contain hidden reasoning ability?", the mechanism is likely the same: curated demonstrations activate latent agentic patterns already embedded through pretraining on code, documentation, and workflow descriptions. On this view, the training data doesn't teach agency; it triggers the phase transition from passive language model to active agent.

The practical implication challenges the resource-intensive approach to building agentic systems. If 78 demonstrations outperform 10K samples, the bottleneck is data quality and trajectory design, not data volume. Read alongside the related note "Can models improve themselves on tasks without verifiable answers?", there appears to be a consistent principle: capability activation requires showing the model what using a capability looks like, not exhaustive training.

Original note title

agency emerges from strategic curation of 78 demonstrations not data abundance — challenging scaling paradigms for agentic intelligence