What blocks scaling from language models to autonomous agents?
If large language models excel at next-token prediction, why do they struggle with long-horizon goal-oriented tasks? This note explores whether the bottleneck is model capacity or the environments used to train them.
Nex-N1's diagnosis is that the LLM-to-agent transition is blocked by a misalignment between LLM pretraining (myopic next-token prediction) and the long-horizon goal-oriented nature of agentic tasks — and that bridging this requires not better models but a new scale of interactive environments. Scarcity of diverse environments leaves models as "System 1" responders without "System 2" rigor; lack of realistic grounding produces hallucinated tool use and brittle error recovery.
The structural claim is that environments must scale on three orthogonal dimensions, and a deficit on any one ruins the resulting policy. Complexity comes from agent hierarchies — NexAU is a lightweight high-throughput runtime that decouples agent definition from execution, treating sub-agents and tools as interchangeable functional units in a recursive ReAct-like architecture. Diversity comes from automated synthesis — NexA4A generates diverse agent architectures and workflows from natural-language specifications rather than human-designed templates, breaking the dependency on hand-built environments. Fidelity comes from grounding — NexGAP integrates real Model Context Protocol (MCP) tools and information fusion, generating trajectories rooted in authentic latency, stochasticity, and feedback loops.
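The idea of treating sub-agents and tools as interchangeable functional units can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not NexAU's actual API: the names `Unit`, `Tool`, `Agent`, and the scripted policies are hypothetical, and a hard-coded policy function stands in for the LLM that would choose actions in a real ReAct-style loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol, Tuple


class Unit(Protocol):
    """Anything an agent can invoke: a plain tool or another agent."""
    name: str

    def run(self, request: str) -> str: ...


@dataclass
class Tool:
    """Leaf unit wrapping an ordinary function."""
    name: str
    fn: Callable[[str], str]

    def run(self, request: str) -> str:
        return self.fn(request)


# A policy maps (request, observations so far) to the next action:
# ("call", unit_name, argument) or ("finish", answer).
# Hypothetical stand-in for an LLM deciding what to do next.
Policy = Callable[[str, List[str]], Tuple[str, ...]]


@dataclass
class Agent:
    """Unit that loops: pick an action, invoke a unit, record the observation."""
    name: str
    policy: Policy
    units: Dict[str, Unit] = field(default_factory=dict)
    max_steps: int = 8

    def run(self, request: str) -> str:
        observations: List[str] = []
        for _ in range(self.max_steps):
            action = self.policy(request, observations)
            if action[0] == "finish":
                return action[1]
            _, unit_name, argument = action
            # The callee may be a Tool or another Agent; the caller cannot tell.
            observations.append(self.units[unit_name].run(argument))
        return "step budget exhausted"


def researcher_policy(request: str, observations: List[str]) -> Tuple[str, ...]:
    if not observations:
        return ("call", "lookup", request)
    return ("finish", f"found: {observations[-1]}")


def manager_policy(request: str, observations: List[str]) -> Tuple[str, ...]:
    if not observations:
        return ("call", "researcher", request)
    return ("finish", observations[-1].upper())


lookup = Tool("lookup", lambda q: f"doc about {q}")
researcher = Agent("researcher", researcher_policy, {"lookup": lookup})
manager = Agent("manager", manager_policy, {"researcher": researcher})
result = manager.run("mcp")
```

Because `Tool` and `Agent` expose the same `run` method, the manager calls the researcher exactly as it would call a tool, which is what makes the hierarchy recursive. Definition (the dataclasses and policies) stays separate from execution (the `run` loop), loosely mirroring the decoupling the note attributes to NexAU.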
The orthogonality matters because earlier frameworks fail in characteristic ways: rigid graph-based orchestrators provide reliability but limit diversity; pure synthetic environments provide diversity but break on real execution. Treating environments as generative language specifications rather than static code is the move that lets all three axes scale together. The empirical signal — Nex-N1 outperforms SOTA open-source models and approaches frontier proprietary models on SWE-bench and τ2 — supports the thesis that the limiting reagent has been environments, not parameters.
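To make "environments as specifications rather than static code" concrete, here is a small sketch in which an agent hierarchy is plain data that a builder turns into something executable. Everything in it is an assumption for illustration: the spec schema, the `TOOLS` registry, and the `build` function are hypothetical and do not come from NexA4A, and each agent is simplified to a sequential pipeline rather than a ReAct loop. The JSON text stands in for what a generator might emit from a natural-language description.

```python
import json
from typing import Callable, Dict

# Hypothetical registry of leaf tools; names and behavior are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda q: f"results for {q}",
    "summarize": lambda text: text[:20],
}


def build(spec: dict) -> Callable[[str], str]:
    """Turn a declarative spec into a callable.

    A node is either {"tool": name} or {"agent": name, "steps": [nodes]}.
    In this simplified sketch an agent pipes the request through its steps in order.
    """
    if "tool" in spec:
        return TOOLS[spec["tool"]]
    steps = [build(child) for child in spec["steps"]]

    def run(request: str) -> str:
        value = request
        for step in steps:
            value = step(value)
        return value

    return run


# The environment is data: changing this text changes the architecture
# without touching any of the code above.
SPEC_TEXT = """
{"agent": "analyst", "steps": [
    {"agent": "gatherer", "steps": [{"tool": "search"}]},
    {"tool": "summarize"}
]}
"""

analyst = build(json.loads(SPEC_TEXT))
out = analyst("agent environments")
```

The design point is that new architectures are produced by generating new spec text, not by writing new orchestration code, which is why diversity can scale without hand-built templates.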
This stands in productive tension with "Can careful selection of 78 demos outperform massive training datasets?", which argues that strategic data curation beats environment scale. The likely resolution is that environment richness sets the ceiling that curated data can exploit, so environment scaling complements curation rather than substituting for it.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries.
- Why do language models fail at planning despite understanding strategies?
- What scaling laws govern autonomous architecture discovery in AI systems?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Why does AI code generation lag behind pattern-matching benchmarks?
- Why do language models plateau at constraint satisfaction regardless of scale?
- Which agent architectures consistently outperform base models on hard prediction questions?
Related concepts in this collection (5)
- Can careful selection of 78 demos outperform massive training datasets?
Does strategic curation of high-quality demonstrations unlock agentic capability more efficiently than scaling training data? LIMI achieved 73.5% on AgencyBench with 78 samples versus 10K+ samples for competing models, suggesting data quality may matter more than quantity.
tension with: LIMI argues 78 curated demos beat data abundance; Nex-N1 argues environments are the limiting reagent and must scale. Both can be true if environment richness sets the curation ceiling.
- Can agents learn beyond what their training data shows?
Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.
complements: explains why diverse environments matter — curated demos cap exploration to what curators imagined; environment scaling breaks that ceiling.
- Can you turn an LLM into an agent by just fine-tuning?
Explores whether upgrading language models to action-producing systems requires only model retraining or demands a broader pipeline transformation including data collection, grounding, integration, and safety evaluation.
extends: LAM defines the pipeline structure; Nex-N1 specifies what environment scaling must look like at the data-collection and grounding stages.
- Can agent deployment itself generate training signals automatically?
Can we extract learning signals from the natural next-states that agents encounter during real deployment—user replies, tool outputs, test verdicts—rather than relying on separate annotation pipelines? This reframes how agents improve continuously.
complements: high-fidelity environments produce informative next-state signals; the value of next-state learning depends on environment fidelity.
- Why does random tool sampling produce unrealistic synthetic training data?
Tool-calling datasets generated through random sampling and single-turn framing lack the complexity and coherence of real deployment. This explores what structural choices in data synthesis determine whether models can learn realistic tool composition.
exemplifies: ToolFlow is the diversity-and-fidelity argument applied to one specific data-generation pipeline.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
- Chain-of-thought Reasoning Is A Policy Improvement Operator
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Artifacts as Memory Beyond the Agent Boundary
- Scaling Laws for Agent Harnesses via Effective Feedback Compute
- AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents
- FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
- From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
Original note title
agentic training requires environment scaling along three orthogonal dimensions — complexity diversity and real-world fidelity must scale together