Where does agent reliability actually come from?
Exploring whether LLM agent performance depends on larger models or on thoughtful system design choices like memory, skills, and protocols that shift cognitive work outside the model.
Drawing on Norman's concept of cognitive artifacts, this paper argues that the most consequential design choices in LLM agents are about externalization — relocating cognitive burdens from the model's internal computation into persistent, inspectable, reusable external structures. A shopping list doesn't expand memory; it changes recall into recognition. The same logic governs agent design.
Three dimensions of externalization address three recurrent mismatches:
Memory externalizes state across time. The context window is finite and session memory is weak. Memory systems transform recall into recognition — the agent retrieves past knowledge from a persistent store rather than regenerating it from weights. This solves the continuity problem.
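As a minimal sketch of the recall-to-recognition shift, the example below writes notes to a JSON file in one "session" and looks them up by word overlap in another. The `MemoryStore` class, its method names, and the overlap scoring are assumptions made for illustration, not a design taken from the paper; a real memory system would likely use embeddings or an index.

```python
import json
import tempfile
from pathlib import Path


class MemoryStore:
    """Hypothetical persistent store: notes survive across sessions in a JSON file."""

    def __init__(self, path: Path) -> None:
        self.path = path
        # Load notes written by earlier sessions, if the file already exists.
        self.notes: list[str] = json.loads(path.read_text()) if path.exists() else []

    def remember(self, note: str) -> None:
        self.notes.append(note)
        self.path.write_text(json.dumps(self.notes))

    def recall(self, query: str, k: int = 1) -> list[str]:
        """Return up to k notes that share the most words with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(n.lower().split())), n) for n in self.notes]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [note for score, note in ranked[:k] if score > 0]


store_path = Path(tempfile.mkdtemp()) / "memory.json"

# Session 1: the agent writes down what it learned.
MemoryStore(store_path).remember("deploy target is the staging cluster")
MemoryStore(store_path).remember("user prefers tabs over spaces")

# Session 2: a fresh object stands in for a fresh context window.
recalled = MemoryStore(store_path).recall("which cluster is the deploy target")
print(recalled)
```

The second session never regenerates the fact; it only has to recognize the stored note as relevant to the query.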
Skills externalize procedural expertise. Left to the model alone, long multi-step procedures are rederived on every run rather than executed consistently. Skill systems transform generation into composition: the agent assembles behavior from pre-validated components rather than improvising each step. This solves the variance problem.
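One way to picture generation turning into composition is a registry of small, separately tested steps that a plan merely names. The registry contents and the `compose` helper below are hypothetical and chosen only to keep the sketch short.

```python
from typing import Callable

Skill = Callable[[str], str]

# Hypothetical skill registry: each entry is a small step that can be tested alone.
SKILLS: dict[str, Skill] = {
    "strip": lambda text: text.strip(),
    "lowercase": lambda text: text.lower(),
    "collapse_spaces": lambda text: " ".join(text.split()),
}


def compose(plan: list[str]) -> Skill:
    """Build one procedure from named skills, rejecting unknown names up front."""
    unknown = [name for name in plan if name not in SKILLS]
    if unknown:
        raise KeyError(f"unknown skills: {unknown}")

    def run(text: str) -> str:
        for name in plan:
            text = SKILLS[name](text)
        return text

    return run


# The model only chooses the plan; the behavior of each step is fixed.
normalize = compose(["strip", "lowercase", "collapse_spaces"])
result = normalize("  Hello    WORLD  ")
print(result)
```

Because the plan can only reference registered names, run-to-run variance is confined to which skills are selected, not to how each step is carried out.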
Protocols externalize interaction structure. Interactions with tools, services, and collaborators are brittle when left to free-form prompting. Protocols transform ad-hoc coordination into structured contracts (e.g., MCP). This solves the coordination problem.
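The sketch below shows the general idea of a structured contract: a request is checked against declared parameter names and types before any tool code runs. The `ToolContract` type and the request shape are assumptions for illustration and are not MCP's actual message format.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolContract:
    """Hypothetical contract: a tool name, typed parameters, and a handler."""
    name: str
    params: dict[str, type]
    handler: Callable[..., Any]


def call_tool(contract: ToolContract, request: dict[str, Any]) -> dict[str, Any]:
    """Validate a request against the contract before running the tool."""
    args = request.get("arguments", {})
    if request.get("tool") != contract.name:
        return {"ok": False, "error": f"unknown tool {request.get('tool')!r}"}
    missing = sorted(set(contract.params) - set(args))
    extra = sorted(set(args) - set(contract.params))
    if missing or extra:
        return {"ok": False, "error": f"missing={missing} unexpected={extra}"}
    for key, expected in contract.params.items():
        if not isinstance(args[key], expected):
            return {"ok": False, "error": f"{key} must be {expected.__name__}"}
    return {"ok": True, "result": contract.handler(**args)}


add = ToolContract("add", {"a": int, "b": int}, lambda a, b: a + b)

good = call_tool(add, {"tool": "add", "arguments": {"a": 2, "b": 3}})
bad = call_tool(add, {"tool": "add", "arguments": {"a": "2", "b": 3}})
print(good)
print(bad)
```

A malformed call fails at the boundary with a machine-readable error instead of producing an unpredictable result inside the tool.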
The harness is not a fourth dimension — it is the engineering layer that hosts all three and provides orchestration logic, constraints, observability, and feedback loops. The progression is: weights → context → harness, paralleling the human history of cognitive externalization (speech → writing → printing → computation).
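A toy version of that engineering layer might look like the loop below, which imposes a step budget (constraint), records every attempt (observability), and feeds verifier errors back into the next prompt (feedback loop). `run_harness`, `stub_model`, and `digits_only` are hypothetical names; the stub stands in for a real model call.

```python
from typing import Callable, Optional


def run_harness(
    model: Callable[[str], str],
    verify: Callable[[str], Optional[str]],
    task: str,
    max_steps: int = 3,
) -> tuple[Optional[str], list[dict]]:
    """Hypothetical harness loop: bounded retries, a trace, and verifier feedback."""
    trace: list[dict] = []                 # observability: every step is recorded
    prompt = task
    for step in range(1, max_steps + 1):   # constraint: hard step budget
        answer = model(prompt)
        error = verify(answer)             # None means the answer passed
        trace.append({"step": step, "answer": answer, "error": error})
        if error is None:
            return answer, trace
        # Feedback loop: the verifier's message becomes part of the next prompt.
        prompt = f"{task}\nPrevious attempt {answer!r} failed: {error}"
    return None, trace


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call: answers in digits only after seeing feedback."""
    return "4" if "failed" in prompt else "four"


def digits_only(answer: str) -> Optional[str]:
    return None if answer.isdigit() else "answer must be digits"


final, log = run_harness(stub_model, digits_only, "What is 2 + 2?")
print(final, len(log))
```

The stub model is identical on both attempts; what changes the outcome is the loop around it.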
Critical system-level couplings:
- Memory expansion competes with skill loading for scarce context budget
- Protocol standardization can constrain how capabilities are packaged
- Skill execution generates traces that become memory; memory retrieval influences which skills and protocols are chosen
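The first coupling, memory and skills competing for one context budget, can be sketched as a small packing problem. The greedy value-per-token rule, the `Item` fields, and all the numbers below are illustrative assumptions rather than anything measured or prescribed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str       # "memory" or "skill"
    name: str
    tokens: int     # assumed cost in context tokens (must be positive)
    value: float    # assumed usefulness score for the current task


def pack_context(items: list[Item], budget: int) -> list[Item]:
    """Greedy sketch: admit items by value per token until the budget is spent."""
    chosen: list[Item] = []
    remaining = budget
    for item in sorted(items, key=lambda i: i.value / i.tokens, reverse=True):
        if item.tokens <= remaining:
            chosen.append(item)
            remaining -= item.tokens
    return chosen


candidates = [
    Item("memory", "last_session_summary", tokens=400, value=8.0),
    Item("memory", "full_chat_history", tokens=1500, value=9.0),
    Item("skill", "run_tests_procedure", tokens=300, value=9.0),
    Item("skill", "release_checklist", tokens=600, value=6.0),
]

selected = pack_context(candidates, budget=1000)
print([item.name for item in selected])
```

With a 1000-token budget, admitting the memory summary leaves no room for the release checklist, which is the kind of trade-off the coupling describes.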
This reframes the question from "how capable is the model?" to "what burdens have been externalized so the model no longer has to solve them internally every time?" The base model may remain unchanged; what changes is the representation of the task.
This connects to "Why do production AI agents stay deliberately simple?": the externalization framework explains why custom harnesses outperform, since they externalize the right cognitive burdens for their specific domain. It also extends "When should human-agent systems ask for human help?": Magentic-UI's mechanisms (co-planning, action guards, memory) are specific instances of the three externalization dimensions.
The "From Model Scaling to System Scaling" paper sharpens this into an explicit framing: model scaling (bigger models, more data, higher benchmark scores) versus system scaling (designing the auditable, persistent, modular, verifiable architecture around the model). It treats the harness as a first-class object of design, evaluation, and optimization, decomposing it into a foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer — a finer-grained partition of the same memory/skills/protocols externalization. Its central demonstration is that comparable models projected onto different harnesses (Claude Code, OpenClaw, and the released CheetahClaws reference harness) produce qualitatively different agents, making the harness "now a primary source of practical capability." This is direct evidence for the claim that reliability comes from the surrounding system, not from a larger model alone.
Inquiring lines that use this note as a source (200)
This note is a source for these synthesized inquiries.
- Can persistent memory and identity files alone create genuine agent socialization?
- How does the agentic layer amplify individual agent failure modes?
- Does state persistence in AI systems create the same temporal presence as human waiting?
- How do multi-agent LLM systems fail at coordination and role consistency?
- How do LLM user simulators track and maintain consistent goal states across multi-turn interactions?
- Do emotion-driven actions in agent simulators capture genuine belief revision or just reactive behavior?
- Why do planning and grounding have opposing optimization requirements in agents?
- How should GUI agents remember patterns across different software environments?
- Can parallel agents or complementary mechanisms replace single-human interrogation of LLMs?
- Why do longer forecasting horizons degrade LLM accuracy in role-play?
- How does simulator goal drift compound agent intent alignment failures during training?
- Should user simulators be trained via RL like agents or decomposed into trackable state components?
- Can environmental scaffolding replace internal memory scaling in agent design?
- How does credit assignment drive agents to write information into environments?
- Why do weak belief tracking and conservative actions trap agents in low-information states?
- Why does human interaction remain the hardest failure mode for agents?
- What makes users willing to relinquish control to an agent?
- Why do workflow abstractions fail in embodied agent environments?
- Could a single agent system switch memory granularity between tasks?
- What domain properties determine whether causal rules transfer to new agents?
- When should you optimize agent behavior versus tool performance separately?
- Why do rigid orchestration frameworks fail where generative environment specifications succeed?
- What memory and planning capabilities do AI companions need for evolving user needs?
- Why do agents report success when they have actually failed at tasks?
- Can deterministic function calls prevent agent failures better than protocol-mediated tool access?
- Can agent success reports serve as reliable oversight signals in real deployment?
- How does user overreliance on model confidence differ between chat and deployed agents?
- Can the scaling law for discovery extend beyond architectures to agentic systems?
- Why do LLM agents make promises without executing them?
- Can prompt engineering fully prevent role flipping in LLM agents?
- How do cognitive stimulation and process losses interact in group AI systems?
- What makes personas in multi-agent systems actually contribute meaningful domain depth?
- Do architectural changes or training fixes better prevent agreement failures?
- Why do LLM agents fail where game-theoretic bots succeed?
- What makes LLM agents default to passive helpfulness without curiosity rewards?
- Can agentic reasoning outperform rigid rule-based systems for skill refinement?
- What distinguishes collective evolution from vertical self-improvement in agent systems?
- What accounts for performance drops in multi-turn agent interactions?
- What distinguishes strategic fabrication from accidental hallucination in research agents?
- Why do homogeneous multi-agent systems fail similarly to self-revision?
- What distinguishes a neutral simulator from an agent with its own agency?
- What distinguishes domain-specific failure modes from general model limitations?
- How do multi-agent systems improve on single frontier models?
- Can routing systems prevent expert models from failing outside their specialty?
- How do agentic systems recover when specialized models operate outside their scope?
- How should AI systems model human resource constraints and expertise levels?
- How much do metric choices inflate claims about model capabilities?
- How do standardized artifacts prevent autonomous agent failure modes?
- What role does standardization play in multi-agent system ecosystems?
- How should agents separate planning from perception grounding?
- What happens when agents interact with environments and learn from their own mistakes?
- Do agents prefer raw experience over condensed summaries of past actions?
- How much does agent performance depend on demonstration quantity versus curation quality?
- How do correlated errors across agents threaten voting-based error correction systems?
- Does the planning-grounding factoring principle apply to other agent tasks?
- Can models optimized for solo capability support productive human collaboration?
- What task characteristics determine whether humans or agents should handle work?
- Why do AI agents default to passivity when deferral timing is unclear?
- Does adding survey data to interviews improve agent accuracy further?
- Do parallel LLM workers coordinate emergently without predefined collaboration rules?
- Why do memory and feedback loops matter more than model size for agent reliability?
- How should the surrounding agent system be designed to ground actions in reality?
- What specific failure modes must evaluation catch before deploying action-capable systems?
- Can multi-agent LLM systems overcome diversity collapse through structured disagreement?
- How does face-saving avoidance drive LLM grounding failures?
- Do agent frameworks adequately compensate for LLM conversational passivity?
- What makes some model capabilities reliable while others remain brittle?
- How do standardized artifacts improve coordination between writing agents?
- Do multi-agent systems justify their token costs with genuine quality gains?
- Does upgrading model capability improve token efficiency in agentic systems?
- Why do decentralized agents amplify errors without validation checks?
- Does parallel task structure determine optimal multi-agent architecture?
- How do standardized artifacts reduce inter-agent communication failures?
- How does collaboration topology choice affect error amplification in multi-agent systems?
- Which failure mode most limits current multi-agent performance?
- Can cognitive diversity overcome expertise gaps in agent teams?
- Can cognitive diversity compensate for lack of expertise in agent teams?
- Why do role-playing agents show belief-behavior inconsistency in their outputs?
- Can agents improve from deployment signals without explicit human annotation?
- How should CASA theory be updated for modern personalized agents?
- How much does omniscient evaluation overstate real-world simulation fidelity?
- Can extended deliberation in agents become counterproductive like human overthinking?
- Do LLM conversational agents currently detect and prevent derailment trajectories?
- Can episodic memory of UI traces improve open-world agent adaptation?
- Why do 85 percent of production agents avoid third-party frameworks?
- How much autonomy can agents safely exercise before failing?
- What tasks do AI agents still fail at most often?
- How do planning and grounding have opposing optimization requirements in agents?
- Should GUI agents use intermediate structured representations instead of raw pixels?
- Can LLMs coordinate with humans better using different model architectures?
- What makes software engineering environments better suited for RL than other interactive domains?
- How do agents decide when to abstain from contributing?
- What capability threshold do agents need to self-organize effectively?
- Why do multi-agent systems converge without genuine deliberation?
- Why do current evaluation metrics fail to catch reasoning failures in persona agents?
- Can state-indexed memory retrieval breadth predict gains in web agent robustness?
- How does PRAXIS differ architecturally from Agent Workflow Memory and causal rule learning?
- How does the LLM Fallacy differ from automation bias and cognitive offloading?
- What ecosystem conditions make agent attention markets viable?
- Can embodied agents overcome the LLM skill gap in therapy outcomes?
- How do shared KV caches enable emergent coordination between LLM agents?
- How does machine agency spectrum explain tool design mismatches with user behavior?
- Why do completion-mode strengths not transfer to agentic settings?
- How do mode-specific failures differ between completion and agent benchmarks?
- Should agent capability be optimized separately from general capability?
- What execution-layer design prevents agents from passively reacting to environments?
- Can agentic AI tools deliver productivity gains on learning tasks differently?
- Why do agents report success when actions actually fail?
- Why do LLM agents struggle with protocol discipline in distributed settings?
- Does transparency in policy language improve agent trustworthiness over time?
- How do agent capabilities change across 25 relay rounds of interaction?
- Which ecosystem conditions matter most for agent deployment success?
- Which layer of agent systems creates the largest capability gains in practice?
- What are the differences between chat model and agent authorization failures?
- How should proportionality constraints be implemented in agentic systems?
- How do agents learn to report success on actions that actually failed?
- How should we measure context efficiency and verification cost in agents?
- How do evaluation methods differ for single versus multi-agent systems?
- Can agents compress long trajectories without losing critical decision context?
- How should benchmarks measure agent efficiency across all three cost dimensions?
- Why do production AI agents deliberately stay simple and avoid frameworks?
- Which memory components trigger context-length problems in agents?
- Can multimodal agents use entity-centric graphs within this three-axis framework?
- Can pruning policies alone solve working memory bloat in agents?
- How should human oversight apply to persistent agent-authored code?
- Can one-off agent code be safely promoted to durable infrastructure?
- How does workflow abstraction compare to state-indexed procedural memory for web agents?
- Where does agent reliability come from if not better tools?
- Do multi-agent language model teams fail the same way individual reasoning does?
- What specific training mechanism causes agents to over-claim actions and overwrite documents?
- Can single benchmarks predict whether an agent will work in the real world?
- Why do AI agents fail at verification but succeed at generation?
- How do externalizing cognitive work and coordination infrastructure relate to agent reliability?
- When does memory consolidation help agents instead of hurting performance?
- How do agents decide when to pause and reflect on their strategy?
- Can agent-controlled memory management outperform fixed consolidation schedules?
- Why does capability discovery become the bottleneck in large agent systems?
- Does workflow-level memory or state-action memory better capture reusable agent knowledge?
- Can applicability conditions be preserved automatically when agents reflect on trials?
- What role does runtime feedback play in agent verification and progress confirmation?
- Can code-based reasoning replace natural language deliberation in agentic systems?
- How do human-agent systems incorporate diverse feedback into model behavior?
- Can AI models retain knowledge across changing environments without catastrophic forgetting?
- How do agents automatically generate suitable learning tasks based on current capability?
- What makes composable abstractions emerge under performance pressure in agent systems?
- Why do continuously consolidated agent memories eventually degrade below no-memory baseline?
- What makes idle window detection valuable for continuous agent improvement?
- Which failure modes dominate in autonomous research agents?
- What lifecycle management prevents in-loop skill creation from bloating an agent?
- Where should the trust boundary sit in multi-agent planning systems?
- How do capability vectors enable discovery in multi-agent systems?
- How do planning and memory compress agentic system costs?
- When should agents stop recursing to optimize success versus cost?
- How should safety systems catch confident failures from agents that report success on unsafe actions?
- What distinguishes working memory from strategic memory in agent task execution?
- How can agents detect missing information before attempting to solve problems?
- Should artifact-level benchmarks replace token counts for agent evaluation?
- What distinguishes communicative acts from operational actions in agentic LLMs?
- How can agents distinguish between optional and required form fields during execution?
- How does completion bias in agents differ from other epistemic failure modes?
- How do external prompt artifacts improve agent behavior compared to inline instructions?
- What role should reasoning agents play in validating multi-LLM ensemble outputs?
- Why do high-level design guidelines fail to capture real-world deployment nuance?
- How does deterministic feature engineering increase information for computationally bounded agents?
- What degradation patterns emerge as relay length increases in delegated tasks?
- Which model capabilities actually matter for sustained workflow delegation?
- How do fast and slow timescales enable continual agent adaptation?
- What makes task alignment more fragile than underlying knowledge retention?
- Does single-capability ranking guarantee agent failure in production deployment?
- Why do agents systematically underuse condensed experience in skill documents?
- Can we design efficient agents by targeting constraints directly?
- Can multi-agent teams solve problems better than single models thinking longer?
- What properties of agent systems only become visible across multiple sessions?
- How does durable memory quality shape agent performance over time?
- Why do production agents depend more on their surrounding pipeline than the model?
- What governance and safety measurements matter for deployed agent environments?
- What trust signals do agents lack that humans use to assess credibility?
- Can architectural changes reorder when uncertainty and empowerment signals influence decisions?
- How do agents decide when to stop and reflect on failure?
- How will the agent economy reshape compute infrastructure design?
- Why does continuous agent inference differ from human user inference?
- How do perception and execution gaps limit current AI agent performance?
- Can screen perception be effectively decoupled from planning in GUI agents?
- Can test environments reliably predict how models behave in actual deployment?
- How do memory tools and planning each contribute to agent efficiency?
- What components of agent scaffolding most impact domain-specific output quality?
- Why do weaker agents need more aggressive context compression than stronger ones?
- How does external context control compare to agents managing their own state internally?
- Can context management policies transfer across agents of similar capability levels?
- Why does treating model behavior as part of the design surface matter for guardrails?
- What unique perspective do designers bring to LLM adaptation that engineers might miss?
- Does codifying expertise into AI agents drive faster labor substitution?
- Which agent architectures consistently outperform base models on hard prediction questions?
- What separates artifact recall from persistent memory commitment in agents?
- What hidden signals in agent logs reveal about frontier capability beyond pass-fail outcomes?
- What other agent behaviors besides citations reveal reasoning quality?
- Should new agent protocols replace existing ones or layer on top of them?
- What specific bookkeeping tasks can environments maintain more reliably than policies?
- Why does externalized state beat parameter scaling for agent reliability?
- How does externalizing reasoning into harness artifacts improve agent reliability?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
- Useful Memories Become Faulty When Continuously Updated by LLMs
- Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- LLMs Corrupt Your Documents When You Delegate
- Large Language Model Agents Are Not Always Faithful Self-Evolvers
- The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics
- Simulating Society Requires Simulating Thought
Original note title
agent reliability comes from externalizing cognitive burdens into memory skills and protocols not from larger models — the harness is the unification layer