What properties of agent systems only become visible across multiple sessions?

This explores which agent behaviors are invisible in a single run and only show up once an agent persists, learns, or coordinates across many sessions — the properties that single-shot evaluation can't catch.

The question here is which properties of an agent system are fundamentally *longitudinal*: impossible to observe in any one session and visible only across a history of them. The corpus points to several, and they cluster around memory, learning, and coordination.

The first is whether an agent actually *gets better*, or merely doesn't get worse. Within a session an agent either succeeds or fails; across sessions you see learning curves and forgetting. "Can agents learn new skills without forgetting old ones?" frames lifelong learning as exactly this multi-session property: an agent that stores executable skills and composes new ones from old can keep improving, while a weight-updating agent quietly suffers catastrophic forgetting that is visible only when you revisit an old task. "Can agents learn continuously from experience without updating weights?" makes the mechanism concrete: adaptation happens entirely through accumulated episodic memory rather than parameter changes, so "how much has this agent improved" is a question that only has meaning over time. "Can agents adapt without pausing service to users?" sharpens it further: two clocks are running, fast skill injection within seconds and slow gradient optimization over idle windows of minutes to hours, and the two only reinforce each other across many sessions.
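One way to make "forgetting is only visible on revisit" concrete is a small longitudinal metric. The sketch below is an illustration only, not a method from any of the cited notes: the function name `forgetting_by_task`, the `(session, task, score)` record format, and the example scores are all assumptions made up for this example. It compares the latest score on each task with the best score from earlier sessions.

```python
from collections import defaultdict


def forgetting_by_task(history):
    """Compute per-task forgetting from multi-session results.

    history: list of (session_index, task_id, score) tuples, in any order.
    Returns {task_id: best_earlier_score - latest_score} for every task
    seen at least twice. A positive value means the agent got worse.
    """
    by_task = defaultdict(list)
    for session, task, score in history:
        by_task[task].append((session, score))

    forgetting = {}
    for task, runs in by_task.items():
        runs.sort()  # order runs by session index
        if len(runs) < 2:
            continue  # one run of a task cannot reveal forgetting
        earlier_best = max(score for _, score in runs[:-1])
        latest = runs[-1][1]
        forgetting[task] = earlier_best - latest
    return forgetting


# Hypothetical scores, invented for illustration.
history = [
    (0, "craft_pickaxe", 1.0),
    (1, "mine_iron", 1.0),
    (2, "craft_pickaxe", 0.0),  # revisit: the old skill has been lost
    (2, "mine_iron", 1.0),
    (2, "build_shelter", 1.0),  # seen once, so nothing to compare against
]
result = forgetting_by_task(history)
print(result)
```

A task that appears only once is left out of the result entirely, which is the point: with one observation there is nothing to measure.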

The second invisible property is memory hygiene. A single session never reveals whether an agent's memory is well structured or quietly rotting. "Can agents compress their own memory without losing critical details?" shows that badly done consolidation degrades the agent, but the degradation surfaces only session after session, as history piles up. "How should agent memory split across time scales?" adds a useful lens: memory isn't one thing, and the dialogue-level components (conversation history, scratchpad) have completely different failure and update patterns from the turn-level ones (examples, task trajectory). Those distinctions matter precisely because they play out over the lifetime of an agent, not one turn.

The third is collective and cross-user behavior, which by definition can't appear in any single conversation. "How can agent systems share learned skills across users?" describes skills that improve by aggregating trajectories across many users and many sessions, turning siloed individual learning into shared capability. And "Where does agent reliability actually come from?" is the quiet unifier here: reliability itself turns out to be a multi-session property, because it comes from a persistent harness layer (memory, skills, protocols) that exists *between* sessions, not from anything the model does inside one.

The surprise worth taking away: even *coordination failure* is partly longitudinal. "Why do multi-agent systems fail to coordinate at scale?" shows agents accepting neighbors' claims without verification and letting errors propagate, a pathology that compounds over repeated interaction. So the things that only become visible across sessions aren't edge cases; they're the properties that actually decide whether an agent system is trustworthy. The single-session demo is the part that lies to you.

Sources (8 notes)

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
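The idea of an embedding-indexed skill library with composition can be sketched in a few lines. This is a toy under stated assumptions, not VOYAGER's implementation: `embed` is a bag-of-words stand-in for a real embedding model, and `SkillLibrary`, the skill descriptions, and the inventory representation are all hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Hypothetical library: skills are callables indexed by description."""

    def __init__(self):
        self._skills = []  # list of (description, embedding, callable)

    def add(self, description, fn):
        self._skills.append((description, embed(description), fn))

    def retrieve(self, query):
        """Return the stored skill whose description best matches the query."""
        q = embed(query)
        return max(self._skills, key=lambda s: cosine(q, s[1]))[2]


library = SkillLibrary()


def collect_wood(inv):
    return inv + ["wood"]


def craft_planks(inv):
    return inv + ["planks"] if "wood" in inv else inv


def craft_table(inv):
    # Composed skill: reuses stored skills instead of relearning them.
    inv = library.retrieve("collect wood")(inv)
    inv = library.retrieve("craft planks")(inv)
    return inv + ["table"] if "planks" in inv else inv


library.add("collect wood from trees", collect_wood)
library.add("craft planks from wood", craft_planks)
library.add("craft a table from planks", craft_table)

inventory = library.retrieve("craft a table")([])
print(inventory)
```

Because the simpler skills live in the library rather than in model weights, adding `craft_table` cannot overwrite `collect_wood`; in this sketch, that separation is what stands in for avoiding forgetting.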

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can agents adapt without pausing service to users?

MetaClaw demonstrates that deployed agents require both rapid skill injection from failures (seconds, zero downtime) and slower gradient-based optimization during idle windows (minutes to hours). The two mechanisms reinforce each other, with better policies producing more informative failures and richer skills enabling higher-reward trajectories.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
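To see why the two granularities imply different update policies, here is a minimal sketch. The four component names come from the note above; the class `AgentMemory`, its methods, and the specific policy (dialogue-level state persists across turns, turn-level state is rebuilt at the start of each turn) are assumptions for illustration rather than RAISE's actual design.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Dialogue-level: assumed to persist across turns of one conversation.
    conversation_history: list = field(default_factory=list)
    scratchpad: dict = field(default_factory=dict)
    # Turn-level: assumed to be rebuilt for every turn.
    examples: list = field(default_factory=list)
    task_trajectory: list = field(default_factory=list)

    def start_turn(self, user_message, retrieved_examples):
        """Append to dialogue-level state; reset turn-level state."""
        self.conversation_history.append(("user", user_message))
        self.examples = list(retrieved_examples)
        self.task_trajectory = []

    def record_step(self, action, observation):
        self.task_trajectory.append((action, observation))

    def end_turn(self, reply):
        self.conversation_history.append(("agent", reply))


# Hypothetical two-turn dialogue.
memory = AgentMemory()
memory.start_turn("Find flights to Oslo", ["example: flight search"])
memory.record_step("search_flights", "3 results")
memory.scratchpad["destination"] = "Oslo"
memory.end_turn("I found 3 flights.")

memory.start_turn("Book the cheapest", ["example: booking"])
print(len(memory.conversation_history), memory.task_trajectory, memory.scratchpad)
```

Under this policy a stale scratchpad entry can survive for the whole dialogue, while a bad retrieved example is gone by the next turn, so the two kinds of memory fail on different timescales.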

How can agent systems share learned skills across users?

SkillClaw aggregates interaction trajectories across users, processes them through an autonomous evolver that identifies patterns and refines skills, then synchronizes updates system-wide. This converts siloed individual learning into shared capability improvement without manual curation.
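A toy version of the pooling step shows why the signal is inherently cross-user. This is not SkillClaw's evolver: the function `aggregate_skill_stats`, the `(user, skill, succeeded)` record format, the `min_users` threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict


def aggregate_skill_stats(trajectories, min_users=2):
    """Pool outcomes across users and report per-skill success rates.

    trajectories: list of (user_id, skill_name, succeeded) tuples.
    Only skills observed for at least `min_users` distinct users are
    reported, since a single user's sessions are treated as too narrow.
    """
    users = defaultdict(set)
    outcomes = defaultdict(list)
    for user, skill, succeeded in trajectories:
        users[skill].add(user)
        outcomes[skill].append(succeeded)
    return {
        skill: sum(results) / len(results)
        for skill, results in outcomes.items()
        if len(users[skill]) >= min_users
    }


# Hypothetical trajectories from three users.
trajectories = [
    ("u1", "parse_csv", True),
    ("u2", "parse_csv", False),
    ("u3", "parse_csv", True),
    ("u1", "send_email", False),  # only one user, so not reported
]
stats = aggregate_skill_stats(trajectories)
print(stats)
```

No individual user's log contains enough evidence to rank `parse_csv`; the success rate exists only in the pooled view.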

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Why do multi-agent systems fail to coordinate at scale?

The AgentsNet benchmark shows that agents fail to coordinate strategies, either by agreeing too late or by adopting strategies without informing neighbors. Agents accept neighbor information without verification, which lets errors propagate, even though they remain capable of detecting direct conflicts.
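The compounding effect of unverified claims can be illustrated with a deliberately simple simulation. It is a toy model, not the AgentsNet setup: the line topology, the `spread` function, and especially the assumption that a verifying agent can check a claim against ground truth are all simplifications made for this sketch.

```python
def spread(beliefs, ground_truth, rounds, verify):
    """Simulate claims passing down a line of agents.

    Each round, every agent except the first receives its left neighbor's
    belief. With verify=False the claim is adopted as-is. With verify=True
    the agent rejects any claim that contradicts ground truth (assumed
    checkable here) and keeps its own belief instead.
    Returns the number of wrong agents after each round.
    """
    beliefs = list(beliefs)
    wrong_counts = []
    for _ in range(rounds):
        updated = list(beliefs)
        for i in range(1, len(beliefs)):
            claim = beliefs[i - 1]
            if verify and claim != ground_truth:
                continue  # reject the bad claim
            updated[i] = claim
        beliefs = updated
        wrong_counts.append(sum(b != ground_truth for b in beliefs))
    return wrong_counts


# Hypothetical start state: only agent 0 holds the wrong value.
start = ["red", "blue", "blue", "blue", "blue"]
unverified = spread(start, "blue", rounds=3, verify=False)
verified = spread(start, "blue", rounds=3, verify=True)
print(unverified, verified)
```

In the unverified run the error advances one agent per round, so a single interaction looks almost healthy and the damage shows up only over repeated rounds.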

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: *What properties of agent systems only become visible across multiple sessions?* remains open, but a curated library (2025–2026) has sketched an answer. Your task is to pressure-test it.

What a curated library found — and when (dated claims, not current truth):
• Lifelong learning (improvement, not just non-failure) only appears longitudinally; single-session demos hide catastrophic forgetting in weight-updating agents, visible only on task revisit (~2025).
• Memory hygiene degrades predictably across sessions; dialogue-level and turn-level memory components fail on different timescales, compounding over agent lifetime (~2026).
• Reliability is a multi-session property rooted in externalizing cognitive burdens (memory, skills, protocols) into a persistent harness layer between sessions, not model internals (~2026).
• Cross-user skill aggregation and collective learning require centralized trajectory pooling across many sessions; siloed single-session learning cannot surface shared capability improvements (~2025–2026).
• Coordination failure (error propagation via unverified neighbor claims) compounds over repeated interaction in distributed multi-agent systems (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2604.08224 — Externalization in LLM Agents (2026-04)
• arXiv:2605.12978 — Useful Memories Become Faulty When Continuously Updated (2026-05)
• arXiv:2603.17187 — MetaClaw: Meta-Learning in the Wild (2026-03)
• arXiv:2506.02153 — Small Language Models are the Future of Agentic AI (2025-06)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether newer models (e.g., o1, o3, Claude 4, Gemini 3), novel training regimes (online RL, curriculum learning, multi-agent co-training), orchestration advances (persistent memory backends, skill caches, agentic harnesses), or evaluation methods have RELAXED or OVERTURNED it. Separate the durable question (still open) from the perishable limitation (possibly solved). Cite what resolved it; flag where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.
(3) Propose 2 research questions that ASSUME the multi-session regime may have fundamentally shifted — e.g., are single-session demonstrations now sufficient if models are large enough? Can harness externalization alone sustain reliability without longitudinal learning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
