What distinguishes first-order from second-order agency in language models?

This note reads 'first-order' agency as a model acting directly in the moment, executing the task in front of it, and 'second-order' agency as a model acting on a model of intentions: tracking goals across turns, monitoring its own reasoning, and discovering what is actually wanted rather than answering only what was literally asked.

The gap at issue is between an LLM doing the next thing (first-order agency) and an LLM steering toward a goal it has to infer and hold across time (second-order agency). The corpus doesn't use these exact labels, but it maps the territory sharply, and the consistent finding is that models are far stronger at the first than the second. The clearest demonstration is in multi-turn conversation: "Why do language models respond passively instead of asking clarifying questions?" shows that standard RLHF rewards immediate helpfulness, which actively trains models *out* of second-order behavior. They answer rather than ask clarifying questions, because the next-turn reward signal never credits the long game. Second-order agency requires valuing an interaction's eventual outcome over the current reply, and most models aren't optimized to do that.
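The reward contrast can be made concrete with a minimal sketch. It is not CollabLLM's actual method: it assumes each candidate reply carries a hypothetical immediate score plus hypothetical scores for the turns it leads to, and compares a next-turn selector with one that adds discounted future value. The names `Candidate`, `next_turn_value`, `multi_turn_value`, the discount `gamma`, and every number are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    immediate: float      # hypothetical score for this reply alone
    future: List[float]   # hypothetical scores for the later turns it leads to


def next_turn_value(c: Candidate) -> float:
    """First-order view: credit only the current reply."""
    return c.immediate


def multi_turn_value(c: Candidate, gamma: float = 0.9) -> float:
    """Second-order view: current reply plus discounted value of later turns."""
    return c.immediate + sum(gamma ** (k + 1) * r for k, r in enumerate(c.future))


# Illustrative numbers only: answering now scores well immediately,
# asking first scores poorly now but leads to better later turns.
candidates = [
    Candidate("Answer immediately", immediate=0.7, future=[0.3, 0.3]),
    Candidate("Ask a clarifying question", immediate=0.2, future=[0.9, 0.9]),
]

first_order_pick = max(candidates, key=next_turn_value)
second_order_pick = max(candidates, key=multi_turn_value)
```

Under these made-up scores the next-turn selector chooses the immediate answer and the multi-turn selector chooses the clarifying question, which is the behavioral difference the paragraph above describes.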

What happens when that capacity is missing is visible in "Why do language models fail in gradually revealed conversations?": across 200,000+ conversations, models lock onto an early guess about user intent and can't recover, producing a 39% average performance drop. A first-order agent commits to its best immediate read; a second-order agent would hold uncertainty open and revise. The inability to revise an inferred goal is precisely the second-order failure mode, and patched-on agent mitigations recover only 15–20% of the loss, which suggests the problem is architectural rather than a prompting gap.
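A short calculation shows how little of the gap closes. This sketch assumes the 15–20% figure means a fraction of the 39% loss is recovered; the function name `residual_drop` is hypothetical.

```python
def residual_drop(drop_pct: float, recovered_fraction: float) -> float:
    """Loss that remains after a mitigation recovers a fraction of the drop."""
    return drop_pct * (1.0 - recovered_fraction)


AVERAGE_DROP = 39.0             # percent, as quoted in the text
RECOVERY_RANGE = (0.15, 0.20)   # fraction of the loss recovered, as quoted

best_case = residual_drop(AVERAGE_DROP, RECOVERY_RANGE[1])
worst_case = residual_drop(AVERAGE_DROP, RECOVERY_RANGE[0])
print(f"Remaining drop: between {best_case:.2f}% and {worst_case:.2f}%")
```

On that reading, roughly 31% to 33% of the original 39% drop remains after mitigation.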

The corpus also warns that apparent second-order agency is often first-order behavior in disguise. "Are models actually reasoning about constraints or just defaulting conservatively?" finds that twelve of fourteen models do *worse* when constraints are removed: they look like they're reasoning about a goal, but they're really just defaulting to the harder, safer option. Similarly, "Do large language models actually commit to a single character?" shows there's no stable 'self' doing the steering: regenerate a response and you get a different character sampled from a superposition, each locally consistent but none committed. It's hard to have durable second-order agency, meaning goals that persist over time, when the agent itself is resampled at every generation.
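The 'superposition' idea can be illustrated with a toy 20-questions game. This is an analogy under stated assumptions, not a model of how an LLM works internally: the object pool, the answer history, and the `regenerate` function are all hypothetical.

```python
import random

# Hypothetical object pool for a 20-questions game.
OBJECTS = [
    {"name": "cat", "alive": True, "bigger_than_breadbox": False},
    {"name": "mouse", "alive": True, "bigger_than_breadbox": False},
    {"name": "horse", "alive": True, "bigger_than_breadbox": True},
    {"name": "teapot", "alive": False, "bigger_than_breadbox": False},
]

# Answers already given in the conversation: (attribute, answer).
history = [("alive", True), ("bigger_than_breadbox", False)]


def consistent(obj: dict, answers: list) -> bool:
    """True if the object agrees with every answer given so far."""
    return all(obj[attr] == ans for attr, ans in answers)


def regenerate(rng: random.Random) -> str:
    """Sample one object consistent with the history, as if revealing 'the' answer."""
    pool = [o["name"] for o in OBJECTS if consistent(o, history)]
    return rng.choice(pool)


rng = random.Random(0)
reveals = [regenerate(rng) for _ in range(20)]
```

Every entry in `reveals` fits the conversation so far, yet nothing was fixed in advance: each regeneration draws afresh from the set of consistent objects, which is the sense in which no single character is committed to.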

There's a deeper architectural undercurrent here too. "Do transformers hide reasoning before producing filler tokens?" reveals that models can compute an answer in early layers and then overwrite it to satisfy output format, a striking dissociation between internal computation and external action that complicates any clean story about what the agent is 'trying' to do. And "Can prompt optimization teach models knowledge they lack?" sets a hard ceiling: you can reorganize what a model already has, but you can't prompt second-order capacity into existence if the training never built it.

The practical upshot worth carrying away: most of what we call 'agentic' work is first-order, and "Can small language models handle most agent tasks?" argues that small models handle that repetitive, well-defined layer at a fraction of the cost. The expensive, still-unsolved part is the second-order layer: sustaining intent, asking the right question, revising a goal mid-stream. That's the frontier, and the corpus suggests it's blocked less by scale than by what our reward signals and architectures actually train for.
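The cost argument can be sketched as a simple router that sends well-defined tasks to a small model and escalates only open-ended ones. Everything here is an assumption for illustration: the `Task` class, the `well_defined` flag, the `route` function, and the cost units. The 10x ratio is taken from the low end of the 10–30× range quoted in the source notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    description: str
    well_defined: bool   # hypothetical flag: narrow, repetitive, schema-bound work


SLM_COST = 1.0    # arbitrary cost unit per call (assumption)
LLM_COST = 10.0   # low end of the 10-30x range quoted in the source notes


def route(task: Task) -> str:
    """Small model by default; escalate to the large model for open-ended tasks."""
    return "slm" if task.well_defined else "llm"


def total_cost(tasks: List[Task]) -> float:
    """Sum per-call costs under the routing rule above."""
    return sum(SLM_COST if route(t) == "slm" else LLM_COST for t in tasks)


tasks = [
    Task("Extract fields from an invoice", well_defined=True),
    Task("Format a tool call as JSON", well_defined=True),
    Task("Work out what the user actually wants", well_defined=False),
]

mixed = total_cost(tasks)
llm_only = LLM_COST * len(tasks)
```

In this toy workload the mixed stack costs 12 units against 30 for sending everything to the large model. The hard part in practice is the `well_defined` flag itself, since deciding which tasks need intent inference is a second-order judgment.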

Sources (7 notes)

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do language models fail in gradually revealed conversations?

Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings because they lock into incorrect early guesses. Agent mitigations recover only 15–20% of this loss.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **What distinguishes first-order from second-order agency in language models, and has that distinction held or shifted?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026, tracking increasing sophistication in agentic capability measurement:
• Standard RLHF rewards immediate helpfulness, actively training models *out* of second-order behavior like asking clarifying questions; multi-turn agents answer rather than reason about long-term goals (~2025).
• Models lock onto early intent guesses in 200,000+ conversations, showing 39% avg performance drop; patch mitigations recover only 15–20%, suggesting architectural rather than prompting roots (~2026).
• Twelve of fourteen models perform *worse* when constraints removed, masking first-order safety defaults as second-order reasoning (~2026).
• Models can compute answers in early layers then overwrite them to match output format — internal computation dissociated from external action (~2024).
• Prompt optimization cannot inject new capacity; it only activates existing knowledge (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.06120 (2025-05) — LLMs Get Lost In Multi-Turn Conversation
• arXiv:2602.07338 (2026-02) — Intent Mismatch Causes LLMs to Get Lost
• arXiv:2506.02153 (2025-06) — Small Language Models are the Future of Agentic AI
• arXiv:2412.04537 (2024-12) — Understanding Hidden Computations in Chain-of-Thought

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the 39% multi-turn drop, the 15–20% recovery ceiling, and the intent-locking pattern: investigate whether newer scaling, constitutional AI, process-reward models, or multi-agent orchestration (e.g., reflection loops, hierarchical planning) have since dissolved these limits. Separate the durable claim—*that LLMs struggle to sustain inferred intent across turns*—from the perishable limitation—*that current training methods cannot fix it*. Where does the constraint still hold? What has relaxed it?
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Look for papers showing emergent second-order behavior at scale, or evidence that small-model agentic stacks with external memory/planning actually *do* achieve durable intent without retraining.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) If external planning/memory systems have lifted the second-order ceiling, what is *now* the binding constraint on agentic coherence? (b) Does the first-order/second-order split remain valid once you decouple LLM computation from agentic decision-making?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

