INQUIRING LINE

Why do passive conversational agents fail at collaborative decision-making?

This explores why AI chat assistants that wait for prompts struggle when the task is to think *with* someone toward a shared decision — and whether that failure is baked into how they're trained rather than what they can do.


The corpus's sharpest claim is that this passivity, answering *for* someone instead of deciding *with* them, is a design artifact rather than a capability ceiling: agents are passive by training, not by nature [Why do AI agents fail to take initiative?]. The mechanism is specific. Standard RLHF optimizes each reply for immediate helpfulness, which quietly penalizes the moves collaboration actually requires, such as asking a clarifying question instead of guessing [Why do language models respond passively instead of asking clarifying questions?]. Reframe the reward to estimate the long-term value of the whole interaction, and the same model starts discovering intent instead of racing to satisfy the current turn [Why do language models respond passively instead of asking clarifying questions?]. One line of work raised the rate of proactive behaviors, such as critical thinking and clarification-seeking, from 0.15% to 73.98% by training them with reinforcement learning [Why do AI agents fail to take initiative?].
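To make the reward contrast concrete, here is a minimal sketch under toy assumptions; it is not CollabLLM's actual reward. The candidate replies, their scores, the simulated rollouts, and the discount factor `gamma` are all hypothetical. A next-turn reward ranks replies by immediate helpfulness alone, while a multi-turn-aware reward adds the discounted value of simulated future turns, which is enough to flip the preferred move from guessing to asking.

```python
from statistics import mean

# Hypothetical toy numbers: each candidate reply has an immediate helpfulness
# score and a few simulated continuations (rewards for the turns that follow).
CANDIDATES = {
    "guess_and_answer": {
        "immediate": 0.8,
        # A wrong guess forces repair turns later, so future rewards are low.
        "rollouts": [[0.2, 0.1], [0.3, 0.2], [0.1, 0.1]],
    },
    "ask_clarifying_question": {
        "immediate": 0.3,
        # Once intent is known, the following turns land well.
        "rollouts": [[0.9, 0.9], [0.8, 0.9], [0.9, 0.8]],
    },
}


def next_turn_reward(candidate):
    """Score a reply only by how helpful this single turn looks."""
    return candidate["immediate"]


def multi_turn_reward(candidate, gamma=0.9):
    """Immediate reward plus the mean discounted return of simulated futures."""
    def discounted(rewards):
        return sum(gamma ** (t + 1) * r for t, r in enumerate(rewards))
    return candidate["immediate"] + mean(discounted(r) for r in candidate["rollouts"])


def pick(reward_fn):
    """Return the name of the candidate reply the reward function prefers."""
    return max(CANDIDATES, key=lambda name: reward_fn(CANDIDATES[name]))


print(pick(next_turn_reward))   # guess_and_answer
print(pick(multi_turn_reward))  # ask_clarifying_question
```

Under the next-turn score the guess wins (0.8 against 0.3); once the simulated futures are counted, the clarifying question wins because it makes the later turns land.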

So the failure isn't that the agent can't collaborate; it's that the agent is structurally reactive. It can't initiate a topic, plan strategically across turns, or steer the conversation, because its objective is to respond, not to lead [Why can't conversational AI agents take the initiative?]. That matters for decisions specifically, because good joint decision-making depends on volunteering relevant information before being asked, the kind of cooperation Grice's maxims describe. Proactive dialogue does exactly this, and in simulations it cut dialogue turns by 60% in medium-complexity domains, yet the behavior is almost entirely absent from the datasets models learn from and the benchmarks they are judged by [Could proactive dialogue make conversations dramatically more efficient?]. The corpus also formalizes *when* to break passivity: insert-expansions from conversation analysis give agents a principled trigger to consult the user, clarifying intent and scoping the response before acting, instead of silently chaining tools and drifting from what the person meant [When should AI agents ask users instead of just searching?].
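The insert-expansion idea can be sketched as a gate in front of tool use. This is a hypothetical simplification, assuming a request can be parsed into an intent plus the slots it needs; `Request`, `required`, and `next_move` are illustrative names, not taken from the cited work. When any slot is unresolved, the agent pauses the main sequence to ask, and it only acts once the scope is settled.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical parsed user request: an intent plus the slots it needs."""
    intent: str
    required: tuple[str, ...]
    slots: dict[str, str] = field(default_factory=dict)

    def missing(self) -> list[str]:
        """Required slots the user has not yet filled."""
        return [name for name in self.required if not self.slots.get(name)]


def next_move(req: Request) -> str:
    """Open an insert-expansion (a question) before acting if scope is unresolved."""
    gaps = req.missing()
    if gaps:
        # Pause the base sequence and probe the user instead of guessing via tools.
        return f"ASK: before I {req.intent}, could you tell me the {', '.join(gaps)}?"
    # Scope is settled, so the agent may proceed to tool calls.
    return f"ACT: {req.intent} with {req.slots}"


print(next_move(Request("book a flight", ("destination", "date"), {"destination": "Lisbon"})))
# ASK: before I book a flight, could you tell me the date?
print(next_move(Request("book a flight", ("destination", "date"),
                        {"destination": "Lisbon", "date": "May 3"})))
```

A real agent would need a far richer notion of ambiguity than missing slots; the point of the sketch is only where the check sits, before the first tool call rather than after a drift has to be repaired.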

Here's the twist you might not expect: making agents proactive doesn't automatically make them good collaborators, and it can make them worse. Frontier models that solve problems well on their own fail when they have to reason together: they reach agreement over 90% of the time regardless of whether the agreed answer is correct, so collaboration can drag performance *below* solo work [Why do language models fail at collaborative reasoning?]. The missing skill isn't talking more; it's productive *disagreement*, and that, too, turns out to be trainable: self-play preference training improved collaborative outcomes by 16.7% [Why do language models fail at collaborative reasoning?]. Push too far toward assertiveness and a different risk appears: intelligence and adaptivity without civility produce socially blind agents that interrupt and override user direction, so respecting boundaries, timing, and user autonomy is part of the spec, not a nicety [How can proactive agents avoid feeling intrusive to users?].
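One way to check for agreement that does not track correctness is to compare how often a pair agrees when its final answer is right versus when it is wrong, alongside solo and collaborative accuracy. The sketch below assumes a hypothetical log format (`Episode`) and uses made-up episodes purely to exercise the arithmetic; the numbers are not results from the cited study.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One hypothetical two-agent discussion of a single problem."""
    solo_correct: bool   # was the agent right when working alone?
    agreed: bool         # did the pair end in agreement?
    final_correct: bool  # was the pair's final answer right?


def rate(flags):
    """Fraction of True values; 0.0 for an empty selection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def summarize(episodes):
    return {
        "solo_accuracy": rate(e.solo_correct for e in episodes),
        "collab_accuracy": rate(e.final_correct for e in episodes),
        "agree_when_right": rate(e.agreed for e in episodes if e.final_correct),
        "agree_when_wrong": rate(e.agreed for e in episodes if not e.final_correct),
    }


# Made-up log, only to exercise the arithmetic (not data from any study).
episodes = (
    [Episode(True, True, True)] * 5      # right alone, agreed, stayed right
    + [Episode(True, True, False)] * 2   # right alone, talked into a wrong answer
    + [Episode(False, True, False)] * 2  # wrong alone, agreed on the wrong answer
    + [Episode(False, False, False)]     # wrong alone, no agreement
)
print(summarize(episodes))
# {'solo_accuracy': 0.7, 'collab_accuracy': 0.5, 'agree_when_right': 1.0, 'agree_when_wrong': 0.8}
```

A high `agree_when_wrong` means the pair settles on wrong answers almost as readily as on right ones, and a `collab_accuracy` below `solo_accuracy` is the "worse together" pattern the paragraph describes.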

The most interesting move in the corpus is questioning whether conversation is even the right medium for collaborative decisions. MetaGPT shows that agents exchanging standardized documents, which they actively pull from a shared environment, coordinate better than agents trading natural-language messages, because chat injects noise that structured infrastructure removes [Does structured artifact sharing outperform conversational coordination?]. Pushed further, some work routes coordination *beneath* language entirely: sparse autoencoders recover individual, shared, and private latent thoughts from agents' hidden states, so alignment conflicts surface at the representational level before they ever reach words [Can agents share thoughts directly without using language?]. And rather than solving the hardest collaborative question, *when should the agent defer to the human?*, Magentic-UI sidesteps it. Because there is no ground truth for optimal deferral timing, it distributes decisions across six mechanisms: co-planning, co-tasking, action guards, verification, memory, and multitasking [When should human-agent systems ask for human help?].
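The pull-based pattern can be sketched as a shared pool of typed artifacts. This is a hypothetical illustration, not MetaGPT's API: `Artifact`, `ArtifactPool`, and the artifact kinds are assumed names. Agents publish structured documents, and each agent pulls only the kinds it needs instead of reading every message in a chat transcript.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Artifact:
    """A structured document such as a requirements doc or a design spec."""
    kind: str
    author: str
    content: dict


class ArtifactPool:
    """Shared environment: agents publish typed artifacts and pull what they need."""

    def __init__(self):
        self._by_kind = defaultdict(list)

    def publish(self, artifact: Artifact) -> None:
        self._by_kind[artifact.kind].append(artifact)

    def pull(self, kinds: set[str]) -> list[Artifact]:
        # Each agent reads only the artifact kinds it subscribes to,
        # instead of parsing every message every other agent produced.
        return [a for kind in sorted(kinds) for a in self._by_kind.get(kind, [])]


pool = ArtifactPool()
pool.publish(Artifact("requirements", "product_manager", {"goal": "CLI to-do app"}))
pool.publish(Artifact("design", "architect", {"modules": ["storage", "cli"]}))
pool.publish(Artifact("chat", "architect", {"text": "thinking out loud..."}))

engineer_inputs = pool.pull({"requirements", "design"})
print([a.kind for a in engineer_inputs])  # ['design', 'requirements']
```

Sorting the requested kinds only makes the pull order deterministic; the design choice that matters is that the engineer never sees the architect's informal chat, which is the noise the structured channel is meant to filter out.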

All of this leaves a humbling number: in a simulated workplace, leading agents complete only about 30% of tasks. Social interaction is one of three primary failure modes, alongside professional UI navigation and domain-specific knowledge, and multi-turn task performance consistently drops to about 35% [Why do AI agents fail at workplace social interaction?]. The throughline across all of this: passive agents fail at collaborative decision-making not because the model is weak but because passivity is optimized in, productive disagreement and civility are optimized out, and conversation itself may be a lossier coordination channel than the shared structures humans actually use to decide together. Users seem to feel this: they judge dialogue partners most heavily on perceived competence (49% of the variance in their impressions), which suggests an agent that won't lead may read as incompetent even when it is capable [How do users mentally model dialogue agent partners?].


Sources (12 notes)

Why do AI agents fail to take initiative?

Research shows next-turn reward optimization structurally removes initiative from models, but proactive behaviors like critical thinking and clarification-seeking are trainable (0.15% to 73.98% with RL). The core challenge is balancing proactivity with civility to avoid intrusion.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Could proactive dialogue make conversations dramatically more efficient?

Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.

When should AI agents ask users instead of just searching?

Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.

Why do language models fail at collaborative reasoning?

Frontier LLMs that solve problems alone fail when collaborating, achieving >90% agreement regardless of correctness. Self-play preference training improves outcomes by 16.7%, suggesting social skills for effective disagreement can be trained.

How can proactive agents avoid feeling intrusive to users?

Intelligence and adaptivity alone create socially blind agents that interrupt poorly and override user direction. The Intelligence-Adaptivity-Civility taxonomy shows civility—respecting boundaries, timing, and autonomy—is essential to making proactivity welcome rather than intrusive.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can agents share thoughts directly without using language?

Research formalizes inter-agent thought sharing via sparse autoencoders that recover individual, shared, and private latent thoughts from hidden states. This approach detects alignment conflicts at the representational level before they manifest in language.

When should human-agent systems ask for human help?

Magentic-UI identifies co-planning, co-tasking, action guards, verification, memory, and multitasking as mechanisms that work around the lack of ground truth for optimal deferral timing. Rather than solving the timing problem directly, these mechanisms distribute decision-making across multiple touchpoints.

Why do AI agents fail at workplace social interaction?

TheAgentCompany benchmark shows leading agents achieve 30% task completion in a simulated workplace. Social interaction, professional UI navigation, and domain-specific knowledge are the three primary failure modes, with multi-turn task performance consistently dropping to 35% across enterprise settings.

How do users mentally model dialogue agent partners?

The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
