What breaks when humans and AI models misunderstand each other?
Explores whether misalignment in mutual theory of mind between humans and AI creates only communication problems or produces material consequences in autonomous action and collaboration.
Design fictions probing operationalized mutual theory of mind (MToM) between humans and AI agents reveal that ToM in human-AI interaction is not a one-directional problem. Three layers of mutual modeling must be maintained simultaneously:
- Human's understanding of what the AI knows about them: users need to interrogate the AI's theory of mind model ("what does it know about me?"), and this knowledge shapes how they interact with the system.
- AI's representation of the human's mental model of the AI: the AI must model not just the human but the human's model of the AI's capabilities. Problems arise "when a human's mental model of an AI's capabilities doesn't align with the AI's actual capabilities": people misapply AI to domains it wasn't designed for.
- Bidirectional updating through interaction: both parties must update their models as interaction progresses. The AI learns about the user through both "chat space" (conversation) and "artifact space" (work products). The human calibrates their trust through explanations of what the AI did and why.
When these layers misalign, the consequences are material, not just communicative. Design fictions show AI agents acting on users' behalf based on predictive models — writing code, responding to messages, executing workflows. A faulty MToM doesn't just cause miscommunication; it causes incorrect autonomous action.
The design implications are specific:
- Users need signifiers of model presence — indicators that the AI is building and using a model of them
- Users need the ability to query and correct the AI's user model
- When MToM-infused AI acts on the user's behalf, recipients need signifiers that they're interacting with an AI, not the human
- Explanations are crucial for trust calibration — both what the system did and why
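The implications above can be made concrete with a small sketch. This is a hypothetical Python illustration, not an implementation taken from the design fictions: the names `Belief`, `UserModel`, `observe`, `query`, `correct`, and `act_on_behalf`, and the rule that user corrections take precedence over later inferences, are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Belief:
    """One thing the assistant believes about the user (hypothetical schema)."""
    claim: str
    source: str              # e.g. "chat" or "artifact", the two spaces named in the note
    corrected: bool = False  # True once the user has overridden this belief


@dataclass
class UserModel:
    """A user model the user can inspect and correct (illustrative only)."""
    beliefs: dict[str, Belief] = field(default_factory=dict)

    def observe(self, key: str, claim: str, source: str) -> None:
        # Assumption: a belief the user has corrected is never silently overwritten.
        existing = self.beliefs.get(key)
        if existing is not None and existing.corrected:
            return
        self.beliefs[key] = Belief(claim, source)

    def query(self) -> list[str]:
        # Answers "what does it know about me?" in readable form.
        return [f"{k}: {b.claim} (from {b.source})"
                for k, b in sorted(self.beliefs.items())]

    def correct(self, key: str, claim: str) -> None:
        # The user replaces an inferred belief with their own statement.
        self.beliefs[key] = Belief(claim, source="user correction", corrected=True)


def act_on_behalf(model: UserModel, key: str, draft: str) -> dict[str, str]:
    """Package a drafted message with a disclosure and an explanation."""
    belief = model.beliefs[key]
    return {
        "body": draft,
        # Signifier for the recipient that an AI, not the human, wrote this.
        "disclosure": "Sent by an AI assistant on the user's behalf.",
        # What the system did and why, to support trust calibration.
        "explanation": f"Drafted using the belief: {belief.claim} (from {belief.source})",
    }


# Usage: the user inspects the model, corrects it, and later inferences defer.
demo = UserModel()
demo.observe("tone", "prefers formal replies", "chat")
print(demo.query())
demo.correct("tone", "prefers brief, informal replies")
demo.observe("tone", "prefers formal replies", "artifact")  # ignored after correction
print(act_on_behalf(demo, "tone", "Sounds good, talk Thursday."))
```

The point of the sketch is that the model of the user is an explicit, inspectable object rather than hidden state, so each signifier and explanation has something concrete to refer to. A real system would also need confidence estimates and some way to expire stale beliefs, which are omitted here.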
The wider adoption scenario (MToM within an organization) shows how these dynamics scale: MToM can "reshape work practices by streamlining communications and delivering the right information to the right people at the right time" — but every efficiency gain depends on model accuracy, and every inaccuracy has downstream consequences.
Empirical evidence from a Bayesian item response theory (IRT) study of human-AI synergy (n=667) provides quantitative grounding for MToM's importance: Theory of Mind predicts collaborative performance with AI but not solo performance. Users with stronger perspective-taking achieve superior collaboration, and critically, moment-to-moment fluctuations in ToM (not just stable individual differences) influence AI response quality within sessions. This confirms that MToM is not merely a design-fiction aspiration but a measurable cognitive mechanism with quantifiable effects on collaboration outcomes. See "Does theory of mind predict who thrives in AI collaboration?"
Inquiring lines that use this note as a source 22
This note is a source for these synthesized inquiries.
- How do false agreements emerge differently from genuine bilateral convergence?
- How does anomalous knowledge state connect to the gulf of envisioning?
- When both anthropomorphism and anthropomimesis occur together, which should we address first?
- How do goal representations differ between human and AI teams?
- How do humans and AI develop accurate models of each other?
- What does the distributed cognition framework reveal about AI hallucination versus human-AI co-construction?
- Why do conventional mental models fail when applied to AI interaction?
- How does theory of mind predict success in human-AI partnerships?
- How does theory of mind predict who benefits from AI collaboration?
- Can bidirectional model updating between humans and AI reduce misalignment?
- What happens when bidirectional theory of mind between humans and AI breaks down?
- What happens to human expectations when they mistake consistent AI behavior for human behavior?
- Do culturally distinct human groups create similar attribution errors as human-AI mixtures?
- What prevents humans from adapting their behavior when competing against AI?
- Can AI systems recognize intelligence in humans the way humans recognize it in each other?
- Do AI systems need embodiment to understand social norms?
- What happens when comfortable AI interactions replace the productive friction of disagreement?
- What social norms do AI systems consistently fail to understand?
- Can multi-agent metacognitive decomposition achieve human-level theory of mind?
- What role does bidirectional model updating play in human-AI understanding?
- How does AI sycophancy affect users' ability to repair conflict?
- How does the quasi-other effect enable meaningful AI interaction?
Related concepts in this collection 3
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  MToM is the design-level solution: if models presume rather than build common ground, the architecture must externalize the common-ground-building process.
- Do users worldwide trust confident AI outputs even when wrong?
  Explores whether the tendency to over-rely on confident language model outputs transcends language and culture. Understanding this pattern is critical for designing safer human-AI interaction across diverse linguistic contexts.
  MToM misalignment is amplified by overreliance: users who don't interrogate the AI's model of them assume it's correct.
- Why do speakers need to actively calibrate shared reference?
  Explores whether using the same words guarantees speakers mean the same thing. Investigates how referential grounding differs across people and what collaborative work is needed to establish true understanding.
  MToM operationalizes calibrated shared reference in the human-AI context.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- DPMT: Dual Process Multi-scale Theory of Mind Framework for Real-time Human-AI Collaboration
- Theory of Mind abilities of Large Language Models in Human-Robot Interaction : An Illusion?
- Expedient Assistance and Consequential Misunderstanding: Envisioning an Operationalized Mutual Theory of Mind
- MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
- Evaluating Theory of Mind and Internal Beliefs in LLM-Based Multi-Agent Systems
- Quantifying Human-AI Synergy
- A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
- Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models
Original note title
mutual theory of mind between humans and AI requires bidirectional model updating and creates material consequences from misalignment