How do humans and AI develop accurate models of each other?
This explores whether humans and AI can build accurate working models of each other's minds and goals — and where that mutual modeling breaks down even when AI looks fluent.
Mutual modeling is a two-way street: the question is not just whether AI can read us, but whether the loop of each side updating its picture of the other actually holds together. The corpus's sharpest claim is that it usually doesn't, and that the failure isn't merely awkward conversation. Research on mutual theory of mind (see "What breaks when humans and AI models misunderstand each other?") argues that three layers of modeling have to align at once, and when they drift apart the AI doesn't just misspeak; it takes the wrong autonomous action. A Bayesian IRT study (n=667) found that theory of mind predicts collaborative performance, and that moment-to-moment shifts in how well a human models the AI influence how good the AI's responses become. So accuracy here is bidirectional and fragile: it has to be re-earned turn by turn, not established once.
What would it take to do this well? One line of work says that scaling foundation models on human feedback alone isn't the answer; you need explicit cognitive machinery. Effective "thought partners" (see "What makes an AI a true thought partner, not just a tool?") are described by Collins et al. as needing three reciprocal ingredients: mutual understanding, legibility (each side being readable to the other), and a shared model of the world. These are to be built from Bayesian theory of mind, resource-rationality, and goal planning rather than from more human feedback. That theme of shared grounding recurs in the semiotics argument (see "Can AI systems achieve real alignment without world contact?"), which warns that an AI manipulating symbols with no contact with the world can have its stated goals quietly diverge from real values. Accurate mutual models, on this view, require something to point at in common, not just matching vocabulary.
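To make "Bayesian theory of mind" concrete, here is a minimal sketch of goal inference: an observer holds a prior over what the other party wants and updates it after each observed action, assuming the actor chooses actions noisily in proportion to how useful they are for its goal. This is an illustration only, not the architecture proposed in the cited work. The goal names, action names, utility values, and the `beta` rationality parameter are all hypothetical.

```python
import math


def goal_posterior(prior, utilities, observed_actions, beta=2.0):
    """Update a belief over an agent's goal from the actions it takes.

    prior: dict mapping goal -> prior probability
    utilities: dict mapping goal -> {action: utility of that action for that goal}
    observed_actions: sequence of actions the agent was seen taking
    beta: assumed rationality; higher means the agent picks the best action more reliably
    """
    posterior = dict(prior)
    for action in observed_actions:
        for goal in posterior:
            action_utils = utilities[goal]
            # Softmax (noisily rational) likelihood of this action under this goal.
            normalizer = sum(math.exp(beta * u) for u in action_utils.values())
            likelihood = math.exp(beta * action_utils[action]) / normalizer
            posterior[goal] *= likelihood
        # Renormalize so the beliefs sum to one after each observation.
        total = sum(posterior.values())
        posterior = {g: p / total for g, p in posterior.items()}
    return posterior


# Hypothetical example: does the user want a summary or a critique?
prior = {"wants_summary": 0.5, "wants_critique": 0.5}
utilities = {
    "wants_summary": {"asks_for_gist": 1.0, "asks_for_flaws": 0.0},
    "wants_critique": {"asks_for_gist": 0.0, "asks_for_flaws": 1.0},
}
posterior = goal_posterior(prior, utilities, ["asks_for_flaws", "asks_for_flaws"])
```

After two requests for flaws, the belief shifts strongly toward `wants_critique`. The same update can in principle run in both directions, with each side maintaining a belief about the other that is revised turn by turn.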
Here's the twist you might not expect: AI is already superhuman at *predicting* us in some domains, yet that prediction isn't the same as understanding. GPT-4.5 out-judged every individual human on social appropriateness across 555 scenarios (see "Can AI learn social norms better than humans?"), but it did so from the *outside*, as a savant that never participated in making those norms (see "Can AI predict social norms better than humans?"). The same split shows up as statistical mastery sitting right next to social blindness (see "Why do AI systems fail at social and cultural interpretation?"): 100th-percentile norm prediction alongside regressions on theory-of-mind tasks. So an AI can hold an eerily accurate model of your behavior while lacking the participatory understanding that would let it model your *meaning*. This is why expert judgment is called irreducibly communicative (see "Can AI replicate the communicative work experts do?"): experts anticipate what their audience will accept as valid. An AI's fluent confidence can mimic the form of that work without performing it, which makes its output epistemically misleading.
The deepest reframe in the corpus is about what kind of difference we're even measuring. Borrowing Habermas's observer/participant split (see "Do humans and LLMs differ fundamentally or just superficially?"), humans and LLMs look utterly different as systems viewed from outside, yet inside a shared conversation both draw on the same symbolic substrate, so the gap is structural, not absolute. That matters for accuracy, because it suggests mutual modeling can sometimes work at the level of discourse even when the underlying systems have nothing in common. And there's a hint of how shared models form from scratch: agents under cooperative pressure spontaneously invent compact shared abstractions (see "Can communication pressure drive agents to learn shared abstractions?"). Accurate mutual models may be less something you install and more something that emerges from the need to coordinate. The cautionary note is responsibility: when AI seems human-like, designed mimicry and user-projected qualities create separate accountability paths (see "Who bears responsibility when AI seems human-like?"). An inaccurate human model of the AI, such as over-trusting its apparent understanding, is therefore partly engineered and partly projected, and fixing it means choosing which of the two you're targeting: system redesign or user education.
Sources (10 notes)
Research shows three layers of mutual modeling must align simultaneously in human-AI interaction, and misalignment causes incorrect autonomous action, not just miscommunication. A Bayesian IRT study (n=667) confirms that theory of mind predicts collaborative performance and that moment-to-moment ToM fluctuations influence AI response quality.
Collins et al. show that thought partners require three reciprocal desiderata grounded in behavioral science: mutual understanding, legibility, and shared world models. This demands explicit cognitive architectures—Bayesian theory of mind, resource-rationality, goal planning—rather than scaling foundation models on human feedback alone.
Peircean semiotics reveals that symbolic goal encoding without world contact and social mediation cannot guarantee correspondence to actual values. LLMs operating in pure symbol manipulation risk divergence between stated goals and real-world outcomes.
GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.
GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.
LLMs achieve 100th-percentile performance on norm prediction yet regress on theory-of-mind tasks and cannot generate culturally-resonant interpretations. The pattern shows that statistical competence coexists with absence of actual social understanding and participation.
Expertise requires anticipating audience acceptability and social validity, not just retrieving information. AI lacks the mechanism to perform this communicative work, making its fluent output epistemically misleading despite its confident form.
Applies Habermas's observer/participant distinction to AI: from outside, humans and LLMs are utterly different; from within shared discourse, both draw on the same symbolic substrate, making the difference structural rather than absolute.
ACE agents under cooperative task pressure develop shorter utterances and higher-level abstractions through neurosymbolic library learning combined with bandit-based exploration-exploitation. This demonstrates that communication efficiency emerges naturally from the need to coordinate about shared tasks.
Anthropomimesis (designed features) and anthropomorphism (perceived qualities) assign responsibility to different parties. This distinction matters because interventions must target either system redesign or user education depending on which mechanism operates.