How do language models track multiple negotiating parties' commitments simultaneously?
This explores whether language models can keep a running ledger of what each side in a negotiation has committed to — tracking two (or more) parties' goals and agreements at once, rather than the single user's intent that ordinary dialogue systems assume.
The corpus suggests that holding a bilateral model of commitments (what each party has offered, accepted, or conceded) is genuinely hard for LLMs, and the reason is structural. Standard dialogue state tracking was designed to fill in one user's form ("book a table for two at 7pm"), so it has no slot for the second party's evolving demands or for the mutual agreements that exist only when both sides sign off. Negotiation breaks that assumption: agreement requires explicit buy-in from both interlocutors across multiple issues, and form-filling paradigms cannot represent that strategic, two-sided state (see: Why do standard dialogue systems fail at tracking negotiation agreement?).
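To make the structural contrast concrete, here is a minimal sketch of what a two-sided agreement state might look like next to a single-user slot frame. The `Ledger` class, its `propose` and `accept` methods, and the issue names are all hypothetical illustrations, not taken from any of the cited work.

```python
from dataclasses import dataclass, field

# Single-user form filling: one frame, one owner, no notion of consent.
booking_frame = {"party_size": 2, "time": "19:00"}


@dataclass
class Ledger:
    """Hypothetical bilateral state: per-party offers plus mutual agreements."""
    parties: tuple
    offers: dict = field(default_factory=dict)  # (party, issue) -> latest offer
    agreed: dict = field(default_factory=dict)  # issue -> value both sides accepted

    def propose(self, party, issue, value):
        # A new proposal replaces that party's standing offer on the issue
        # and reopens the issue if it had been settled.
        self.offers[(party, issue)] = value
        self.agreed.pop(issue, None)

    def accept(self, party, issue):
        # Acceptance means adopting the other side's standing offer.
        other = next(p for p in self.parties if p != party)
        value = self.offers[(other, issue)]
        self.offers[(party, issue)] = value
        self.agreed[issue] = value

    def open_issues(self, issues):
        # Issues that still lack sign-off from both parties.
        return [i for i in issues if i not in self.agreed]


ledger = Ledger(parties=("buyer", "seller"))
ledger.propose("seller", "price", 120)
ledger.propose("buyer", "price", 100)
ledger.propose("seller", "delivery", "2 weeks")
ledger.accept("buyer", "delivery")
print(ledger.agreed)
print(ledger.open_issues(["price", "delivery"]))
```

The point of the sketch is that an issue only enters `agreed` when one party accepts the other's standing offer, which is exactly the state a single slot frame has no place to store.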
There's a deeper machinery gap underneath the missing data structure. Tracking two parties means maintaining two belief states and updating each as the conversation moves from partial to shared understanding. Token-level LLMs don't natively do this. The cleanest attempt to add it borrows from pragmatics: Collaborative Rational Speech Acts (CRSA) extend the Rational Speech Acts (RSA) model so that both speakers' beliefs are tracked bidirectionally across turns, using information theory to capture how the parties converge toward a shared picture (see: Can dialogue systems track both speakers' beliefs across turns?). The fact that researchers had to bolt on an external information-theoretic framework is itself the finding: the framework supplies the bilateral bookkeeping that a vanilla LLM lacks.
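As a rough illustration of what "two belief states, each updated per turn" means, the sketch below uses plain Bayesian updating with a literal listener. It is not CRSA and omits the rate-distortion machinery entirely; the hidden types, utterance labels, and likelihood values are made-up assumptions chosen only to show the bookkeeping.

```python
def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def update(belief, likelihood, utterance):
    """Bayes rule: posterior(h) is proportional to prior(h) * P(utterance | h)."""
    return normalize({h: p * likelihood[h][utterance] for h, p in belief.items()})


# Hypothetical likelihoods P(utterance | hidden type of the speaker).
BUYER_SAYS = {
    "price_first": {"asks_discount": 0.8, "asks_rush": 0.2},
    "speed_first": {"asks_discount": 0.3, "asks_rush": 0.7},
}
SELLER_SAYS = {
    "flexible": {"counters": 0.7, "refuses": 0.3},
    "firm": {"counters": 0.2, "refuses": 0.8},
}

# One belief state per listener, both starting from a uniform prior.
seller_belief_about_buyer = {"price_first": 0.5, "speed_first": 0.5}
buyer_belief_about_seller = {"flexible": 0.5, "firm": 0.5}

dialogue = [
    ("buyer", "asks_discount"),
    ("seller", "counters"),
    ("buyer", "asks_discount"),
]
for speaker, utterance in dialogue:
    # Each utterance updates the *other* party's belief about the speaker.
    if speaker == "buyer":
        seller_belief_about_buyer = update(
            seller_belief_about_buyer, BUYER_SAYS, utterance)
    else:
        buyer_belief_about_seller = update(
            buyer_belief_about_seller, SELLER_SAYS, utterance)

print(seller_belief_about_buyer)
print(buyer_belief_about_seller)
```

Even in this toy form, the state that has to persist across turns is two distributions, not one, and nothing in a single next-token prediction guarantees either is carried forward.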
A related limitation shows up when you ask the model to hold competing readings at once. On the AMBIENT benchmark, GPT-4 correctly disambiguated only 32% of cases versus 90% for humans: LLMs struggle to keep multiple live interpretations in play simultaneously (see: Can language models recognize when text is deliberately ambiguous?). Negotiation is exactly this situation in disguise: each party's position is a separate interpretation of where the deal stands, and a model that collapses to one reading will quietly lose track of the other side's commitments. The same brittleness appears over time: models anchor on surface lexical cues and fail to adapt as a counterpart's strategy evolves across a multi-turn game (see: Can models recognize how individuals reason differently?).
Game-theoretic studies sharpen the picture and hint at fixes. Left to themselves, LLMs deviate from rational strategy and get worse as games grow more complex, but wrapping them in a structured game-theoretic workflow steers reasoning back toward near-optimal, less exploitable negotiation (see: Do language models make rational strategic decisions in games?). And the way a model tracks the other party isn't uniform: across 22 models, some reason by minimax, some by trust, and some by "belief-anticipation", explicitly modeling what the opponent will do (see: Do large language models use one reasoning style or many?). That belief-anticipation style is the closest native analog to commitment-tracking, and notably it's tied to model and game type rather than raw reasoning depth.
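For readers unfamiliar with the term, "exploitability" can be made concrete with a textbook two-player zero-sum example. The sketch below uses matching pennies and the standard definition (game value minus the payoff against a best-responding opponent); it is not the workflow from the cited study, and the function names are hypothetical.

```python
# Row player's payoffs in matching pennies (zero-sum); the game value is 0.
PAYOFF = [[1, -1],
          [-1, 1]]
GAME_VALUE = 0.0


def worst_case_payoff(strategy, payoff):
    """Expected row payoff when the column player best-responds."""
    n_cols = len(payoff[0])
    column_payoffs = [
        sum(p * payoff[r][c] for r, p in enumerate(strategy))
        for c in range(n_cols)
    ]
    return min(column_payoffs)


def exploitability(strategy, payoff, value):
    """How much the row player gives up relative to equilibrium play."""
    return value - worst_case_payoff(strategy, payoff)


equilibrium = [0.5, 0.5]  # the Nash mixed strategy for matching pennies
skewed = [0.8, 0.2]       # a predictable player who favors the first action

print(exploitability(equilibrium, PAYOFF, GAME_VALUE))
print(exploitability(skewed, PAYOFF, GAME_VALUE))
```

The equilibrium mix cannot be exploited, while the skewed mix hands the opponent a positive expected gain. A structured workflow that pushes a model's choices toward the equilibrium mix is, in this sense, reducing exploitability.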
The quietly unsettling takeaway: an LLM doesn't carry a stable commitment ledger the way a human negotiator does. Shanahan's 20-questions regeneration test shows models hold a superposition of consistent possibilities and sample one at generation time rather than committing to a fixed state (see: Do large language models actually commit to a single character?). So when a model appears to "remember" what each party agreed to, it may be re-improvising a consistent story each turn rather than maintaining one. That is why reliable multi-party tracking, in this corpus, comes from external scaffolding (explicit agreement state, RSA-style belief models, structured workflows) rather than from the model alone.
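The regeneration idea can be mimicked with a toy simulation. This is not an experiment on a real model: the object table, the attributes, and the `reveal` function are hypothetical stand-ins for an answerer that never fixes a secret object and only samples one when asked to reveal it.

```python
import random

# Toy object space with hypothetical yes/no attributes.
OBJECTS = {
    "cat": {"alive": True, "large": False},
    "elephant": {"alive": True, "large": True},
    "pebble": {"alive": False, "large": False},
    "mountain": {"alive": False, "large": True},
    "mouse": {"alive": True, "large": False},
}


def consistent(history):
    """Objects compatible with every (attribute, answer) pair given so far."""
    return [name for name, attrs in OBJECTS.items()
            if all(attrs[a] == ans for a, ans in history)]


def reveal(history, rng):
    # No secret was fixed up front: the "answer" is sampled at reveal time
    # from whatever remains consistent with the conversation.
    return rng.choice(consistent(history))


history = [("alive", True), ("large", False)]
rng = random.Random(0)

# Regenerate the final reveal many times from the same history.
regenerations = {reveal(history, rng) for _ in range(50)}
print(regenerations)
```

Every regenerated reveal agrees with the earlier answers, yet the reveals differ from one another. An external ledger sidesteps this by writing the commitment down once and feeding it back in, instead of relying on the sampler to land on the same story.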
Sources (7 notes)
Standard dialogue state tracking assumes one user's goals; negotiation requires explicit agreement from both parties across multiple issues. Existing DST models, limited to form-filling paradigms, cannot capture the strategic dynamics and mutual commitments essential to genuine bilateral agreement.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.
The AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity, suggesting that LLMs struggle to hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs struggle to anchor reasoning in temporal gameplay and adapt to evolving strategies. GPT-4o relies on surface lexical cues while DeepSeek-R1 shows early promise, but dynamic style adaptation remains largely insufficient across all models tested.
LLMs frequently fail to compute Nash equilibria, with worse performance as game complexity increases. Structured game-theoretic workflows guide reasoning toward optimal strategies, reducing exploitability and enabling near-optimal negotiation outcomes.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, which indicates that no fixed commitment exists.