What causes language models' strategic rationality to decline with increased game complexity?
This explores why LLMs play games less rationally as the games get more complex — and the corpus suggests the real culprit may not be complexity at all, but unfamiliarity, missing scaffolding, and reasoning shortcuts that only look like rationality.
The obvious answer, that bigger game trees overwhelm the model's compute, turns out to be only half the story. The more interesting half is that the decline may not be about complexity per se at all.
The surface phenomenon is well documented: models frequently fail to compute Nash equilibria, and their play drifts further from optimal as games grow (see "Do language models make rational strategic decisions in games?"). But a sharper diagnosis reframes the cause entirely: reasoning models break not at complexity *thresholds* but at *novelty* boundaries. They fit instance-based patterns rather than learning a generalizable algorithm, so a long, hard reasoning chain still succeeds if the model has seen similar instances, while a short, simple one fails if it is unfamiliar (see "Do language models fail at reasoning due to complexity or novelty?"). Under this view, 'complex' games show declining play because complexity correlates with unfamiliarity, not because the model runs out of strategic horsepower.
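One way to make "drifts further from optimal" concrete is exploitability: how far below the game's value a best-responding opponent can push a given strategy. The sketch below is a minimal illustration for a two-player zero-sum matrix game, using rock-paper-scissors as a stand-in. The function name `exploitability`, the payoff matrix, and the example strategies are all assumptions made for illustration; the cited notes do not specify how they measure distance from equilibrium play.

```python
def exploitability(payoff, row_strategy, game_value):
    """Gap between the game value and the row player's worst-case payoff.

    payoff[r][c] is the row player's payoff in a zero-sum game.
    row_strategy is a probability distribution over rows.
    game_value is the row player's payoff at equilibrium (assumed known).
    """
    n_cols = len(payoff[0])
    # Expected row payoff against each pure column action.
    col_payoffs = [
        sum(p * payoff[r][c] for r, p in enumerate(row_strategy))
        for c in range(n_cols)
    ]
    # A best-responding column player picks the action that hurts the row player most.
    worst_case = min(col_payoffs)
    return game_value - worst_case


# Rock-paper-scissors from the row player's perspective (rows/cols: rock, paper, scissors).
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

uniform = [1 / 3, 1 / 3, 1 / 3]   # the equilibrium mix
biased = [0.5, 0.25, 0.25]        # hypothetical play that over-weights rock

uniform_gap = exploitability(RPS, uniform, game_value=0.0)
biased_gap = exploitability(RPS, biased, game_value=0.0)
```

Under these assumptions the uniform mix has an exploitability of zero, while the rock-heavy mix can be exploited by always playing paper. The same calculation grows more expensive as the action space grows, which is part of why equilibrium play is harder to verify in larger games.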
A second cause is that what looks like strategic reasoning is often a heuristic wearing reasoning's clothes. Twelve of fourteen models tested actually perform *worse* when constraints are removed, by up to 38.5 percentage points. They were defaulting to the harder, more conservative option rather than evaluating the situation, so stripping away the constraint that propped up that default exposes the absence of real reasoning (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Complexity tends to add degrees of freedom that defeat such shortcuts, which is why rationality erodes exactly where the crutch disappears. Relatedly, different models lean on different fixed reasoning styles (minimax, trust-based, belief-anticipation), and performance tracks how well a style happens to fit the game's structure rather than raw reasoning depth (see "Do large language models use one reasoning style or many?"). A complex game that mismatches a model's native style will look like a complexity failure but is really a style failure.
The third cause is a memory-and-state problem. Strategic play in richer games demands tracking an evolving history and an opponent's shifting strategy, and models are bad at this without help. Even in simple multi-armed bandit environments, only GPT-4 *with* explicit exploratory hints, chain-of-thought, and external history summarization explores competently; without summarization, models cannot reliably aggregate unstructured interaction history into good decisions (see "Why do LLMs struggle with exploration in simple decision tasks?"). The same brittleness shows up in dynamic games, where models cling to surface lexical cues and fail to anchor reasoning in the temporal flow of play or adapt to an opponent who changes (see "Can models recognize how individuals reason differently?"). Complexity multiplies the state to track, and that is where unaided in-context reasoning collapses.
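To illustrate what external history summarization might look like in a bandit setting, here is a minimal sketch that collapses a raw log of pulls into per-arm counts and mean rewards before anything is handed to a model. The function names `summarize_history` and `format_summary`, the `(arm, reward)` log format, and the output wording are hypothetical; the cited note does not describe its summarization format in this text.

```python
from collections import defaultdict


def summarize_history(history):
    """Collapse a raw list of (arm, reward) pulls into per-arm statistics."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for arm, reward in history:
        totals[arm] += reward
        counts[arm] += 1
    return {
        arm: {"pulls": counts[arm], "mean_reward": totals[arm] / counts[arm]}
        for arm in counts
    }


def format_summary(summary, all_arms):
    """Render the statistics as short text, flagging arms never tried."""
    lines = []
    for arm in all_arms:
        if arm in summary:
            stats = summary[arm]
            lines.append(
                f"arm {arm}: pulls={stats['pulls']}, "
                f"mean_reward={stats['mean_reward']:.2f}"
            )
        else:
            # Untried arms are listed explicitly so they are not silently ignored.
            lines.append(f"arm {arm}: pulls=0, mean_reward=unknown")
    return "\n".join(lines)


# Hypothetical interaction log: (arm, reward) in the order the pulls happened.
history = [("A", 1), ("B", 0), ("A", 0), ("A", 1), ("B", 1)]
summary = summarize_history(history)
summary_text = format_summary(summary, all_arms=["A", "B", "C"])
```

The design choice worth noting is that the aggregation happens outside the model: the prompt would carry a fixed-size table rather than an ever-growing transcript, which is the kind of structure the paragraph above suggests models fail to build on their own.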
The most useful takeaway: the decline is largely *fixable from the outside*. Structured game-theoretic workflows that scaffold the reasoning steps restore near-optimal play and reduce exploitability even on hard negotiations (see "Do language models make rational strategic decisions in games?"), and external summarization plus explicit exploratory hints rescue exploration (see "Why do LLMs struggle with exploration in simple decision tasks?"). That is the tell that the bottleneck is not a missing capacity for strategy but a missing structure for deploying it: the model often *has* the rationality and fails to organize it on its own as the game gets bigger.
Sources (6 notes)
LLMs frequently fail to compute Nash equilibria, with worse performance as game complexity increases. Structured game-theoretic workflows guide reasoning toward optimal strategies, reducing exploitability and enabling near-optimal negotiation outcomes.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.
Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.
LLMs struggle to anchor reasoning in temporal gameplay and adapt to evolving strategies. GPT-4o relies on surface lexical cues while DeepSeek-R1 shows early promise, but dynamic style adaptation remains largely insufficient across all models tested.