Multi-agent cooperation through in-context co-player inference
Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between “learning-aware” agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between “naive learners” updating on fast timescales and “meta-learners” observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale.
Introduction. The development of foundation model agents is rapidly shifting the landscape of artificial intelligence from isolated systems to interacting autonomous agents (Aguera Y Arcas et al., 2026; Park et al., 2023; Xi et al., 2023). As these sequence-model-based agents are deployed in increasingly complex environments, they inevitably face multi-agent interactions in which outcomes depend on the joint behavior of multiple agents. Because these interactions frequently involve competing goals, ensuring that self-interested agents robustly cooperate in mixed-motive settings remains an important open challenge, even as individual agent capabilities have grown significantly. Decentralized Multi-Agent Reinforcement Learning (MARL) addresses the problem of learning to interact with other agents while having access only to local observations. However, decentralized MARL is challenging for two primary reasons: equilibrium selection and the non-stationarity of the environment (Hernandez-Leal et al., 2017; Shoham & Leyton-Brown, 2008).
Discussion / Conclusion. In this work, we have demonstrated that the complex machinery of explicit co-player learning-awareness (such as meta-gradients or rigid timescale separation) is not required to learn cooperative behaviors in general-sum games. Instead, we found that simply training agents against a diverse distribution of co-players suffices to induce in-context best-response strategies. This in-context learning renders agents susceptible to shaping, which in turn drives them toward cooperative behaviors through mutual extortion dynamics. Crucially, this result bridges the gap between multi-agent reinforcement learning and the training paradigms of modern foundation models. Since foundation models naturally exhibit in-context learning and are trained on diverse tasks and behaviors, our findings suggest a scalable and computationally efficient path for the emergence of cooperative social behaviors using standard decentralized learning techniques.