Mindstorms in Natural Language-Based Societies of Mind
Both Minsky’s “society of mind” and Schmidhuber’s “learning to think” inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a “mindstorm.” Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents—all communicating through the same universal symbolic language—are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents—some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence.
Introduction. Human society is composed of countless individuals living together, each acting according to their own objectives while fulfilling different specialized roles. In the 1980s, Marvin Minsky built on this idea to explain intelligence and coined the expression “society of mind” (SOM) [1], where intelligence emerges through computational modules that communicate and cooperate with each other to achieve goals that are unachievable by any single module alone. In principle, any standard artificial neural network (NN) consisting of numerous connected simple neurons could be regarded as a SOM. In the 1980s and 90s, however, more structured SOMs emerged, consisting of several NNs that were trained in different ways and interacted with one another in a predefined manner [2]. For example, one NN may be trained to execute reward-maximizing action sequences in an environment, and another NN may learn to predict the environmental consequences of these actions [3]–[9], [10, Sec. 6.1].
Discussion / Conclusion. Recurrent neural network (RNN) architectures have existed since the 1920s [71], [72]. RNNs can be viewed as primitive societies of mind (SOMs) consisting of very simple agents (neurons) that exchange information and collectively solve tasks that no single neuron can solve. However, it was only in the 1980s that more structured SOMs emerged, composed of several interacting artificial neural networks (NNs) trained in different ways [2], [3], [6], [11], [18], [10, Sec. 6.1]. In these SOMs, strict communication protocols allow certain NNs to help other NNs solve given tasks. In the less strict, more general setting of 2015’s “learning to think” [28], NNs learn to interview other NNs through sequences of vector-based queries or prompts. This general communication interface allows them to extract arbitrary algorithmic information from other NNs to facilitate downstream problem-solving.
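As a rough illustration of the interview idea, the following minimal Python sketch replaces the vector-based queries with plain strings, in the spirit of a natural language interface. It is not the implementation used in the experiments: the names `mindstorm`, `majority_answer`, `skeptic`, and `optimist` are hypothetical, the agents are hard-coded stand-ins for real NN experts, and majority voting is only one assumed way to summarize a transcript.

```python
from collections import Counter
from typing import Callable, Dict, List

# Assumption: an agent is any callable mapping a text prompt to a text reply.
Agent = Callable[[str], str]


def mindstorm(question: str, agents: Dict[str, Agent], rounds: int = 2) -> List[str]:
    """Toy interview loop: in every round each agent sees the question plus
    the transcript so far and appends its own reply to the transcript."""
    transcript: List[str] = []
    for _ in range(rounds):
        for name, agent in agents.items():
            prompt = question + "\n" + "\n".join(transcript)
            transcript.append(f"{name}: {agent(prompt)}")
    return transcript


def majority_answer(transcript: List[str]) -> str:
    """Summarize the transcript by returning the most frequent reply."""
    replies = [line.split(": ", 1)[1] for line in transcript]
    return Counter(replies).most_common(1)[0][0]


# Hypothetical stand-ins for real experts (e.g. an LLM or a vision model).
def optimist(prompt: str) -> str:
    return "yes"


def skeptic(prompt: str) -> str:
    # Changes its mind once it has read the optimist's reply.
    return "yes" if "optimist: yes" in prompt else "no"


if __name__ == "__main__":
    society = {"skeptic": skeptic, "optimist": optimist}
    log = mindstorm("Does the image show a cat?", society, rounds=2)
    print("\n".join(log))
    print("summary:", majority_answer(log))
```

Because every agent shares the same string-in, string-out interface, adding a new member to the society only requires adding one more entry to the `society` dictionary.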