AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AUTOSCIENTISTS, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AUTOSCIENTISTS improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AUTOSCIENTISTS achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%.
Introduction. AI agents for science are beginning to move beyond answering questions and running predefined workflows toward proposing and executing research steps [1], from protein engineering in biology to language model optimization in machine learning [2, 3]. Agents can generate hypotheses, synthesize literature, design computational experiments, write and execute code, and refine models from experimental feedback [4–10]. However, most current approaches remain limited to short-horizon optimization or fixed pipelines. They typically follow a single reasoning thread or use a search-space decomposition set at the start of the run. This assumption breaks down in long-running scientific experimentation, where research directions are not known in advance and change over time. Existing AI agents can run experiments, but long-running science requires more: maintaining competing hypotheses, updating them as evidence changes, and using failures to redirect the search.
Discussion / Conclusion. AUTOSCIENTISTS is not designed to be more LLM-call efficient than single-agent baselines. As shown in Table S8, AUTOSCIENTISTS uses more LLM tokens than Autoresearch, though within the same order of magnitude, reflecting its use of multiple agents for parallel reasoning, discussion, and team reorganization. Instead, AUTOSCIENTISTS is designed to improve experimental search under a fixed experimental-compute budget by enabling teams of agents to explore and collaborate over the design space. Under such a budget, it outperforms existing methods, as shown in Figures 3 and 4. As part of the matched experimental-compute budget used for fair comparison on BioML-Bench, we restricted AUTOSCIENTISTS to one H100 GPU per task, so GPU-bound experiments in that evaluation were executed sequentially. This setting evaluates experiment selection under matched compute but does not fully exercise AUTOSCIENTISTS’s capacity for parallel experimentation.
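To make the matched-budget setting concrete, the following is a minimal illustrative sketch, not the AUTOSCIENTISTS implementation. It shows experiments run one at a time against a fixed experimental-compute budget, with a shared record of outcomes so that directions already tried, including failed ones, are not re-run. All names here (`Proposal`, `SharedState`, `run_sequentially`, the `baseline` threshold, and the per-experiment `cost`) are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Proposal:
    hypothesis: str  # short label for the direction being tested (hypothetical)
    cost: float      # experimental compute the run would consume, e.g. GPU-hours


@dataclass
class SharedState:
    """Hypothetical shared record that every agent can read and append to."""
    budget: float                                            # total experimental compute
    spent: float = 0.0                                       # compute consumed so far
    results: Dict[str, float] = field(default_factory=dict)  # hypothesis -> score
    failed: Set[str] = field(default_factory=set)            # directions that did not help

    def remaining(self) -> float:
        return self.budget - self.spent


def run_sequentially(
    proposals: List[Proposal],
    state: SharedState,
    run_experiment: Callable[[Proposal], float],
    baseline: float,
) -> Optional[str]:
    """Run proposals one at a time, as on a single GPU.

    A proposal is skipped if its hypothesis was already explored (successfully
    or not) or if it would exceed the remaining budget. Returns the
    best-scoring hypothesis that beat the baseline, or None.
    """
    for p in proposals:
        if p.hypothesis in state.results:
            continue  # already explored; the shared record avoids redundant runs
        if p.cost > state.remaining():
            continue  # would exceed the fixed experimental-compute budget
        score = run_experiment(p)
        state.spent += p.cost
        state.results[p.hypothesis] = score
        if score <= baseline:
            state.failed.add(p.hypothesis)  # record the failure for other agents
    successes = {h: s for h, s in state.results.items() if h not in state.failed}
    return max(successes, key=successes.get) if successes else None
```

In this sketch the budget constrains only experiment execution, which mirrors the distinction drawn above between experimental compute and LLM tokens; the cost of proposing and critiquing experiments is deliberately left unmodeled.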