From the Archive · 2026-05-31

Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?

Franka Bause, Jonas Niederle, Martin Pawelczyk, et al. · arXiv:2605.25929

This work frames multi-agent deliberation as a dynamical process where influence crystallizes around observable signals—confidence, alignment, perceived expertise—rather than actual competence, a reframing that invites comparison with how cognitive diversity and expertise interact in team problem-solving. The mixture-of-experts lens suggests that routing to competent agents should unlock multi-agent gains, but this hinges on a practical puzzle: since true competence remains latent, which proxies actually correlate with performance, and when do confident or aligned agents mislead the system? Recent work has documented how multi-agent advantages depend on the specific task and baseline model capability, raising the question of whether Friedkin-Johnsen dynamics could explain when deliberation helps versus hurts. If influence follows shallow signals like stated confidence rather than underlying accuracy, could coordination failures at scale partly reflect the wrong agents becoming influencers—and might better influence-detection mechanisms recover the promised multi-agent edge?
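For readers unfamiliar with the Friedkin-Johnsen model mentioned above, here is a minimal sketch of its standard update rule. It is the textbook opinion-dynamics model, not code from the paper, and the weights, susceptibilities, and initial opinions are made-up illustrative values. Each agent blends the weighted opinions of the group with its own initial opinion, so a stubborn agent that others weight heavily can pull the group toward its view regardless of whether it is right.

```python
def friedkin_johnsen(weights, susceptibility, initial, steps=200):
    """Iterate x_i(t+1) = s_i * sum_j W_ij * x_j(t) + (1 - s_i) * x_i(0).

    weights:        row-stochastic influence matrix W (each row sums to 1)
    susceptibility: s_i in [0, 1]; low values mean a stubborn agent
    initial:        starting opinions x(0)
    """
    opinions = list(initial)
    for _ in range(steps):
        opinions = [
            s * sum(w * x for w, x in zip(row, opinions)) + (1.0 - s) * x0
            for row, s, x0 in zip(weights, susceptibility, initial)
        ]
    return opinions


# Hypothetical setup: opinion 1.0 is "correct", 0.0 is "wrong".
# Agent 0 is wrong but stubborn; agents 1 and 2 start out correct,
# are easily swayed, and place most of their weight on agent 0.
W = [
    [1.0, 0.0, 0.0],
    [0.6, 0.2, 0.2],
    [0.6, 0.2, 0.2],
]
s = [0.1, 0.9, 0.9]
x0 = [0.0, 1.0, 1.0]

final = friedkin_johnsen(W, s, x0)
```

In this toy configuration the two initially correct agents settle near 0.156, far closer to the stubborn agent's wrong answer than to where they started, which is the "wrong influencer" failure the blurb describes.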

Adjacent research

Can multi-agent teams automatically remove their weakest members?
How do LLM debates differ from human expert consensus?
Does cognitive diversity alone improve multi-agent ideation quality?

Go deeper into Agentic Systems and Planning→

Agentic Systems as Boosting Weak Reasoning Models

Varun Sunkaraneni, Pierfrancesco Beneventano, Riccardo Neumarker, et al. · arXiv:2605.14163

Recent work has documented a curious inversion: test-time search over weak models often demands diversity over raw capability, yet the mechanism by which committees of weaker reasoners converge on stronger answers remains incompletely understood. This paper dissects that mechanism with precision, formalizing why sheer proposal volume cannot alone rescue weak reasoning—what matters is whether a critic can *identify* correct solutions without the oracle knowledge that found them. That framing connects to a broader tension: ensemble consensus can exceed any individual expert, but only if the ensemble's training embeds the right inductive bias, whereas this work shows the selection stage itself must carry a local soundness signal (execution, proof checking, constraints) to amplify coverage into reliable performance. The empirical result—that a cheap nano model paired with a strong critic nearly matches expensive frontier models—is tantalizing, but the honest diagnosis is sobering: the remaining failures cluster in *shared blind spots* across all proposals, suggesting that no amount of better selection can overcome what the weak model pool never generates in the first place. What does it tell us about the structure of reasoning tasks when diverse weak proposals fail not in choosing among solutions, but in covering the solution space itself?
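The distinction drawn above between selection and coverage can be made concrete with a small sketch. This is an illustration under simple assumptions, not the paper's system: the function name `select_verified`, the toy root-finding task, and both proposal pools are hypothetical. The critic here is a local soundness check that verifies a candidate directly, without needing to know how to produce one.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def select_verified(
    proposals: Iterable[T], is_sound: Callable[[T], bool]
) -> Optional[T]:
    """Return the first proposal that passes the soundness check, else None."""
    for candidate in proposals:
        if is_sound(candidate):
            return candidate
    return None


def is_root(x: int) -> bool:
    """Toy critic: checks whether x solves x^2 - 5x + 6 = 0 by substitution."""
    return x * x - 5 * x + 6 == 0


# A noisy pool that happens to contain a correct answer (3).
covered_pool = [0, 1, 7, 3, 4]
# A pool with a shared blind spot: no proposal is a root.
blind_pool = [0, 1, 7, 4, 5]

hit = select_verified(covered_pool, is_root)   # selection succeeds
miss = select_verified(blind_pool, is_root)    # nothing sound to select
```

With the first pool the critic recovers the correct answer from mostly wrong proposals. With the second, the same critic returns nothing, because a perfect selector cannot pick an answer the proposers never generated.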

Adjacent research

Can models trained on many imperfect experts outperform each one?
Why does parallel reasoning outperform single chain thinking?
Do iterative refinement methods suffer from overthinking?

Go deeper into Reasoning and Learning Architectures→

Scaling Laws for Agent Harnesses via Effective Feedback Compute

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, et al. · arXiv:2605.29682

As agent systems become more complex, researchers are shifting focus from raw computational expenditure to the quality and durability of feedback loops — a move that mirrors broader questions about what information agent feedback actually carries and how to measure agent performance beyond one-shot success. This work introduces Effective Feedback Compute (EFC) as a trace-level metric that credits only informative, valid, and retained feedback, showing that harness performance is governed less by compute budget than by how efficiently that budget converts into task-sufficient interaction patterns. The findings raise a deeper tension: if feedback quality matters more than quantity, and if agent reliability emerges from externalizing cognitive work into memory and protocols rather than model capacity alone, then how should we redesign the feedback loops themselves — and does optimizing for feedback informativeness require rethinking what we mean by "tool use" in the first place?

Adjacent research

Can architecture choices improve inference efficiency without sacrificing accuracy?
Can scalar rewards capture all the information in agent feedback?
What should we actually measure in agent evaluation?

Go deeper into Agentic Systems and Planning→

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Shanghua Gao, Ada Fang, Marinka Zitnik · arXiv:2605.28655

The tension between centralized coordination and distributed autonomy has long shadowed multi-agent research systems—most prior work either locks agents into a single research path or relies on a planner with fixed goals, limiting their ability to learn from failure and explore in parallel. This paper joins a growing conversation about whether research domains suitable for autonomous optimization benefit from decentralized team structures that can self-organize around evidence, and the results suggest they do: agents that critique proposals before spending compute and share both successes and failures outperform single-agent baselines across biomedical, protein, and language-model domains. The deeper question here isn't whether teams beat individuals—it's whether the same computational resources that accelerate discovery also reshape how agents should organize to spend those resources wisely. Yet this raises a companion puzzle: if decentralized teams succeed by preserving institutional memory of failed directions and dynamically forming hypotheses around emerging evidence, what role remains for human researchers in the loop, and at what point does the speed of agent consensus outpace the speed of actual scientific insight?

Adjacent research

What makes a research domain suitable for autonomous optimization?
Can careful selection of 78 demos outperform massive training datasets?
Can computational power accelerate scientific discovery itself?

Go deeper into Agentic Systems and Planning→

Understanding and Mitigating Premature Confidence for Better LLM Reasoning

Jingchu Gai, Guanning Zeng, Christina Baek, et al. · arXiv:2605.24396

Recent work has shown that longer reasoning chains often mask rather than solve underlying quality problems, and researchers have increasingly turned to model confidence itself as a reward signal rather than relying on expensive external verifiers. This paper's diagnosis of premature confidence—the tendency to lock in an answer early and then rationalize it—names a specific failure mode that existing scaling approaches miss, and the proposed solution works without step-level annotations by simply asking models to update their confidence *during* reasoning. The tension here is intriguing: if confidence patterns can guide reasoning quality, what does it mean that confidence signals appear to encode deeper trade-offs between overthinking and underthinking, and might forcing gradual confidence growth simply shift the problem rather than solve it? Whether this method succeeds because it addresses a root cause or because it encourages longer, more measured deliberation—and whether those are even distinguishable—remains an open question worth testing across more diverse domains.

Adjacent research

Can reasoning and answers be generated separately in language models?
Can model confidence alone replace external answer verification?
Does extended thinking actually improve reasoning or just increase variance?

Go deeper into Reasoning and Knowledge→

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, et al. · arXiv:2605.27295

As modality competition in multimodal models becomes increasingly understood as a design problem, the promise of unified embedding spaces across video, audio, image, and text naturally raises the question of whether architectural choices—rather than raw model scale—determine how well these modalities coexist in a single vector space. The work also sits at an intersection with questions about temporal alignment in video retrieval tasks, since embedding video without explicit temporal structure may smooth over crucial temporal relationships that downstream applications like RAG need to resolve. Perhaps more fundamentally, claims of strong zero-shot transfer across specialized domains invite scrutiny of what geometric limits single-vector embeddings face when representing diverse combinations of modalities—can a unified representation truly capture the semantic density needed across astronomy, cuisine, and code, or does unification require trade-offs that only emerge at application time?

Adjacent research

Can we solve modality competition through architectural design?
Why do decoder-only models underperform as text encoders?
Why do vision and language scale so differently?

Go deeper into Reasoning and Knowledge→