Can token probability distributions extend swarm composition across different model architectures?

This explores whether composing models in their output space — the probability distribution over next tokens — could let a 'swarm' span different architectures, where composing in weight space cannot.

This reads the question as a contrast between two places you can blend models: their *weights* and their *outputs*. The corpus's clearest swarm result lives in weight space: the note "Can language models discover new expertise through collaborative weight search?" sends PSO-style 'particles' (each an LLM) drifting through a shared weight landscape until they settle on composed experts that can answer questions every starting model failed on. That trick is powerful but quietly architecture-locked: averaging or interpolating weights only makes sense when all the models share the same coordinate system, meaning the same parameter shapes and layout. Two different architectures don't have comparable weights to move through together. So weight-space swarms hit a wall the moment you want to mix, say, a small dense model with a larger one.
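To make the coordinate-system point concrete, here is a minimal sketch of weight interpolation over flat weight vectors. The function name, the toy vectors, and the idea of flattening a model into one list are illustrative assumptions, not the procedure used in the swarm note.

```python
def interpolate_weights(a, b, t):
    """Linearly interpolate two flat weight vectors (hypothetical helper).

    Equal length is only a necessary condition: real models also need the
    same layer layout before index i means the same thing in both vectors.
    """
    if len(a) != len(b):
        raise ValueError("weight vectors differ in size: no shared coordinate system")
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


# Two toy 'models' with the same shape can be blended index by index.
same_arch_a = [0.2, -1.0, 0.5]
same_arch_b = [0.4, 0.0, -0.5]
midpoint = interpolate_weights(same_arch_a, same_arch_b, 0.5)

# A toy 'model' with a different shape has nothing to line up against.
other_arch = [0.1, 0.3, -0.2, 0.9, 0.0]
try:
    interpolate_weights(same_arch_a, other_arch, 0.5)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

The second call fails before any arithmetic happens, which is the wall described above: there is no index-by-index correspondence to average over.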

Token probability distributions sidestep much of that constraint, because every model, regardless of its internals, emits a distribution over next tokens. When the models share a tokenizer, or their vocabularies can be mapped onto one another, those distributions live in the same space, so output space can serve as the common ground that weight space isn't. Nothing in the corpus demonstrates a distribution-level swarm across architectures directly, so this is a synthesis rather than a reported finding; but the pieces line up. Inference-time composition already works without touching weights: "Can evolutionary search beat sampling and revision at inference time?" runs a diversity-preserving population of candidate solutions with LLM-generated mutations and crossovers, and "How does test-time scaling work at the agent level?" frames multi-agent gains as something you buy at the output/coordination layer rather than inside any single model. These are swarms whose 'genome' is text and choices, not weights, which is precisely what makes them indifferent to what produced them.
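As a rough illustration of what composing in output space could look like, the sketch below takes a weighted average of two next-token distributions. It assumes both models expose probabilities over one shared vocabulary; models with different tokenizers would need an alignment step that is not shown. The function name, the toy tokens, and the mixing weights are all hypothetical.

```python
def mix_distributions(dists, weights):
    """Weighted linear mixture of next-token distributions (hypothetical helper).

    Each distribution is a dict mapping token -> probability over a shared
    vocabulary; a token missing from one dict is treated as probability 0.
    """
    vocab = set().union(*dists)
    total = sum(weights)
    return {
        tok: sum(w * d.get(tok, 0.0) for d, w in zip(dists, weights)) / total
        for tok in vocab
    }


# Toy outputs standing in for two models with different internals.
model_a = {"yes": 0.7, "no": 0.2, "maybe": 0.1}
model_b = {"yes": 0.3, "no": 0.6, "maybe": 0.1}

mixed = mix_distributions([model_a, model_b], [0.5, 0.5])
best = max(mixed, key=mixed.get)
```

Nothing in the function inspects how either distribution was produced, which is the sense in which the interface is architecture-agnostic. A linear mixture is only one choice; a swarm could just as well adapt the weights over time or vote on sampled tokens.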

There's also a reason mixing architectures might be worth the trouble rather than just possible. The note "Do large language models use one reasoning style or many?" finds that different models reason in genuinely distinct styles: one minimaxes, another reasons from trust, another anticipates beliefs. A distribution-level swarm could blend those complementary tendencies in a way a homogeneous weight swarm never could, because the diversity is baked into different architectures, not into different points in one model's landscape. The economic case echoes this: "Can small language models handle most agent tasks?" argues the rational design is heterogeneous by default (small models everywhere, large ones selectively), which only works if you can compose across the boundary.

The subtler payoff is *where* such composition would actually bite. The note "Do high-entropy tokens drive reasoning model improvements?" shows that only about 20% of token positions, the high-entropy forking points, carry the real decision weight; the rest are near-deterministic. That suggests a distribution-space swarm wouldn't need to negotiate every token across architectures. It would only need agreement (or productive disagreement) at the handful of pivotal branch points, which is both cheaper and more tractable than blending entire weight matrices. The catch worth keeping in view comes from "Does token spending drive multi-agent research performance?": much of the multi-agent benefit is just token spend, so the open question is whether cross-architecture distribution mixing adds *coordination* value beyond simply sampling more.
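One way to picture composing only at the branch points is an entropy gate: follow a primary model wherever it is confident, and bring in a second model only where the primary's distribution is spread out. The sketch below assumes a shared vocabulary and precomputed per-position distributions. The threshold value, the function names, and the equal-weight averaging are illustrative assumptions, not anything reported in the notes.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a token -> probability dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def compose_at_forks(primary, secondary, threshold):
    """Pick one token per position, mixing models only at high-entropy positions.

    primary and secondary are equal-length lists of distributions over a
    shared vocabulary. Returns the chosen tokens and the fork positions.
    """
    tokens, forks = [], []
    for i, (p, q) in enumerate(zip(primary, secondary)):
        if entropy(p) > threshold:
            # Forking point: average the two models' distributions.
            vocab = set(p) | set(q)
            dist = {t: 0.5 * p.get(t, 0.0) + 0.5 * q.get(t, 0.0) for t in vocab}
            forks.append(i)
        else:
            # Near-deterministic position: the primary model decides alone.
            dist = p
        tokens.append(max(dist, key=dist.get))
    return tokens, forks


primary = [{"the": 0.97, "a": 0.03}, {"rises": 0.5, "falls": 0.5}]
secondary = [{"the": 0.90, "a": 0.10}, {"rises": 0.2, "falls": 0.8}]

tokens, forks = compose_at_forks(primary, secondary, threshold=0.5)
```

In this toy run only the second position crosses the threshold, so the second model is consulted once, and its preference breaks the primary model's tie. A real system would also have to decide how to set the threshold, and whether the two models even agree on which positions are forks.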

So the honest answer is: plausibly yes, and for a clean reason — output distributions are the architecture-agnostic interface that weights aren't, the diversity of architectures is an asset rather than noise, and you'd only have to compose at the ~20% of tokens that matter. But the corpus shows the ingredients, not the finished dish; no note here builds the distribution-level cross-architecture swarm itself.

Sources 7 notes

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does token spending drive multi-agent research performance?

Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, investigate whether token probability distributions can enable swarm composition across architectures—treating this as still-open despite recent multi-agent work.

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2026; treat these as historical snapshots, not present capability:
• Weight-space swarms (PSO-style particle composition) work *within* a single architecture but fail across mismatched models because weights occupy incomparable coordinate systems (2024-10, arXiv:2410.11163).
• Token probability distributions—the vocabulary-level outputs all models share regardless of internals—offer an architecture-agnostic interface, suggesting cross-architecture blending is *possible* at inference time (synthesis from 2025-02, 2025-06, 2025-07).
• Only ~20% of tokens (high-entropy forking points) carry real decision weight; most are near-deterministic, implying cross-architecture distribution swarms need only negotiate pivotal branch points, not full matrices (2026-01, arXiv:2601.03066).
• Multi-agent gains often reduce to token-spending effects; the open question is whether cross-architecture coordination adds *strategic* value beyond sampling more (2025-11, arXiv:2512.02038).
• Heterogeneous model design (small + large) is economically rational but only works if composition across architectural boundaries is tractable (2025-06, arXiv:2506.02153).

Anchor papers (verify; mind their dates):
• arXiv:2410.11163 (2024-10): Model Swarms—weight-space PSO, architecture-locked.
• arXiv:2601.03066 (2026-01): Functional importance of reasoning tokens—identifies the ~20% signal.
• arXiv:2506.02153 (2025-06): Small LMs as agentic foundation—heterogeneous design case.
• arXiv:2512.02038 (2025-11): Deep Research survey—token-cost primacy in multi-agent systems.

Your task:
(1) RE-TEST THE ARCHITECTURE BARRIER. Has newer work (last 6 months) demonstrated or refuted cross-architecture weight interpolation? Check whether recent SAE interpretability, distillation methods, or adapter layers have relaxed the coordinate-system incompatibility. Separately, has any paper *actually built* a distribution-level swarm across architectures, or does the constraint still stand? Distinguish durable (output-space composition is harder than it looks) from perishable (we now have the engineering).
(2) Surface work from ~mid-2026 onward that contradicts the heterogeneous-model thesis or shows single-agent approaches outperform cross-architecture swarms on reasoning tasks.
(3) Propose 2 research questions: (a) Do high-entropy tokens from different architectures cluster in the same regions of probability space, or are they architecture-specific? (b) What is the minimal token-budget cost to achieve parity with a single large model using a cross-architecture distribution-weighted ensemble?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
