Can models adapt and combine search strategies beyond their training algorithm?

This note explores whether a model can do more than execute the fixed search procedure it was trained on: whether it can pick, blend, adapt, or even invent search strategies on the fly. The corpus is surprisingly optimistic, with one sharp caveat at the end.

The strongest case for adaptation comes from work that treats search itself as something a model learns rather than something hard-coded around it. Training on full, messy search traces, including the wrong turns and backtracking, produces models that build an internal world model of searching and improvise adaptive strategies, reaching 25% higher accuracy than models trained only on clean optimal trajectories (see "Does training on messy search processes improve reasoning?"). Push that further and you can instruction-tune on linearized traces of actual algorithms like MCTS and A*, and the model internalizes the algorithm rather than the answer, which means it can then optimize over search strategies themselves, potentially reaching novel ones (see "Can models learn to internalize search algorithms through training?"). So 'beyond the training algorithm' isn't a contradiction: the point of training on the process is to free the model from any single fixed procedure.
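To make "training on the process" concrete, here is a minimal sketch of what a serialized search trace might look like. It runs a depth-first search on a small Countdown-style arithmetic puzzle and logs every step, dead ends and backtracking included, into one string. The puzzle, the function name `search_trace`, and the log format are all illustrative assumptions; this is not the trace format used by the cited work.

```python
from itertools import combinations


def search_trace(numbers, target):
    """Depth-first search that logs every step, including dead ends.

    Returns (solved, trace), where trace is one newline-joined string.
    Hypothetical format: a toy stand-in for a serialized search trace.
    """
    trace = []

    def dfs(nums, depth):
        indent = "  " * depth
        if target in nums:
            trace.append(f"{indent}goal: {target} reached with {sorted(nums)}")
            return True
        if len(nums) == 1:
            # Nothing left to combine: record the failure explicitly.
            trace.append(f"{indent}dead end: {nums} -> backtrack")
            return False
        for i, j in combinations(range(len(nums)), 2):
            a, b = nums[i], nums[j]
            rest = [n for k, n in enumerate(nums) if k not in (i, j)]
            hi, lo = max(a, b), min(a, b)
            for op, value in (("+", hi + lo), ("*", hi * lo), ("-", hi - lo)):
                trace.append(f"{indent}try: {hi} {op} {lo} = {value}")
                if dfs(rest + [value], depth + 1):
                    return True
        return False

    solved = dfs(list(numbers), 0)
    return solved, "\n".join(trace)


if __name__ == "__main__":
    solved, trace = search_trace([2, 3, 4], 10)
    print(trace)
```

An optimal-trajectory-only dataset would keep just the successful steps. The full trace also keeps the failed branches, and the claim above is that this extra material is what carries the signal.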

The boldest answer is a system that rewrites its own search code. In a bilevel 'autoresearch' setup, an outer loop reads the inner search mechanism, spots bottlenecks, and writes new Python at runtime, discovering combinatorial-optimization and bandit methods that broke the inner loop's original deterministic patterns and delivered a 5x improvement on GPT pretraining (see "Can an AI system improve its own search methods automatically?"). That's combining and inventing strategies in the most literal sense. Quieter versions of the same idea appear at inference time: evolutionary search uses the model to generate its own mutations and crossovers, sustaining population diversity to avoid the premature convergence that simple resampling and revision fall into (see "Can evolutionary search beat sampling and revision at inference time?"), and swarms of model 'particles' move through weight space to compose experts that answer questions none of the starting models could, with no gradient-based training at all (see "Can language models discover new expertise through collaborative weight search?"). Adaptation here lives in how models are combined, not in any one model's weights.
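The evolutionary variant is easier to picture with a toy. The sketch below is a generic island-model genetic algorithm on bit strings, not the cited system: `mutate` and `crossover` are simple placeholders for the steps a language model would generate, `fitness` is a made-up objective (count the ones), and every parameter name and default is an assumption chosen for illustration.

```python
import random


def mutate(candidate, rng):
    """Placeholder for a model-proposed revision: flip one random bit."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + [1 - candidate[i]] + candidate[i + 1:]


def crossover(a, b, rng):
    """Placeholder for a model-proposed merge: one-point crossover."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def fitness(candidate):
    """Toy objective (assumption): number of ones in the bit string."""
    return sum(candidate)


def evolve_islands(n_islands=4, pop_size=8, length=20, generations=40,
                   migrate_every=10, seed=0):
    rng = random.Random(seed)
    # Each island is an independent population of random bit strings.
    islands = [[[rng.randint(0, 1) for _ in range(length)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        for idx, pop in enumerate(islands):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]  # the better half survives
            children = []
            while len(parents) + len(children) < pop_size:
                a, b = rng.sample(parents, 2)
                children.append(mutate(crossover(a, b, rng), rng))
            islands[idx] = parents + children
        if gen % migrate_every == 0:
            # Ring migration: each island's best replaces the next island's
            # worst, sharing progress without merging the populations.
            bests = [max(pop, key=fitness) for pop in islands]
            for idx, pop in enumerate(islands):
                worst = min(range(pop_size), key=lambda k: fitness(pop[k]))
                pop[worst] = list(bests[idx - 1])
    return max((ind for pop in islands for ind in pop), key=fitness)


if __name__ == "__main__":
    best = evolve_islands()
    print(fitness(best), best)
```

Keeping the islands separate between migrations is the design choice that matters here: each population can converge on its own region without dragging the others along, which is the diversity-preserving behavior the paragraph above describes.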

There's a parallel thread on adapting which skills to deploy rather than which search to run. Models can compose task-specific expert vectors at inference, dynamically mixing them per problem without retraining (see "Can models dynamically activate expert skills at inference time?"), and self-play setups generate their own curriculum of problems and verify their own answers, improving without any external data or fixed target (see "Can language models improve themselves without any external training data?"). Even cheap, weightless adaptation works: agents that store written reflections on their failures in episodic memory get better across attempts without a single parameter update (see "Can agents learn from failure without updating their weights?"), and tree search can manufacture its own quality signal in place of human feedback (see "Can tree search replace human feedback in LLM training?").
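The weightless version can be sketched in a few lines. In this toy loop, `act` stands in for the agent's policy and `reflect` for its self-written diagnosis; both are hypothetical stand-ins with no language model involved, and the only thing that changes between episodes is a list of plain-text notes.

```python
def act(options, reflections):
    """Stand-in policy: pick the first option no stored note rules out."""
    for option in options:
        if not any(f"'{option}'" in note for note in reflections):
            return option
    return None


def reflect(action):
    """Stand-in for a model-written self-diagnosis after a failure."""
    return f"Choosing '{action}' failed; avoid it next time."


def run_episodes(options, is_success, max_episodes=5):
    """Retry a task, learning only through notes kept in episodic memory.

    Returns (winning_action, episodes_used, memory). On failure the first
    element is None and the second is the number of failed attempts.
    """
    memory = []  # plain-text reflections, stored uncompressed
    for episode in range(1, max_episodes + 1):
        action = act(options, memory)
        if action is None:  # every option has already been ruled out
            break
        if is_success(action):  # binary environment feedback
            return action, episode, memory
        memory.append(reflect(action))
    return None, len(memory), memory


if __name__ == "__main__":
    result = run_episodes(["grep", "sed", "awk"], lambda a: a == "awk")
    print(result)
```

Nothing resembling a parameter is updated anywhere in the loop. The improvement across episodes comes entirely from the policy reading its own earlier notes, which is the mechanism the paragraph above attributes to reflection-based agents.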

Here's the caveat you might not expect: a lot of apparent 'adaptation' is an illusion of memorization. RL fine-tuning often sharpens template-matching rather than installing a real procedure, and models that look like they learned to optimize show sharp performance drops on out-of-distribution variants of the same task (see "Do fine-tuned language models actually learn optimization procedures?"). And on genuine constrained-optimization problems, models plateau around 55–60% constraint satisfaction regardless of scale, architecture, or training regime, which reads as a ceiling rather than a gap waiting for more compute (see "Do larger language models solve constrained optimization better?"). So the corpus splits cleanly. When search is made explicit (trained on as a process, evolved at inference, or rewritten by an outer loop), models genuinely combine and extend strategies. When you just fine-tune and hope the strategy generalizes, you often get a memorized template wearing the costume of a search algorithm.

Sources (11 notes)

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Can models learn to internalize search algorithms through training?

Meta-CoT demonstrates that instruction-tuning on linearized MCTS and A* traces teaches models to implement search strategies internally. This enables optimization over algorithms themselves rather than specific outputs, potentially unlocking novel reasoning strategies.

Can an AI system improve its own search methods automatically?

An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.

Can evolutionary search beat sampling and revision at inference time?

Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
