What scaling behavior do partial systems show without iterative query refinement?
This explores how search-and-retrieval systems scale when they are simply given more compute (more search steps, more parallel attempts) rather than being allowed to iteratively rewrite their queries, and what that reveals about where extra effort pays off and where it hits a wall.
The question is what happens when you scale a system by brute force, with more search budget and more parallel reasoning, instead of making it smarter about reformulating what it is looking for. The corpus has a surprisingly clean answer, and it cuts both ways.
The encouraging half: search behaves like a compute dial. Agentic deep research shows that the number of search steps follows almost exactly the same scaling curve as reasoning tokens: pour in more retrieval and answer quality climbs, then flattens into diminishing returns (Does search budget scale like reasoning tokens for answer quality?). This reframes search not as a fixed lookup but as a knob you can trade against reasoning, a genuine inference-compute axis (How does search scale like reasoning in agent systems?). And the scaling doesn't have to go deeper to pay off; it can go wider. Sampling many parallel paths through the solution space matches the benefits of longer serial chains without paying the latency cost of depth (Can reasoning systems scale wider instead of only deeper?). So a 'partial' system left to grind without query refinement still improves with scale, just along a predictable, eventually flattening curve.
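The flattening shape of width scaling can be illustrated with a toy calculation. This is a minimal sketch under a loud assumption that does not come from the notes: each parallel attempt succeeds independently with the same probability `p`. The function name `success_at_width` and the value `p = 0.2` are hypothetical, chosen only for illustration.

```python
def success_at_width(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds.

    Assumption (not from the source notes): attempts are independent
    and each succeeds with the same probability p.
    """
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-attempt success rate, chosen only for illustration.
P_SINGLE = 0.2

widths = list(range(1, 11))
curve = [success_at_width(P_SINGLE, n) for n in widths]

# Marginal gain from adding one more parallel attempt.
gains = [later - earlier for earlier, later in zip(curve, curve[1:])]

for n, value in zip(widths, curve):
    print(f"width={n:2d}  success={value:.3f}")
```

Each extra attempt adds less than the one before it, which is the diminishing-returns pattern described above. In this toy model the curve approaches 1; real systems need not, since independence and a fixed per-attempt success rate are simplifications.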
The sobering half: scale hits ceilings that no amount of budget moves. On constrained-optimization tasks, LLMs converge to roughly 55–60% constraint satisfaction regardless of parameter count, architecture, or training regime; reasoning models don't systematically beat standard ones, which points to a fundamental wall rather than a scaling gap (Do larger language models solve constrained optimization better?). Frontier reasoning models reach only 20–23.6% exact match on constraint satisfaction problems that demand genuine backtracking, even though they sound fluent while attempting them (Can reasoning models actually sustain long-chain reflection?). The reason is architectural: autoregressive generation can't retract a token it has already emitted, while solving these problems requires discarding bad partial attempts and trying again, which is exactly the move scaling alone can't supply (Why does autoregressive generation fail at constraint satisfaction?).
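The difference between committing to choices and being able to discard them can be seen on a classic constraint satisfaction problem. The sketch below is a toy analogy, not a model of a transformer: `commit_only` stands in for a process that can never undo an earlier choice, and `backtracking` stands in for a solver that can. Both function names and the choice of 8-queens are illustrative assumptions, not taken from the notes.

```python
from typing import List, Optional


def is_safe(cols: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks no earlier queen."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def commit_only(n: int) -> List[int]:
    """Take the first safe column in each row and never undo a choice.

    Returns the (possibly incomplete) list of placed columns.
    """
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if is_safe(cols, c)), None)
        if choice is None:
            break  # dead end: earlier choices cannot be retracted
        cols.append(choice)
    return cols


def backtracking(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Depth-first search that discards invalid partial assignments."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for c in range(n):
        if is_safe(cols, c):
            cols.append(c)
            result = backtracking(n, cols)
            if result is not None:
                return result
            cols.pop()  # retract the choice and try the next column
    return None


stuck = commit_only(8)
solved = backtracking(8)
print(f"commit-only placed {len(stuck)} of 8 queens")
print(f"backtracking solution: {solved}")
```

On the 8-queens board the commit-only strategy runs out of safe columns before the board is full, while the backtracking solver completes it by undoing earlier placements. Running the commit-only strategy for longer would not help, because the failure comes from the inability to retract rather than from a lack of steps.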
Put together, the surprise is that 'more compute' and 'better querying' fix different things. Throwing search budget at a task buys you the smooth scaling curve, but where retrieval fails it tends to fail structurally (wrong trigger timing, embeddings that measure association rather than relevance, hard mathematical limits on what a vector can represent), and those are not problems you tune your way out of by scaling (Where do retrieval systems fail and why?). The thing you didn't know you wanted to know: a system without iterative refinement can look like it's improving right up until it isn't, because the scaling curve and the architectural ceiling are two separate phenomena stacked on top of each other.
Sources (7 notes)
Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.
Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
DeepSeek-R1 and o1-preview achieve only 20-23.6% exact match on 850 constraint satisfaction problems requiring genuine backtracking. This ceiling reveals that reflective reasoning fluency does not translate to actual problem-solving competence on unfamiliar instance structures.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.