How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
This explores two different ways a model can 'think' at inference time: running optimization steps (gradient descent that minimizes an energy score) versus generating chain-of-thought (CoT) text tokens. The two look superficially alike, since both spend extra compute at test time to get a better answer, but the corpus suggests they work through fundamentally different mechanisms.
The gradient-descent route shows up in Energy-Based Transformers (see 'Can energy minimization unlock reasoning without domain-specific training?'), which assign an energy value to each input–prediction pair and then iteratively descend that energy surface at inference until the prediction settles. The striking part is that this 'System 2' behavior emerges from unsupervised learning alone, with no math datasets, no reasoning traces, and no domain scaffolding, and the note reports that it generalizes better on out-of-distribution data. Each iteration is a genuine optimization step toward a lower-energy answer.
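The descend-until-it-settles loop described above can be illustrated with a toy example. This is only a minimal sketch, assuming a hand-written quadratic energy in place of a learned one; the names `energy`, `energy_grad_y`, and `descend`, the matrix `W`, and the step size are all hypothetical and are not taken from the Energy-Based Transformers work.

```python
import numpy as np

def energy(x, y, W):
    """Toy energy: half the squared distance between the candidate y and W @ x."""
    r = y - W @ x
    return 0.5 * float(r @ r)

def energy_grad_y(x, y, W):
    """Gradient of the toy energy with respect to the prediction y."""
    return y - W @ x

def descend(x, W, steps=50, lr=0.2, seed=0):
    """Refine a random initial prediction by gradient descent on the energy."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=W.shape[0])      # start from a random guess
    history = [energy(x, y, W)]          # energy before any refinement
    for _ in range(steps):
        y = y - lr * energy_grad_y(x, y, W)
        history.append(energy(x, y, W))  # energy after each step
    return y, history

# Assumed toy input and "model"; a real system would use a learned energy network.
W = np.array([[1.0, 2.0], [0.0, -1.0]])
x = np.array([0.5, 1.5])
y_hat, history = descend(x, W)
```

In this toy setting each step shrinks the residual by a constant factor, so `history` never increases and `y_hat` approaches `W @ x`. A learned energy surface need not be this well behaved; the sketch only shows the shape of the inference loop.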
CoT, by contrast, doesn't optimize anything continuous; it generates tokens that look like reasoning. A cluster of corpus notes argues that what CoT produces is constrained imitation of reasoning *form*, not genuine inference (see 'What makes chain-of-thought reasoning actually work?' and 'Does chain-of-thought reasoning reveal genuine inference or pattern matching?'). Format and spatial layout shape the output far more than logical content: invalid prompts can work as well as valid ones, and demo position can swing accuracy by 20% (see 'What makes chain-of-thought reasoning actually work?'). And the imitation shows: CoT degrades predictably once you push it outside the training distribution (see 'Does chain-of-thought reasoning actually generalize beyond training data?'), which is exactly the weakness the energy-descent approach claims to handle better. This is the sharpest contrast: descending an energy landscape is a real search procedure that can keep working off-distribution, whereas a CoT chain recalls a familiar schema and breaks when no schema fits.
That difference also reframes what 'more steps' means. With gradient descent, more iterations mean straightforwardly more optimization. With CoT, more tokens do *not* reliably mean more thinking: accuracy follows an inverted-U where chains that are too long hurt, and stronger models prefer shorter ones (see 'Why does chain of thought accuracy eventually decline with length?'). Most CoT tokens turn out to be documentation rather than computation. Chain of Draft matches full CoT at 7.6% of the tokens (see 'Can minimal reasoning chains match full explanations?'), and dynamic pruning can cut 75% of steps with no accuracy loss because verification and backtracking steps barely get attended to downstream (see 'Can reasoning steps be dynamically pruned without losing accuracy?'). Trace length even decouples from problem difficulty out-of-distribution, tracking schema recall instead of adaptive effort (see 'Does longer reasoning actually mean harder problems?'). So CoT 'iterations' are a noisy, partly cosmetic signal, while energy iterations are load-bearing by construction.
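The attention-based pruning idea mentioned above can be sketched as follows. This is a rough illustration, assuming each reasoning step already comes with a single downstream-attention score; the function `prune_steps`, the `keep_fraction` parameter, and the example steps and scores are made up for illustration and are not the pruning framework's actual procedure or data.

```python
def prune_steps(steps, attention, keep_fraction=0.25):
    """Keep the highest-attention steps, preserving their original order."""
    if len(steps) != len(attention):
        raise ValueError("steps and attention must have the same length")
    # Number of steps to keep; always keep at least one.
    k = max(1, round(len(steps) * keep_fraction))
    # Rank step indices from most to least attended.
    ranked = sorted(range(len(steps)), key=lambda i: attention[i], reverse=True)
    # Restore chronological order among the kept steps.
    keep = sorted(ranked[:k])
    return [steps[i] for i in keep]

# Hypothetical trace with invented attention scores (not measured values).
steps = [
    "restate problem", "set up equation", "verify setup", "solve for x",
    "backtrack", "re-verify", "double-check units", "state answer",
]
attention = [0.05, 0.30, 0.02, 0.35, 0.01, 0.02, 0.03, 0.22]

pruned = prune_steps(steps, attention)  # keeps 2 of 8 steps, a 75% cut
```

With these invented scores the verification and backtracking steps are the first to be dropped, which mirrors the pattern the note describes. Whether accuracy survives such a cut is an empirical question the sketch does not address.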
The thing you might not have expected: the corpus suggests the *amount* of inference compute matters less than the mechanism that makes it productive. Non-reasoning models can't close the gap with reasoning models even given unlimited inference budget, because the payoff comes from a training-instilled protocol, not raw token count (see 'Can non-reasoning models catch up with more compute?'). And test-time compute pays off more when it's spent on structured breadth, in the form of diverse abstractions, than on simply extending a single chain deeper (see 'Can abstractions guide exploration better than depth alone?'). Both findings point the same way the energy-descent work does: a principled optimization or search structure for spending inference compute beats just generating more sequential text.
Sources (11 notes)
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.