Can latent recurrence and energy minimization both escape the same computational depth constraints?

This explores whether two very different inference tricks — looping a model's hidden state recurrently, and running gradient descent to minimize an energy score — are both ways around the same wall: a fixed-depth transformer can only do a bounded amount of sequential computation per token.

The obstacle in question is that a standard transformer has a fixed number of layers, so it can perform only a bounded amount of step-by-step reasoning before it must emit an answer. Theory pins this down: fixed-depth transformers sit inside complexity classes like AC0/TC0, which means there are problems they cannot solve no matter how wide or well trained they are. The corpus has two camps attacking that ceiling from opposite directions, and the interesting answer is that they escape it in genuinely different ways.

The recurrence camp adds depth by looping. The Hierarchical Reasoning Model (HRM) couples a slow planning loop with a fast computation loop running at different timescales, and its headline claim is precisely that this lets a 27M-parameter model 'escape the AC0/TC0 complexity ceiling' and solve Sudoku and mazes on which chain-of-thought transformers fail completely (see 'Can recurrent hierarchies achieve reasoning that transformers cannot?'). Recurrence turns a fixed stack of layers into a computation that can be unrolled for as many steps as the problem needs. Related work on GRAM shows that the loop can be made *stochastic* rather than deterministic, so the model holds a distribution over solutions instead of committing early (see 'Can stochastic latent reasoning help models explore multiple solutions?'), and that it can even scale sideways by sampling parallel latent trajectories instead of only going deeper (see 'Can reasoning systems scale wider instead of only deeper?'). The shared idea across these: effective depth is decoupled from architectural depth.
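
As a rough illustration of effective depth being decoupled from architectural depth, the sketch below applies one fixed transition over and over to a small maze. It is not HRM or GRAM: the transition `step` is hand-written rather than learned, and `unroll`, `GRID`, and `max_steps` are hypothetical names chosen for this example. The `max_steps` cap plays the role of a fixed layer count.

```python
def step(reached, grid):
    """One application of a fixed transition: grow the reached set by one move."""
    rows, cols = len(grid), len(grid[0])
    grown = set(reached)
    for r, c in reached:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                grown.add((nr, nc))
    return grown


def unroll(grid, start, max_steps=None):
    """Apply `step` repeatedly; stop at a fixed point or after max_steps."""
    reached, steps = {start}, 0
    while max_steps is None or steps < max_steps:
        grown = step(reached, grid)
        if grown == reached:  # fixed point: further looping changes nothing
            break
        reached, steps = grown, steps + 1
    return reached, steps


# Toy maze (an assumption for illustration): '#' is a wall, S the start, G the goal.
GRID = [
    "S.#....",
    "#.#.##.",
    "#...#G.",
]
START, GOAL = (0, 0), (2, 5)

shallow, _ = unroll(GRID, START, max_steps=4)  # "fixed depth": four applications
deep, used = unroll(GRID, START)               # loop until nothing changes

print(GOAL in shallow)  # goal lies beyond four moves
print(GOAL in deep, used)
```

The same `step` is used in both calls; only the number of times it is applied differs. In this toy setting the capped run cannot see the goal because the shortest path is longer than the cap, while the uncapped loop keeps going until it converges.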

The energy camp gets there differently. Energy-Based Transformers don't loop a hidden state forward. They assign an energy score to each input-prediction pair and then *minimize* that energy by gradient descent at inference time (see 'Can energy minimization unlock reasoning without domain-specific training?'). Each optimization step is an extra increment of computation that the fixed forward pass didn't have, and crucially the model decides how many steps to spend; the reported payoff is 29% more inference-compute gains than Transformer++ and better out-of-distribution generalization. So the answer to the literal question is yes: both add effective depth the base transformer lacks. But recurrence does it by *unrolling a learned transition*, while energy minimization does it by *descending a learned landscape*. One is iterate-the-state; the other is optimize-against-a-score.
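
To make optimize-against-a-score concrete, here is a minimal sketch of inference as gradient descent on an energy. It is not the Energy-Based Transformer implementation: the energy is a hand-written function (zero when `y * y == x`) standing in for a learned one, the gradient is written out analytically, and `minimize_energy`, `lr`, `tol`, and `max_steps` are hypothetical names and values. Stopping once the energy falls below `tol` is a simple proxy for the model deciding how many steps to spend.

```python
def energy(x, y):
    """Hand-written stand-in for a learned energy: zero when y*y == x."""
    return (y * y - x) ** 2


def energy_grad(x, y):
    """Analytic derivative of `energy` with respect to the prediction y."""
    return 4.0 * y * (y * y - x)


def minimize_energy(x, y0=1.0, lr=0.01, tol=1e-8, max_steps=1000):
    """Refine the prediction y by gradient descent until the energy is below tol."""
    y, steps = y0, 0
    while energy(x, y) > tol and steps < max_steps:
        y -= lr * energy_grad(x, y)  # one extra increment of inference compute
        steps += 1
    return y, steps


# Same input, two initial guesses: one far from the answer, one close to it.
y_far, steps_far = minimize_energy(9.0, y0=1.0)
y_near, steps_near = minimize_energy(9.0, y0=2.9)

print(round(y_far, 3), steps_far)
print(round(y_near, 3), steps_near)
```

Note that the energy only scores a candidate; it never states the answer directly. The prediction is found by descent, and the poorer initial guess consumes more steps before the score is satisfied, which is the sense in which compute is allocated per problem rather than fixed in advance.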

That distinction matters because of a cautionary note in the corpus: LLMs asked to perform iterative optimization in latent space mostly *don't*. They pattern-match memorized solution templates and emit plausible but wrong values, a failure that survives scaling (see 'Do large language models actually perform iterative optimization?'). Both recurrence and energy methods are, in effect, ways of *forcing* genuine iteration into a system that otherwise fakes it. Energy minimization is arguably the more honest version, because the gradient steps are real optimization against a measurable objective, not a learned shortcut that merely looks like optimization.

There's a third framing worth knowing: not all extra depth has to be spent on the current token. Some of this compute can go into *consolidation*, meaning recurrent passes that transform context into fast weights offline. On this view the long-context bottleneck is the compute needed to consolidate rather than memory capacity (see 'Is long-context bottleneck really about memory or compute?' and 'Can recurrence consolidate memory without predicting tokens?'). Latent-thought approaches likewise treat the depth budget as its own scaling axis, with fast inner-loop and slow outer-loop learning (see 'Can latent thought vectors scale language models beyond parameters?'). The unifying takeaway: 'computational depth' is becoming a resource you allocate, by looping, by optimizing, or by consolidating, rather than a number frozen into the architecture.

Sources 8 notes

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can recurrence consolidate memory without predicting tokens?

Language models can use recurrent passes without input tokens to transfer recent context into persistent fast weights via learned local rules, mirroring hippocampal replay during biological sleep. This separates consolidation from prediction, enabling different scheduling and compute allocation.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
