Can energy minimization replace reasoning-specific reinforcement learning for System 2 thinking?

This explores whether letting a model 'think' by minimizing an energy score over candidate answers (energy-based inference) can substitute for the reinforcement-learning recipes we currently use to train System 2 reasoning into models.

This explores whether energy minimization (scoring how well a prediction fits an input and using gradient descent at inference time to settle on the best fit) can stand in for the reinforcement-learning pipelines that usually instill deliberate, System 2 reasoning. The corpus's most direct answer is encouraging: Energy-Based Transformers learn System 2 thinking from plain unsupervised learning, with no reward signals, no verifiers, and no domain-specific scaffolding, while scaling faster during training and generalizing better out-of-distribution [Can energy minimization unlock reasoning without domain-specific training?]. The mechanism is appealingly general: instead of training a model to follow a learned reasoning protocol, you let inference itself become an optimization that 'thinks longer' by minimizing energy.
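To make the 'thinks longer' idea concrete, here is a minimal sketch of inference as energy descent. It is not the Energy-Based Transformer itself: `toy_energy` is a hypothetical hand-written quadratic standing in for a learned energy function, and `think`, `step_size`, and the target `2 * x` are assumptions chosen purely for illustration. All it shows is that running more descent steps on the candidate prediction drives the energy lower.

```python
from typing import List, Tuple

Vector = List[float]


def toy_energy(x: Vector, y: Vector) -> float:
    """Hypothetical stand-in for a learned energy: low when y is close to 2 * x."""
    return 0.5 * sum((yi - 2.0 * xi) ** 2 for xi, yi in zip(x, y))


def toy_energy_grad(x: Vector, y: Vector) -> Vector:
    """Analytic gradient of toy_energy with respect to the prediction y."""
    return [yi - 2.0 * xi for xi, yi in zip(x, y)]


def think(x: Vector, y_init: Vector, steps: int,
          step_size: float = 0.3) -> Tuple[Vector, float]:
    """Refine a candidate prediction by gradient descent on the energy.

    More steps means more inference-time compute spent on the same input.
    """
    y = list(y_init)
    for _ in range(steps):
        grad = toy_energy_grad(x, y)
        y = [yi - step_size * gi for yi, gi in zip(y, grad)]
    return y, toy_energy(x, y)


x = [1.0, -0.5]
_, energy_short = think(x, [0.0, 0.0], steps=2)
y_long, energy_long = think(x, [0.0, 0.0], steps=20)
print("energy after 2 steps:", energy_short)
print("energy after 20 steps:", energy_long)
```

In a real energy-based model the gradient would come from automatic differentiation through a trained network rather than a closed-form expression, and nothing in this toy speaks to whether the learned energy landscape supports multi-step reasoning.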

But whether this *replaces* RL depends on what you believe RL is actually doing, and here the corpus complicates the picture. A recurring finding is that reasoning is already latent in base models, and post-training merely *selects* or *elicits* it rather than creating it [Do base models already contain hidden reasoning ability?]. RL fits that framing: it updates only 5–30% of parameters, in stable, structured subnetworks [Does reinforcement learning update only a small fraction of parameters?], suggesting it's a surgical elicitation step, not wholesale capability-building. If reasoning is mostly latent, then energy minimization is just one of several elicitation routes, alongside training-free methods like modular cognitive tools, which lifted GPT-4.1 on competition math with no RL at all [Can modular cognitive tools unlock reasoning without training?], and activation steering that reshapes reasoning without retraining [Can we steer reasoning toward brevity without retraining?].
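The 5–30% figure is a statement about how many parameters differ between a checkpoint before RL and the same model after it. A minimal sketch of that measurement follows, assuming the weights have already been flattened into plain lists of floats; the function name `updated_fraction` and the `tol` threshold are hypothetical, and the source notes do not say how the original study defined "changed".

```python
from typing import Sequence


def updated_fraction(before: Sequence[float], after: Sequence[float],
                     tol: float = 0.0) -> float:
    """Fraction of parameters whose value moved by more than tol.

    Assumption: `before` and `after` are the same parameters in the same
    order, flattened from a pre-RL and a post-RL checkpoint.
    """
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    if not before:
        raise ValueError("checkpoints must not be empty")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy checkpoints: 10 parameters, 2 of which are modified by "training".
pre_rl = [0.1 * i for i in range(10)]
post_rl = list(pre_rl)
post_rl[3] += 0.5
post_rl[7] -= 0.25
print(updated_fraction(pre_rl, post_rl))
```

A count like this says nothing about whether the changed parameters form a stable subnetwork; that would need a comparison of which indices change across seeds.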

The catch is that not all of what RL contributes looks like something inference-time optimization can recover. One study finds non-reasoning models can't catch up to reasoning models no matter how much inference compute you throw at them, because training instills a *protocol* that makes the extra tokens productive in the first place [Can non-reasoning models catch up with more compute?]. That's a pointed challenge to the energy thesis: if the bottleneck is a learned deployment mechanism rather than raw search, then minimizing energy harder may hit the same ceiling. RL training also shows a two-phase structure, first mastering execution, then strategic planning [Does RL training follow a predictable two-phase learning sequence?], and it's unclear whether undirected energy descent discovers that planning layer on its own.
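The source note on the two-phase sequence describes it in terms of token entropy: planning-token entropy rises while execution-token entropy stabilizes. As a rough illustration of the quantity being tracked, the sketch below computes Shannon entropy per token distribution and averages it by role. The `planning` and `execution` labels, and the idea that each token arrives pre-labeled, are assumptions for the example; the notes do not say how tokens were classified.

```python
import math
from typing import Dict, List, Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(
    tokens: List[Tuple[str, Sequence[float]]]
) -> Dict[str, float]:
    """Average entropy per role, given (role, distribution) pairs.

    The role labels are hypothetical; any labeling scheme could be used.
    """
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for role, probs in tokens:
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


sample = [
    ("planning", [0.25, 0.25, 0.25, 0.25]),  # model is unsure which strategy to pick
    ("execution", [1.0, 0.0, 0.0, 0.0]),     # model is certain of the next step
    ("planning", [0.5, 0.5]),
]
print(mean_entropy_by_role(sample))
```

Tracking these two averages over training checkpoints is one way the reported divergence could be observed.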

There's also the question of what RL uniquely *teaches*. Several notes argue RL's value is in the richness of its signal, not merely in inducing search: natural-language critiques break plateaus that numerical rewards can't [Can natural language feedback overcome numerical reward plateaus?], RL embeds domain knowledge more durably than supervised fine-tuning by rewarding explanation quality [Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?], and simple accuracy rewards can act as an 'emergence engine' that grows complex domain reasoning [Can simple rewards alone teach complex domain reasoning?]. Energy minimization, being unsupervised, brings none of that task-specific pressure, so it may elicit *general* deliberation while still leaving domain competence to some form of feedback training.

The more interesting reframe the corpus offers is that 'replace' may be the wrong verb. Reasoning is increasingly something you can plant earlier (chain-of-thought baked into pretraining via information-gain rewards [Can chain-of-thought reasoning be learned during pretraining itself?]) or shape at test time by allocating compute across diverse abstractions rather than deeper single chains [Can abstractions guide exploration better than depth alone?]. Energy-based inference fits naturally into that test-time-compute family. The honest read: energy minimization is a genuine, RL-free path to System 2 *behavior* and a strong candidate for the elicitation role, but the corpus gives no evidence it absorbs the protocol-instilling and domain-shaping functions that make RL more than a search trigger. The likely future is layered, not substitutive.

Sources (12 notes)

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL updates only 5–30% of parameters, an intrinsic sparsity that arises without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
