TOPIC

Reasoning Architectures

35 synthesis notes · 60 source papers

Can tiny recursive networks outperform massive language models?

Does a small network that refines its reasoning through recursion on a latent state actually generalize better than billion-parameter LLMs on hard puzzles like ARC-AGI? What makes recursion more powerful than scale?

Does planning direction affect how hard problems become?

Planning research typically goes forward only. But some problems get easier when you work backward from the goal. What makes direction matter, and can language models exploit this?

Do base models already contain hidden reasoning ability?

Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.

Can modular cognitive tools unlock reasoning without training?

Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?

Does chain of thought reasoning actually explain model decisions?

When language models show their reasoning steps in agentic pipelines, does the quality of those steps predict or explain the quality of final outputs? This matters for trusting and debugging AI systems.

Can a single problem unlock reasoning through solution critique?

Does exposing models to diverse critiques of different solutions to one problem activate reasoning as effectively as training on many problems? This tests whether solution diversity matters more than problem diversity.

Can reasoning and tool execution be truly decoupled?

Can LLM reasoning be separated from tool observations to eliminate redundant re-prompting and enable parallel execution? Two recent architectures suggest yes, but what are the tradeoffs?

Can interleaving reasoning with real-world feedback prevent hallucination?

Does grounding language model reasoning in external world observations rather than internal associations help prevent error propagation and false outputs? This explores whether breaking the static chain-of-thought pattern can catch and correct mistakes in real time.

Can models reason without generating visible thinking tokens?

Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.

Can structured debate roles help small models detect ambiguity?

Small language models struggle to recognize when problems are underspecified. Can assigning explicit leader-follower roles in multi-agent debates overcome this limitation and boost ambiguity detection accuracy?

Do large language models actually perform iterative optimization?

Explores whether LLMs execute genuine numerical procedures like Newton-Raphson or instead pattern-match to memorized solution templates when solving constrained optimization problems.

Why do LLMs struggle with exploration in simple decision tasks?

This explores why large language models fail at exploration—a core decision-making capability—even when they excel at other tasks, and what specific conditions might help them succeed.

Do larger language models solve constrained optimization better?

Explores whether scaling LLMs—through more parameters, better training, or reasoning extensions—improves their ability to satisfy constraints in real optimization problems like power grids and portfolios.

How do looped transformer layers actually behave during inference?

When language models loop their layers to improve reasoning, do they discover new computations or repeat existing ones? Understanding the internal dynamics could explain why recurrent architectures outperform simple depth scaling.

Can stochastic latent reasoning help models explore multiple solutions?

This explores whether making recursive reasoning paths probabilistic rather than deterministic lets models maintain uncertainty and consider alternative hypotheses when problems admit multiple valid solutions.

Do fine-tuned language models actually learn optimization procedures?

Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.

Why do outcome-based reward models fail at intermediate step evaluation?

Outcome-based reward models (ORMs) evaluate only final results, creating a mismatch with the need to assess reasoning quality at intermediate steps. Understanding this failure mode matters for building better AI reasoning systems.

Which tokens in reasoning chains actually matter most?

Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.

Do reasoning cycles in hidden states reveal aha moments?

What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.

Do reasoning models actually beat standard models on optimization?

Explores whether extended chain-of-thought in reasoning models delivers performance gains on constraint-satisfaction problems like power-grid optimization. Matters because reasoning models are treated as automatic upgrades, but the evidence may not support that claim.

Can reasoning systems scale wider instead of only deeper?

Explores whether sampling multiple parallel latent trajectories offers a faster scaling path than recursive refinement alone. Matters because it could unlock latency-efficient reasoning at test time.

Can models reason without generating visible thinking steps?

Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.

Can curriculum learning approximate expensive process supervision?

Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at the cost of outcome supervision alone?

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? This matters for understanding where reasoning truly emerges.

When does RL actually extend reasoning beyond pretraining?

Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.

Why do RL agents stop asking informative questions?

RL-trained agents often fail to seek information effectively, despite being trained to do so. Understanding whether this reflects a capability gap or a training dynamics problem could reveal how to unlock better information-seeking behavior.

Does separating planning from execution improve reasoning accuracy?

Can modular LM architectures that split problem decomposition from solution execution outperform monolithic models? This explores whether decoupling these cognitive operations reduces interference and boosts performance.

Does supervised fine-tuning actually improve reasoning on optimization problems?

When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?

Can symbolic solvers fix how LLMs reason about logic?

LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers—using solver feedback to refine attempts—overcome this fundamental weakness?

Does chain-of-thought reasoning actually explain AI decisions?

Chain-of-thought is pitched as a transparency tool for agentic AI, but empirical evidence raises questions about whether reasoning chains actually predict or explain the system's outputs in practice.

Does adding randomness to recursive models actually help reasoning?

GRAM's ablations test whether stochasticity alone improves recursive architectures, or whether the gains depend on a specific training framework. This matters because it separates surface mechanisms from the methods that make them work.

Should LLMs handle abstraction only in optimization?

What if LLMs worked exclusively on translating problems to formal constraints, while deterministic solvers handled the numeric work? Explores whether this division of labor could overcome LLM failures in iterative computation.

Does RL post-training create reasoning or just deploy it?

Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.

Can backward reasoning during training improve forward reasoning?

Does training models to reason backward—generating inverse questions and solutions—build internal consistency checking that transfers to forward-only inference? This explores whether backward capacity that is internalized during training, but never used at test time, can enhance reasoning quality.

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This explores why isolated state-action pairs fail where full trajectories succeed.

Source papers 60

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.