How should test-time scaling methods be categorized and designed?

A navigation hub organizing test-time scaling architectures, training methods, and novel inference-time approaches beyond standard token scaling.

Topic Hub · 25 linked notes · 4 sections

View as

Architecture and Training

6 notes

How do internal and external test-time scaling compare?

Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.

Does policy entropy collapse limit reasoning performance in RL?

As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?

Do critique models improve diversity during training itself?

Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.

What makes test-time training actually work in practice?

Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.

Can models learn to internalize search algorithms through training?

Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.

Can architecture choices improve inference efficiency without sacrificing accuracy?

Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.

Novel Directions

10 notes

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

When should AI systems do their thinking?

Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.

Can storing evolved thoughts prevent inconsistent reasoning in conversations?

When LLMs repeatedly reason over the same conversation history for different questions, they produce inconsistent results. Can storing pre-reasoned thoughts instead of raw history solve this problem?

Can models improve themselves using only majority voting?

Explores whether test-time reinforcement learning can generate effective reward signals from unlabeled data by treating majority-voted answers as pseudo-labels, and whether this bootstrapping approach actually drives meaningful policy improvement.

Can models learn to evaluate their own work during training?

Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.

Can generative and discriminative models reach agreement?

Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

How can models select the most informative question to ask?

Explores whether simulating possible futures and scoring questions by information gain can identify which clarifying question would best reduce uncertainty—moving beyond just deciding whether to ask toward deciding what to ask.

Can models treat long prompts as external code environments?

Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?

Does prompt optimization without inference strategy fail?

Standard practice optimizes prompts and inference strategies separately. But do prompts optimized for single-shot evaluation actually perform worse when deployed at scale with aggregation methods like majority voting?

Width Scaling, Process Verification, and Thinking-as-Compression (2026-05-28)

4 notes

Can reasoning systems scale wider instead of only deeper?

Explores whether sampling multiple parallel latent trajectories offers a faster scaling path than recursive refinement alone. Matters because it could unlock latency-efficient reasoning at test time.

Can verifiers monitor reasoning without slowing generation down?

Explores whether asynchronous verification can catch reasoning errors while keeping token costs near parity with unmonitored reasoning. Matters because current approaches trade between catching early errors and computational overhead.

Can a reasoning model's thinking trace compress context effectively?

Does the raw reasoning trace produced by a thinking model naturally function as a context compressor without specialized training or modules? And how does this compare to dedicated compression methods?

Can thinking traces be made reliably budget-controllable?

Raw thinking traces compress well but ignore budget targets and take shortcuts. Can reward optimization make them controllable and useful for deployment?