LIMOPro: Reasoning Refinement for Efficient and Effective Test-time Scaling

Paper · arXiv 2505.19187 · Published May 25, 2025
Test-Time Compute · Inference-Time Scaling · Reasoning Critiques

Large language models (LLMs) have demonstrated remarkable reasoning capabilities through test-time scaling approaches, particularly when fine-tuned with chain-of-thought (CoT) data distilled from more powerful large reasoning models (LRMs). However, these reasoning chains often contain verbose elements that mirror human problem-solving, categorized as progressive reasoning (the essential solution development path) and functional elements (verification processes, alternative solution approaches, and error corrections). While progressive reasoning is crucial, the functional elements significantly increase computational demands during test-time inference. We introduce PIR (Perplexity-based Importance Refinement), a principled framework that quantitatively evaluates the importance of each reasoning step based on its impact on answer prediction confidence. PIR systematically identifies and selectively prunes only low-importance functional steps while preserving progressive reasoning components, creating optimized training data that maintains the integrity of the core solution path while reducing verbosity.
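The abstract describes PIR's scoring only in words, so the following is a minimal sketch under stated assumptions rather than the paper's implementation. It assumes that a step's importance is the ratio of the answer's perplexity with that step removed to its perplexity with the full chain, that each step already carries a "progressive" or "functional" label, and that functional steps scoring below a fixed threshold are pruned. The names `Step`, `PerplexityFn`, `importance_scores`, `prune_functional`, `toy_ppl`, and the threshold value 1.2 are all hypothetical. In practice `ppl` would wrap a language model; here a toy stand-in keeps the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Step:
    text: str
    kind: str  # "progressive" or "functional"; labels are assumed to be given


# Hypothetical scorer: (question, reasoning step texts, answer) -> answer perplexity.
PerplexityFn = Callable[[str, Sequence[str], str], float]


def importance_scores(question: str, steps: Sequence[Step], answer: str,
                      ppl: PerplexityFn) -> List[float]:
    """Assumed importance: answer perplexity without the step / with the full chain.

    A ratio near 1 means removing the step barely changes answer confidence.
    """
    texts = [s.text for s in steps]
    base = ppl(question, texts, answer)
    return [ppl(question, texts[:i] + texts[i + 1:], answer) / base
            for i in range(len(texts))]


def prune_functional(question: str, steps: Sequence[Step], answer: str,
                     ppl: PerplexityFn, threshold: float = 1.2) -> List[Step]:
    """Keep every progressive step; drop functional steps scoring below threshold."""
    scores = importance_scores(question, steps, answer, ppl)
    return [s for s, score in zip(steps, scores)
            if s.kind == "progressive" or score >= threshold]


def toy_ppl(question: str, steps: Sequence[str], answer: str) -> float:
    """Toy stand-in for a model: perplexity rises when informative steps are missing."""
    joined = " ".join(steps)
    value = 2.0
    if "2x = 8" not in joined:
        value += 3.0
    if "Divide by 2" not in joined:
        value += 4.0
    if "Check" not in joined:
        value += 1.0
    return value


QUESTION = "Solve 2x + 3 = 11."
ANSWER = "x = 4"
STEPS = [
    Step("Subtract 3 from both sides: 2x = 8.", "progressive"),
    Step("Wait, let me re-read the question.", "functional"),
    Step("Divide by 2: x = 4.", "progressive"),
    Step("Check: 2*4 + 3 = 11, so the result holds.", "functional"),
]

scores = importance_scores(QUESTION, STEPS, ANSWER, toy_ppl)
kept = prune_functional(QUESTION, STEPS, ANSWER, toy_ppl)
print(scores)                  # [2.5, 1.0, 3.0, 1.5]
print([s.text for s in kept])  # the re-reading step is dropped, the check is kept
```

In this toy run the re-reading step scores 1.0 and is pruned, while the verification step scores 1.5 and survives, which mirrors the abstract's claim that only low-importance functional steps are removed and the progressive path is left intact.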

Introduction. Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks through chain-of-thought (CoT) [33], where models generate step-by-step solutions to problems. Recent advances in test-time scaling [12, 26] can significantly enhance LLMs’ reasoning abilities by increasing the compute spent at test time. One approach to test-time scaling involves fine-tuning LLMs on high-quality reasoning data distilled from more powerful large reasoning models (LRMs) [20, 37]. LRMs like DeepSeek-R1 [10], OpenAI o1 [12], and QwQ [29] represent the state of the art in this paradigm, producing reasoning chains that lead to accurate solutions. However, this approach faces a significant challenge: reasoning chains distilled from LRMs often contain numerous functional elements that, while reflecting human problem-solving processes, can make outputs unnecessarily verbose [7, 32, 4].

Discussion / Conclusion. Contributions. This paper introduces PIR (Perplexity-based Importance Refinement), a novel framework that optimizes reasoning chains by quantitatively assessing step importance and selectively pruning low-value functional elements while preserving essential reasoning paths. Our comprehensive evaluation demonstrates that models fine-tuned on PIR-optimized datasets achieve both improved accuracy and significantly reduced token usage. By strategically balancing thorough problem-solving with computational efficiency, PIR establishes a principled approach for deploying advanced reasoning capabilities in latency-sensitive applications, opening new avenues for research on efficient reasoning in foundation models.

Limitations. While our approach demonstrates significant improvements, several limitations warrant further investigation. First, our evaluation primarily focuses on mathematical reasoning and science tasks; future work should validate PIR’s effectiveness across broader reasoning domains, including logical, commonsense, and causal reasoning.