Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?
With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers built on Large Language Models (LLMs). These methods typically employ an LLM to produce an explicit, step-by-step reasoning process before arriving at a final relevance prediction. But does reasoning actually improve reranking accuracy? In this paper, we examine this question by comparing reasoning-based pointwise rerankers (ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under identical training conditions, and observe that StandardRR generally outperforms ReasonRR. Building on this observation, we then study the importance of reasoning to ReasonRR by disabling its reasoning process (ReasonRR-NoReason), and find that ReasonRR-NoReason is, surprisingly, more effective than ReasonRR.
Introduction. Recently, there has been a surge of interest in reasoning models such as DeepSeek-R1 (Guo et al., 2025), OpenAI’s o3, Qwen3 (Yang et al., 2025), and others. By generating an explicit reasoning process, i.e., a chain-of-thought (CoT), prior to producing their final response, reasoning models have shown strong performance across a wide range of complex natural language tasks such as mathematics (Yang et al., 2024). Following the success of reasoning models, researchers in the Information Retrieval (IR) com-
Discussion / Conclusion. In this work, we study whether scaling test-time compute, via generation of reasoning tokens prior to making a relevance prediction, actually improves the accuracy of pointwise rerankers. To do so, we train and evaluate three pointwise rerankers: StandardRR, ReasonRR, and ReasonRR-NoReason. Through experiments across in-domain and out-of-domain datasets, we find that the reasoning process consistently harms the accuracy of pointwise rerankers, especially as LLM size increases. Investigating the root cause of this result, we observe that the reasoning process restricts ReasonRR’s ability to capture partial relevance between query-document pairs, which is an important factor for pointwise reranking accuracy. While we explored self-consistency as a potential remedy for this restriction, StandardRR still outperformed ReasonRR + Self-Consistency.
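To make the partial-relevance point concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a pointwise reranker that scores a passage by the softmax probability of a "true" label token over a "false" one, and it assumes self-consistency is computed by averaging scores over several sampled reasoning chains. The function names and all logit values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def relevance_score(logit_true: float, logit_false: float) -> float:
    """Two-way softmax over the label logits; returns P(relevant).

    Assumption: the reranker exposes logits for a 'true' and a 'false' token.
    """
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    return e_true / (e_true + e_false)


def self_consistency_score(sampled_scores: Sequence[float]) -> float:
    """Average the relevance scores obtained from several sampled reasoning chains."""
    if not sampled_scores:
        raise ValueError("need at least one sampled score")
    return sum(sampled_scores) / len(sampled_scores)


def rerank(scored: Sequence[Tuple[str, float]]) -> List[str]:
    """Return passage ids sorted by descending score (stable for ties)."""
    return [pid for pid, _ in sorted(scored, key=lambda x: x[1], reverse=True)]


# Hypothetical logits for three passages: highly, partially, and not relevant.
graded = [
    ("p_high", relevance_score(2.0, 0.0)),
    ("p_partial", relevance_score(0.3, 0.0)),
    ("p_none", relevance_score(-1.5, 0.0)),
]

# Hypothetical polarized logits: after committing to a conclusion in the
# reasoning text, the score saturates and the high/partial distinction vanishes.
polarized = [
    ("p_partial", relevance_score(30.0, -30.0)),
    ("p_high", relevance_score(30.0, -30.0)),
    ("p_none", relevance_score(-30.0, 30.0)),
]

# Averaging hard votes from sampled chains restores some gradation.
partial_votes = self_consistency_score([1.0, 0.0, 1.0, 1.0])

print(rerank(graded))
print(rerank(polarized))
print(partial_votes)
```

In this toy setting, the graded scores separate all three passages, while the polarized scores tie the highly and partially relevant passages, so their order depends only on input position. Averaging over sampled chains yields an intermediate value for the partially relevant passage, which is the intuition behind trying self-consistency as a remedy; as reported above, it did not close the gap to StandardRR.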