Competitive Programming with Large Reasoning Models

Paper · arXiv 2502.06807 · Published February 3, 2025
Chain-of-Thought and Reasoning Methods

We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models (OpenAI o1 and an early checkpoint of o3) with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a CodeForces rating on par with elite human competitors.

Introduction. Competitive programming is widely recognized as a challenging benchmark for evaluating reasoning and coding proficiency [2]. Solving complex algorithmic problems demands advanced computational thinking and problem-solving skills. Moreover, these problems are objectively gradable, making them an ideal testbed for assessing the reasoning capabilities of AI systems. Recent work on program synthesis with large language models [1] has demonstrated that even relatively general models, ranging from 244M to 137B parameters, can generate short Python scripts from natural language instructions. Importantly, performance improves log-linearly with model size, and fine-tuning significantly boosts accuracy. Concurrently, Codex [2], an early code-focused LLM, excelled at Python program generation and powered GitHub Copilot.

Discussion / Conclusion. Through the o-series large reasoning models, we demonstrate that chain-of-thought reasoning is a powerful strategy for improving performance in coding tasks, from competitive programming benchmarks such as CodeForces and IOI to complex software engineering challenges like SWE-bench and Astra. Our findings highlight that increasing reinforcement learning training compute, coupled with enhanced test-time compute, consistently boosts model performance to nearly match the best humans in the world. Given these results, we believe o-series large reasoning models will unlock many new use cases for AI in science, coding, math, and many other fields.