SYNTHESIS NOTE

Do fine-tuned language models actually learn optimization procedures?

Can RL fine-tuning teach LLMs to solve constraint-optimization problems through genuine reasoning, or does it merely sharpen pattern-matching? Testing on out-of-distribution variants reveals the mechanism.

Synthesis note · 2026-05-18 · sourced from Reasoning Architectures

The constraint-optimization study uses a clean diagnostic to separate procedure from pattern. It compares an N-case test set (in-distribution power-grid topologies) with an N-1 test set (the same problems with one element removed, which puts them out of distribution while keeping the structure recognizable). A model that executes the actual optimization procedure should perform comparably on both sets. A model that relies on pattern matching should perform noticeably worse on N-1.
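As a rough illustration of how an N-1 counterpart might be generated, the sketch below treats a grid topology as an undirected edge list and produces one variant per removed line. Several things here are assumptions rather than details from the study: that the removed "element" is a line (edge), that variants which disconnect the network are discarded, and the helper names `is_connected` and `n_minus_1_variants`, which are hypothetical.

```python
from collections import defaultdict


def is_connected(nodes, edges):
    """Return True if every node is reachable from the first node."""
    if not nodes:
        return True
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    start = nodes[0]
    seen = {start}
    stack = [start]
    # Iterative depth-first search over the undirected graph.
    while stack:
        current = stack.pop()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == set(nodes)


def n_minus_1_variants(nodes, edges):
    """Return (removed_edge, remaining_edges) for each single-edge removal.

    Assumption: variants that split the network are skipped, so every
    variant is still a well-posed problem of the same class.
    """
    variants = []
    for i, removed in enumerate(edges):
        remaining = edges[:i] + edges[i + 1:]
        if is_connected(nodes, remaining):
            variants.append((removed, remaining))
    return variants


# Toy 4-bus ring with one chord (illustrative topology, not from the study).
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
toy_variants = n_minus_1_variants(toy_nodes, toy_edges)
```

Each variant keeps the original problem class and most of its structure, which is what lets a gap between the two test sets be attributed to the perturbation rather than to a change of task.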

Even under GRPO and constraint-satisfaction-reward training, models degrade markedly on N-1. The conclusion is that RL on outcome-based rewards does not install the missing procedure; instead, it sharpens the template-matching strategy along the in-distribution axis. The model gets better at recognizing patterns it has seen and, in relative terms, worse at adapting to perturbed structure.

This is methodologically important because it provides a probe that other reasoning evaluations lack. Most benchmarks cannot distinguish "the model solved this" from "the model recognized this." The N / N-1 comparison forces the distinction by holding the problem class fixed while perturbing the instance. The drop is the memorization signature.
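A minimal sketch of how the N / N-1 comparison could be scored, assuming each test item has already been graded as solved or not solved. The function names, the choice to report both an absolute and a relative drop, and the example outcome lists are illustrative assumptions; the numbers are placeholders, not results from the study.

```python
def accuracy(outcomes):
    """Fraction of items graded as solved (outcomes is a list of booleans)."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def structure_shift_gap(n_outcomes, n1_outcomes):
    """Compare accuracy on the canonical (N) set and the perturbed (N-1) set."""
    acc_n = accuracy(n_outcomes)
    acc_n1 = accuracy(n1_outcomes)
    abs_drop = acc_n - acc_n1
    # Relative drop is undefined when nothing is solved on N; report 0.0 then.
    rel_drop = abs_drop / acc_n if acc_n > 0 else 0.0
    return {
        "acc_n": acc_n,
        "acc_n1": acc_n1,
        "abs_drop": abs_drop,
        "rel_drop": rel_drop,
    }


# Placeholder outcomes for illustration only.
example_n = [True] * 8 + [False] * 2
example_n1 = [True] * 4 + [False] * 6
example_gap = structure_shift_gap(example_n, example_n1)
```

Reporting the relative drop alongside the absolute one makes models with different in-distribution accuracy easier to compare, since a fixed absolute drop means more for a weaker model.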

For practitioners, the diagnostic generalizes. Wherever a deployment cares whether a model is computing or recalling — clinical reasoning, legal-statute reasoning, scientific problem-solving — building an "N-1" counterpart of the canonical test set is a cheap way to surface memorization. The structure-shift probe is more informative than headline accuracy on the canonical set.

Original note title

N-1 out-of-distribution tests reveal that RL fine-tuned LLMs still rely on memorization for optimization problems