SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Does supervised fine-tuning actually improve reasoning on optimization problems?

When SFT boosts benchmark scores on constraint-optimization tasks, does it genuinely improve the model's ability to find feasible solutions, or just its ability to format answers convincingly?

Synthesis note · 2026-05-18 · sourced from Reasoning Architectures

The constraint-optimization study runs a controlled comparison between SFT and RL (with constraint-satisfaction rewards) on the same problem class. The SFT result is the diagnostic of interest: SFT clearly improves the form of the answer (JSON structure, decimal places, valid identifiers, expected sections) without improving its feasibility against the actual physical constraints. The model learns to look like it is solving the problem.

This is the formatting-vs-feasibility gap, and it is a specific instance of a more general SFT failure mode. SFT trains the model to reproduce the surface features of correct demonstrations, but the surface features of a feasible solution and those of a confidently wrong solution are nearly identical. SFT optimizes the loss on the visible tokens, not on whether those tokens encode a valid physical state. The result is fluently presented infeasibility.
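
A minimal Python sketch of the gap, under stated assumptions: the two-generator dispatch problem, the names `CAPACITY`, `DEMAND`, `is_well_formed`, and `is_feasible`, and both sample answers are hypothetical illustrations, not taken from the study. It shows two answers that a surface-level check cannot tell apart, while a check against the constraints can.

```python
import json

# Hypothetical toy problem (an assumption for illustration only):
# dispatch two generators so that output meets demand within capacity.
CAPACITY = {"gen_a": 50.0, "gen_b": 30.0}  # assumed per-unit limits
DEMAND = 70.0                              # assumed total demand


def is_well_formed(answer_text: str) -> bool:
    """Surface check: valid JSON, expected keys, numeric values."""
    try:
        data = json.loads(answer_text)
    except json.JSONDecodeError:
        return False
    dispatch = data.get("dispatch") if isinstance(data, dict) else None
    if not isinstance(dispatch, dict) or set(dispatch) != set(CAPACITY):
        return False
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in dispatch.values()
    )


def is_feasible(answer_text: str, tol: float = 1e-6) -> bool:
    """Substance check: the values must satisfy the actual constraints."""
    if not is_well_formed(answer_text):
        return False
    dispatch = json.loads(answer_text)["dispatch"]
    within_limits = all(
        0.0 <= dispatch[g] <= CAPACITY[g] + tol for g in CAPACITY
    )
    meets_demand = abs(sum(dispatch.values()) - DEMAND) <= tol
    return within_limits and meets_demand


# Same structure, same precision, same identifiers; only one is valid.
feasible_answer = '{"dispatch": {"gen_a": 45.00, "gen_b": 25.00}}'
infeasible_answer = '{"dispatch": {"gen_a": 30.00, "gen_b": 40.00}}'

for name, text in [("feasible", feasible_answer), ("infeasible", infeasible_answer)]:
    print(name, "well_formed:", is_well_formed(text), "feasible:", is_feasible(text))
```

Both answers pass the surface check; the second fails only because `gen_b` exceeds its assumed capacity, which no token-level formatting property reveals.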

RL with feasibility-targeted rewards modestly improves actual feasibility, because the reward signal directly penalizes the constraint violations that SFT could not see. This is a real but limited gain: it does not break the 55-60% plateau, but it disambiguates which kind of failure SFT was leaving uncorrected.
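
One way such a reward could look, as a hedged sketch rather than the study's actual reward: the function names, the toy dispatch constraints, and the scaling by demand are all assumptions made for illustration. The point is only that the signal depends on constraint violation, which a token-level loss never observes.

```python
# Hypothetical toy constraints (assumed for illustration, not from the study).
CAPACITY = {"gen_a": 50.0, "gen_b": 30.0}
DEMAND = 70.0


def constraint_violation(dispatch, capacity, demand):
    """Total magnitude by which a candidate dispatch breaks the constraints."""
    over = sum(max(0.0, dispatch[g] - capacity[g]) for g in capacity)
    negative = sum(max(0.0, -dispatch[g]) for g in capacity)
    imbalance = abs(sum(dispatch[g] for g in capacity) - demand)
    return over + negative + imbalance


def feasibility_reward(dispatch, capacity, demand, tol=1e-6):
    """Assumed reward shape: 1.0 when feasible, otherwise a penalty
    proportional to the violation, scaled by demand."""
    violation = constraint_violation(dispatch, capacity, demand)
    if violation <= tol:
        return 1.0
    return -violation / demand


print(feasibility_reward({"gen_a": 45.0, "gen_b": 25.0}, CAPACITY, DEMAND))
print(feasibility_reward({"gen_a": 30.0, "gen_b": 40.0}, CAPACITY, DEMAND))
```

Grading the penalty by violation size, rather than using a flat pass/fail, is a design choice in this sketch: it gives near-feasible answers a better score than badly infeasible ones.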

The methodological implication for fine-tuning practice: when the desired behavior involves correctness in a dimension the loss does not measure, SFT improvements should be treated with suspicion. A clean rise in benchmark score where the benchmark scores presentation rather than substance can simply mean the model has gotten better at looking right.
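
In practice this suggests reporting presentation and substance as separate metrics. The sketch below assumes each model output has already been graded on both dimensions; the `Graded` type, the `summarize` helper, and the four sample grades are hypothetical, and the numbers are made up for illustration, not results from the study.

```python
from dataclasses import dataclass


@dataclass
class Graded:
    """One model output, graded on two independent dimensions."""
    well_formed: bool  # passes the presentation checks
    feasible: bool     # satisfies the actual constraints


def summarize(results):
    """Report format rate, feasibility rate, and the gap between them."""
    if not results:
        raise ValueError("no graded outputs to summarize")
    n = len(results)
    format_rate = sum(r.well_formed for r in results) / n
    # An answer counts as feasible only if it is also parseable.
    feasibility_rate = sum(r.well_formed and r.feasible for r in results) / n
    return {
        "format_rate": format_rate,
        "feasibility_rate": feasibility_rate,
        "gap": format_rate - feasibility_rate,
    }


# Made-up grades: every answer looks right, half of them are right.
graded = [
    Graded(well_formed=True, feasible=True),
    Graded(well_formed=True, feasible=False),
    Graded(well_formed=True, feasible=True),
    Graded(well_formed=True, feasible=False),
]
print(summarize(graded))
```

A large gap after fine-tuning is the signature described above: the model has improved at looking right without improving at being right.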

Original note title

SFT improves response formatting but not physical feasibility — formatting wins mask reasoning shortcuts