SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can models learn to evaluate their own work during training?

Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.

Synthesis note · 2026-02-23 · sourced from Novel Architectures

Current training paradigms stop learning at the end-of-sequence token, leaving the sequence space after the model's output unused. Post-Completion Learning (PCL) systematically exploits this neglected space. A temporary termination marker (<-- post-completion -->) opens a "post-thinking" space in which the model continues generating self-assessments and reward predictions during training. At inference, generation stops at the marker, so deployment incurs no additional cost.
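To make the marker mechanism concrete, here is a minimal sketch of how a training target and an inference output might differ. It operates on plain strings rather than token IDs. The marker string is copied from this note, while the EOS placeholder, the function names, and the example self-assessment text are illustrative assumptions, not details from the PCL paper.

```python
# Marker string as written in this note; the tokenizer-level form may differ.
POST_COMPLETION_MARKER = "<-- post-completion -->"
EOS = "<eos>"  # placeholder end-of-sequence string (assumption)


def build_training_text(answer: str, self_assessment: str) -> str:
    """Training target: the answer, the marker, the post-completion segment, then EOS."""
    return f"{answer}{POST_COMPLETION_MARKER}{self_assessment}{EOS}"


def truncate_at_marker(generated: str) -> str:
    """Inference-time view: keep only the text that precedes the marker."""
    head, _, _ = generated.partition(POST_COMPLETION_MARKER)
    return head


train_text = build_training_text(
    "x = 4", " evaluation: correct; predicted reward: 1.0"
)
print(truncate_at_marker(train_text))  # prints: x = 4
```

In a real decoder the marker would more plausibly be registered as a stop sequence, so the post-completion tokens are never generated rather than generated and discarded. That is what would keep the inference cost unchanged.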

The core innovation is white-box reinforcement learning: the model explicitly learns to understand and compute reward functions, internalizing the reward model as its own evaluation capability. This transforms the model from "passive reward acceptance" (external reward signal tells it what's good) to "active self-evaluation" (it learns to compute quality assessments itself).
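One way to picture "active self-evaluation" is a training signal that compares the reward the model predicts for its own answer with the reward an explicit, inspectable function computes. The sketch below assumes a hypothetical exact-match reward and a hypothetical consistency score; neither is taken from the PCL paper, and the actual reward functions may be quite different.

```python
def reference_reward(answer: str, gold: str) -> float:
    """Hypothetical rule-based reward: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0


def consistency_reward(predicted: float, actual: float) -> float:
    """Score in [0, 1] for how closely the self-predicted reward matches the computed one."""
    return 1.0 - min(1.0, abs(predicted - actual))


actual = reference_reward("4 ", "4")  # reward computed by the explicit rule
self_eval_score = consistency_reward(predicted=0.75, actual=actual)
```

Because the reward rule is explicit rather than a learned black box, the model's prediction can be checked against it directly, which is the sense in which this sketch is "white-box".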

Implementation uses dual-track SFT: one track optimizes reasoning, the other optimizes evaluation capability. These tracks are mixed with RL training for multi-objective hybrid optimization. The model learns both to solve problems and to assess its own solutions, but only the problem-solving capability is active during inference. Self-evaluation is internalized during training, where it shapes the model's generation, so no explicit self-assessment is needed at inference time.
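The dual-track objective can be sketched as a weighted sum of two per-segment losses: one over tokens before the marker (reasoning) and one over tokens after it (evaluation). The note does not say how the tracks are weighted or combined with the RL term, so the `eval_weight` parameter, the function name, and the per-token loss values below are assumptions for illustration only.

```python
from typing import Sequence


def dual_track_loss(
    token_nll: Sequence[float],
    is_evaluation: Sequence[bool],
    eval_weight: float = 0.5,
) -> float:
    """Combine mean per-token losses of the reasoning and evaluation segments.

    token_nll: negative log-likelihood of each target token.
    is_evaluation: True for tokens after the post-completion marker.
    eval_weight: hypothetical mixing weight for the evaluation track.
    """
    if len(token_nll) != len(is_evaluation):
        raise ValueError("token_nll and is_evaluation must have equal length")

    reasoning = [nll for nll, ev in zip(token_nll, is_evaluation) if not ev]
    evaluation = [nll for nll, ev in zip(token_nll, is_evaluation) if ev]

    def mean(values: Sequence[float]) -> float:
        # An empty segment contributes nothing to the loss.
        return sum(values) / len(values) if values else 0.0

    return (1.0 - eval_weight) * mean(reasoning) + eval_weight * mean(evaluation)


# Two reasoning tokens followed by two evaluation tokens.
loss = dual_track_loss([1.0, 3.0, 2.0, 4.0], [False, False, True, True])
```

Averaging each segment separately keeps a long self-assessment from dominating a short answer; whether PCL normalizes this way is not stated in the note.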

This addresses three limitations simultaneously:

  1. SFT's passive learning — models learn to mimic demonstrations without developing self-assessment ability
  2. RL's external dependency — reward models are opaque external components; PCL internalizes the evaluation
  3. Self-correction's inference cost — methods like Self-Refine require additional generation passes; PCL's self-evaluation is absorbed into training

The parallel with human cognition is direct: "Humans, after completing a task, often engage in self-reflection and quality assessment — this post-thinking process is crucial for improving future performance." PCL operationalizes this for LLMs.

This connects to "What limits how much models can improve themselves?": PCL attempts to close the generator-verifier gap by training the verifier and generator as the same model, with the verification capability internalized rather than external. It also complements "Does reflection in reasoning models actually correct errors?": PCL's self-evaluation is trained against ground-truth reward functions, not against the model's own prior outputs, which may avoid the confirmatory pattern that note describes.

Original note title

post-completion learning uses the ignored post-eos space to internalize self-evaluation during training with zero inference cost