Can online LLM feedback improve direct preference optimization during training?
Direct alignment methods like DPO train on fixed preference data, often generated by older or different models, which makes training off-policy. Could sampling fresh responses from the current model and having an LLM judge pick the preferred one in real time reduce overfitting and improve alignment?
Direct alignment from preferences (DAP) methods such as DPO, IPO, and SLiC are attractive because they skip the separate reward model and update the policy directly from pairwise preferences. But their preference datasets are collected ahead of training and never updated, and the responses usually come from a different model, so as the policy evolves the training data drifts off-policy and the method becomes prone to overfitting. Online AI Feedback (OAIF) offers a simple fix: on each training iteration, sample two responses from the current model and prompt an LLM annotator to pick the preferred one, which supplies online feedback. Despite its simplicity, human evaluation shows OAIF beats both offline DAP and RLHF, and it mitigates reward over-optimization, the overfitting that plagues offline DAP.
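The per-iteration loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the callables `sample`, `judge`, `policy_logp`, and `ref_logp` are hypothetical stand-ins for the current policy's sampler, the LLM annotator, and the sequence log-probabilities under the policy and a frozen reference model. The loss is the standard DPO objective; `beta=0.1` is an assumed default, and a real trainer would compute this on tensors and backpropagate through it.

```python
import math
from typing import Callable, Tuple

Sampler = Callable[[str], str]             # prompt -> response from the CURRENT policy
Judge = Callable[[str, str, str], bool]    # (prompt, y1, y2) -> True if y1 is preferred
LogProb = Callable[[str, str], float]      # (prompt, response) -> sequence log-probability


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def online_preference_pair(prompt: str, sample: Sampler,
                           judge: Judge) -> Tuple[str, str]:
    """Sample two on-policy responses and let the judge order them."""
    y1, y2 = sample(prompt), sample(prompt)
    return (y1, y2) if judge(prompt, y1, y2) else (y2, y1)


def oaif_step_loss(prompt: str, sample: Sampler, judge: Judge,
                   policy_logp: LogProb, ref_logp: LogProb,
                   beta: float = 0.1) -> float:
    """One online step: fresh pair, fresh label, then the usual DPO loss."""
    chosen, rejected = online_preference_pair(prompt, sample, judge)
    return dpo_loss(policy_logp(prompt, chosen), policy_logp(prompt, rejected),
                    ref_logp(prompt, chosen), ref_logp(prompt, rejected), beta)
```

The only difference from offline DPO in this sketch is where the pair comes from: `online_preference_pair` replaces a lookup into a fixed dataset, so the responses being scored always come from the model being updated.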
Two keepers. First, the online vs offline distinction matters more than the choice among DAP variants: OAIF improves DPO, IPO, and SLiC alike, isolating on-policy feedback as the lever. Second, the AI annotator's feedback is controllable via instruction prompts — you can steer the alignment target by changing how you ask the judge to choose.
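To make the second point concrete, here is a small sketch of how an instruction prompt could parameterize the judge. The template wording, the function names `build_judge_prompt` and `parse_verdict`, and the single-letter answer format are all assumptions for illustration; the note does not record the exact prompts used.

```python
from typing import Optional


def build_judge_prompt(criterion: str, prompt: str,
                       response_a: str, response_b: str) -> str:
    """Assemble the annotator prompt; `criterion` is the steering instruction."""
    return (
        f"{criterion}\n\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Answer with a single letter, A or B."
    )


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True if A is preferred, False if B, None if the reply is unusable."""
    verdict = judge_output.strip().rstrip(".").upper()
    if verdict == "A":
        return True
    if verdict == "B":
        return False
    return None  # hypothetical policy: skip the pair rather than guess


# Same responses, two different alignment targets: only the criterion changes.
helpful = build_judge_prompt("Prefer the more helpful response.", "q", "x", "y")
concise = build_judge_prompt("Prefer the shorter response.", "q", "x", "y")
```

Returning `None` for unparseable replies is a design choice in this sketch: dropping a pair costs one training example, while guessing would inject a noisy label.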
This connects the vault's alignment-method thread to the LLM-as-judge thread. The controllable AI annotator inherits the risks documented in Can LLM judges be fooled by fake credentials and formatting? — an online judge that is biased steers the policy toward those biases — and the on-policy framing rhymes with Can agents learn from failure without updating their weights? in treating fresh, current-model feedback as the signal that matters.
Related concepts in this collection 2
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  OAIF's controllable AI annotator inherits LLM-judge biases that then steer the policy
- Do unimodal reward models actually serve all user preferences?
  Standard RLHF assumes a single utility function across all users, but what happens when preferences genuinely conflict? Does averaging these opposing preferences into one model systematically fail certain groups?
  adjacent alignment-method concern: OAIF is on-policy single-judge; diverse-preference work questions the single-judge assumption
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Bridging Offline and Online Reinforcement Learning for LLMs
- Direct Language Model Alignment from Online AI Feedback
- Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks
- Self-Improving Model Steering
- SimPO: Simple Preference Optimization with a Reference-Free Reward
- Inverse-Q*: Token Level Reinforcement Learning for Aligning Large Language Models Without Preference Data
- RLHF Workflow: From Reward Modeling to Online RLHF
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
Original note title
online AI feedback makes direct preference optimization on-policy — sampling from the current model and judging with an LLM beats offline DPO and RLHF