Can preference trees structure alignment data for domains beyond math and code?
This explores whether the preference-tree data structure — branching trees of reasoning chains, critiques, and correct/incorrect pairs used to align reasoning models — can carry over into domains where there's no clean right answer, like writing or open-ended judgment.
The corpus suggests the format's power is tied to exactly what math and code provide for free: a verifiable correctness signal. The original result (What alignment data structure best trains reasoning generalists?) reached state-of-the-art reasoning among open models by organizing each instruction as a tree of diverse planning strategies, critique trajectories, and pairwise comparisons. What makes that tree trainable is that every branch can be scored as correct or incorrect. The same logic shows up in function calling (Can small models match large models on function calling?), where DPO's explicit negative examples beat plain supervised fine-tuning precisely because there is an objective format to be right or wrong about. Trees thrive wherever you can mechanically tell good branches from bad ones.
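To make "every branch can be scored" concrete, here is a minimal sketch of a preference tree whose leaves are turned into chosen/rejected pairs by a verifier. The `Node` class, the `preference_pairs` helper, and the toy arithmetic verifier are all hypothetical names invented for illustration; they are not the format used by the dataset described in the sources.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, List, Tuple

Path = Tuple[str, ...]


@dataclass
class Node:
    """One step in the tree; children are alternative continuations."""
    text: str
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node, prefix: Path = ()) -> List[Path]:
    """Return every root-to-leaf path as a tuple of step texts."""
    path = prefix + (node.text,)
    if not node.children:
        return [path]
    out: List[Path] = []
    for child in node.children:
        out.extend(leaves(child, path))
    return out


def preference_pairs(root: Node, is_correct: Callable[[Path], bool]) -> List[Tuple[Path, Path]]:
    """Pair every verified-correct path (chosen) with every incorrect one (rejected)."""
    paths = leaves(root)
    good = [p for p in paths if is_correct(p)]
    bad = [p for p in paths if not is_correct(p)]
    return list(product(good, bad))


# Toy instruction with two planning strategies; one branch makes an arithmetic slip.
tree = Node("What is 17 * 3?", [
    Node("Plan: split 17 into 10 + 7", [Node("30 + 21 = 51")]),
    Node("Plan: round 17 up to 20", [Node("60 - 9 = 51"), Node("60 - 6 = 54")]),
])

# Hypothetical verifier: the final step must end with the known answer.
verifier = lambda path: path[-1].endswith("= 51")

pairs = preference_pairs(tree, verifier)
for chosen, rejected in pairs:
    print("chosen:", chosen[-1], "| rejected:", rejected[-1])
```

The whole pipeline hinges on `verifier`. Swap in a domain where no such function exists and `good` and `bad` can no longer be separated mechanically, which is the failure the rest of this note is about.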
Move into subjective domains and that scoring step quietly breaks. In AI writing assistance (Can user preference guide AI writing tool alignment?), writers preferred AI rewrites 63% of the time yet objected to the persona distortions baked into those same rewrites: polish and distortion turned out to be entangled at the model level. A preference tree built on that signal wouldn't just fail to help; it would faithfully encode the distortion as the 'winning' branch. The problem runs deeper than any one domain: annotation responses themselves (Do all annotation responses measure the same underlying thing?) decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, and treating them uniformly contaminates the reward signal. In math, a label is a fact; in writing, a 'preference' label may be three different things wearing the same coat.
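One way to act on that decomposition is to collect each comparison several times under different measurement conditions and keep only the labels that stay put. The sketch below is an assumption-laden illustration: the `stable_labels` helper, the 0.8 threshold, and the example data are hypothetical, and the sketch only separates stable from unstable labels rather than telling non-attitudes apart from constructed preferences.

```python
from collections import Counter
from typing import Dict, List


def consistency(choices: List[str]) -> float:
    """Fraction of repeated judgments that agree with the most common choice."""
    counts = Counter(choices)
    return counts.most_common(1)[0][1] / len(choices)


def stable_labels(repeats: Dict[str, List[str]], threshold: float = 0.8) -> Dict[str, str]:
    """Keep a pair only if its label is consistent across conditions.

    `threshold` is an arbitrary illustrative cutoff, not a value from the sources.
    """
    kept: Dict[str, str] = {}
    for pair_id, choices in repeats.items():
        if choices and consistency(choices) >= threshold:
            kept[pair_id] = Counter(choices).most_common(1)[0][0]
    return kept


# Each list holds one annotator's choice for the same pair under four
# hypothetical conditions (for example reordered or reworded prompts).
repeats = {
    "pair-1": ["A", "A", "A", "A"],  # same choice every time
    "pair-2": ["A", "B", "A", "B"],  # flips with presentation
    "pair-3": ["B", "B", "B", "A"],  # mostly stable, but below the cutoff
}

kept = stable_labels(repeats)
print(kept)
```

Only the labels that survive this filter would be hung on the tree as chosen/rejected pairs; the rest are treated as noise rather than as evidence about quality.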
So the honest answer is: the tree *structure* transfers fine, but the *labels that fill it* don't. The bottleneck isn't the data format; it's whether the domain hands you a trustworthy comparison. That reframes the real question as where reliable pairwise judgments come from. One route is scale: crowdsourced pairwise voting (Can crowdsourced votes reliably rank language models?) produces credible rankings on diverse open-ended prompts when the questions are discriminating enough and the crowd agrees with experts, which suggests some non-verifiable domains can still yield clean preference pairs at volume.
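A common way to turn many pairwise votes into a ranking is a Bradley-Terry model. The sketch below fits one with a simple iterative update; treating the crowdsourced setting this way is an assumption made for illustration, since the source note does not say which ranking model was used, and the vote data here is invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bradley_terry(votes: List[Tuple[str, str]], iters: int = 200) -> Dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Returns strengths normalized to sum to 1. Assumes every item wins at
    least once; otherwise its strength collapses toward zero.
    """
    wins: Dict[str, int] = defaultdict(int)
    games: Dict[frozenset, int] = defaultdict(int)
    items = set()
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {m: 1.0 / len(items) for m in items}
    for _ in range(iters):
        new = {}
        for i in items:
            # Sum over opponents j of (games between i and j) / (s_i + s_j).
            denom = sum(
                n / (strength[i] + strength[j])
                for pair, n in games.items() if i in pair
                for j in pair - {i}
            )
            new[i] = wins[i] / denom
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}
    return strength


# Invented votes: each pair of models meets four times.
votes = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)

strengths = bradley_terry(votes)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Nothing in this fit checks whether the voters were right. It only aggregates agreement, which is why the conditions in the paragraph above (discriminating questions, crowd agreement with experts) carry the weight.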
The more interesting move is to abandon preference labels entirely. SAMI (Can models learn behavioral principles without preference labels?) aligns models to written principles by maximizing the mutual information between a constitution and the response: no preference pairs, no demonstrations, and a weaker model can even author principles to align a stronger one. For domains where 'better' is contested, structuring alignment data around *principles* rather than *winners* may be the version of a tree that survives the trip out of math and code. And it pairs naturally with the curation lesson (Can careful curation replace massive alignment datasets?): if post-training mostly activates capabilities the model already has, a small, carefully built tree of principle-grounded examples may beat a massive tree of noisy preference labels in any domain where the labels can't be trusted.
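To see how a mutual-information objective can work without any winner labels, consider a contrastive stand-in: score every response under every constitution and reward the matrix for being diagonal. The sketch below uses a symmetric InfoNCE-style loss, which is an assumption chosen for illustration and not something the source note specifies; the log-probability values are invented.

```python
import math
from typing import List


def _log_softmax(values: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def contrastive_loss(logp: List[List[float]]) -> float:
    """Symmetric InfoNCE-style loss over a square score matrix.

    logp[i][j] is a (hypothetical) log-probability of response j given
    constitution i. Response i was generated under constitution i, so the
    diagonal entries are the matched pairs.
    """
    n = len(logp)
    rows = [_log_softmax(row) for row in logp]
    cols = [_log_softmax([logp[i][j] for i in range(n)]) for j in range(n)]
    row_term = -sum(rows[i][i] for i in range(n)) / n  # pick the right response
    col_term = -sum(cols[j][j] for j in range(n)) / n  # pick the right constitution
    return 0.5 * (row_term + col_term)


# Invented scores for two constitutions and two responses.
uninformative = [[-3.0, -3.0], [-3.0, -3.0]]  # responses ignore the constitution
matched = [[-1.0, -5.0], [-5.0, -1.0]]        # each response fits its own constitution

print(contrastive_loss(uninformative))
print(contrastive_loss(matched))
```

The loss is lowest when each response is recognizably the product of its own constitution, so the training signal comes from the principles themselves rather than from a judgment about which response is better.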
The thing you might not have known you wanted to know: preference trees aren't really a data structure for *preferences* — they're a data structure for *verifiable disagreement*. Where you can't verify, the question stops being 'how do we shape the tree' and becomes 'what do we hang on it instead of preference.'
Sources (7 notes)
Eurus achieved state-of-the-art open-model reasoning by training on ULTRAINTERACT, an alignment dataset structured as preference trees per instruction. The tree format unified diverse planning strategies, interaction-and-critique trajectories, and pairwise data for both SFT and preference learning.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Writers prefer AI rewrites 63% of the time but object to systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level—preference optimization produces both simultaneously.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters—validating human preference as a scalable evaluation signal.
SAMI fine-tunes language models to increase mutual information between constitutions and responses without preference labels or demonstrations. A Mistral-7B model trained this way outperformed base and instruction-tuned baselines, and surprisingly, a weaker model could write principles to align a stronger one.
LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.