Can reward-guided decoding replace weight fine-tuning for personalized alignment?
This explores whether you can personalize a model's behavior at decoding time — steering its outputs with a reward signal — instead of retraining its weights for each user, and what the corpus says about the trade-offs.
The corpus suggests that reward-guided decoding can go surprisingly far as a stand-in for weight fine-tuning in personalized alignment, with real advantages, but the two approaches end up doing different jobs. The strongest evidence for replacement comes from proxy-tuning, which shifts a model's output distribution at decoding time and closes 88–91% of the alignment gap while leaving base weights frozen (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). The interesting twist is *why* the frozen-weight route is attractive: direct fine-tuning corrupts knowledge stored in the lower layers, while decoding-time steering touches mostly reasoning and style. So decoding-time tuning is not just cheaper; it can actually preserve what the model knows better than retraining does.
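To make the decoding-time shift concrete, here is a minimal sketch of the logit arithmetic usually associated with proxy-tuning: the frozen base model's logits are offset by the difference between a small tuned model and its untuned counterpart. The three-token vocabulary, all numeric values, and the `alpha` strength knob are assumptions made for illustration, not figures or parameters taken from the notes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits, alpha=1.0):
    """Shift frozen base logits by alpha * (tuned small model - untuned small model).

    `alpha` is a hypothetical strength knob; alpha=0 recovers the base model.
    """
    shifted = [
        b + alpha * (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)

# Toy next-token logits over a 3-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]    # large frozen model
expert = [0.0, 2.0, 0.0]  # small tuned model
anti = [1.0, 0.0, 0.0]    # the same small model before tuning

probs = proxy_tuned_probs(base, expert, anti)
```

In this toy case the base model prefers token 0, while the shifted distribution prefers token 1; no weight of the large model is touched at any point.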
For the *personalized* part specifically, the most direct answer is PReF, which represents each user's preferences as a small set of coefficients over a fixed library of base reward functions (Can user preferences be learned from just ten questions?). Ten well-chosen questions are enough to locate a new person in that preference space, and the model is then aligned to them at inference time, with no per-user weight update at all. This is the clearest existence proof that 'personalized alignment without fine-tuning' is a real thing and not just a slogan. A complementary route conditions a shared reward model on a learned text summary of the user, which turns out to capture preference dimensions that embeddings miss and even transfers to an off-the-shelf model like GPT-4 for zero-shot personalization (Can text summaries beat embeddings for personalized reward models?). Together these suggest that personalization can live in a lightweight, swappable signal rather than in the weights.
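The coefficient idea can be sketched as follows. This is not PReF's actual algorithm: it assumes a user's reward is a weighted sum of base reward scores and fits the weights from a few pairwise answers with a plain Bradley-Terry style logistic update. The two base rewards (conciseness, formality), the comparison data, and the function names are hypothetical, and the active selection of informative questions is left out.

```python
import math

def user_reward(coeffs, base_scores):
    """Personalized reward: weighted sum of base reward scores."""
    return sum(w * s for w, s in zip(coeffs, base_scores))

def fit_coeffs(comparisons, n_base, lr=0.5, steps=500):
    """Fit per-user coefficients from (preferred_scores, rejected_scores) pairs.

    Gradient ascent on the logistic log-likelihood that the preferred
    response has the higher personalized reward.
    """
    w = [0.0] * n_base
    for _ in range(steps):
        grad = [0.0] * n_base
        for win, lose in comparisons:
            diff = [a - b for a, b in zip(win, lose)]
            margin = sum(wi * d for wi, d in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-margin))
            for i, d in enumerate(diff):
                grad[i] += (1.0 - p) * d
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w

# Each response is scored by two base rewards: (conciseness, formality).
# This made-up user consistently picks the more concise response.
comparisons = [
    ([0.9, 0.2], [0.1, 0.8]),
    ([0.7, 0.5], [0.3, 0.5]),
    ([0.8, 0.1], [0.2, 0.3]),
]
coeffs = fit_coeffs(comparisons, n_base=2)

# At inference time, the coefficients rank candidates; the model is untouched.
candidates = {"short": [0.9, 0.1], "formal": [0.2, 0.9]}
best = max(candidates, key=lambda name: user_reward(coeffs, candidates[name]))
```

The per-user state here is just the short `coeffs` list, which is what makes the signal cheap to store and swap between users.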
But 'replace' deserves a caveat the corpus keeps pointing at: decoding-time methods are only as good as the reward they follow, and reward quality is itself becoming a research frontier. Reward models score better when they reason before judging (Can reward models benefit from reasoning before scoring?), and a reward signal can even be derived from the model's own confidence rather than from human labels (Can model confidence work as a reward signal for reasoning?). The richer and more reliable these signals get, the more weight reward-guided decoding can carry. The flip side: if your reward is a black box from somewhere else entirely, say recommendation metrics like NDCG, people are still reaching for RL weight training rather than pure decoding-time steering (Can recommendation metrics train language models directly?).
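The dependence on reward quality is easy to see in a toy version of reward-guided selection. The sketch below assumes the simplest possible setup: a fixed set of candidate completions with model log-probabilities, re-ranked by log-probability plus a weighted reward. The candidates, both reward functions, and the `beta` weight are hypothetical.

```python
def guided_pick(candidates, reward_fn, beta=1.0):
    """Return the candidate text maximizing logprob + beta * reward.

    candidates: list of (text, logprob) pairs from a frozen model.
    """
    return max(candidates, key=lambda c: c[1] + beta * reward_fn(c[0]))[0]

candidates = [
    ("Sure! Here is a long, friendly walkthrough of every step.", -4.0),
    ("Restart the service.", -5.0),
]

def brevity_reward(text):
    """Informative reward for a user who wants short answers."""
    return -len(text.split()) / 10.0

def flat_reward(text):
    """Uninformative reward: every candidate scores the same."""
    return 0.0

pick_brevity = guided_pick(candidates, brevity_reward, beta=5.0)
pick_flat = guided_pick(candidates, flat_reward, beta=5.0)
```

With the informative reward the short answer wins despite its lower log-probability; with the flat reward the steering does nothing and the model's default choice comes back.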
There's also a quieter argument that the fine-tuning-vs-decoding framing is slightly false. LIMA shows that alignment is mostly *activating* capabilities the pretrained model already has, not installing new ones: 1,000 curated examples rival massive datasets (Can careful curation replace massive alignment datasets?). If alignment is surfacing latent behavior rather than building it, then a decoding-time controller and a light fine-tune are two knobs on the same underlying dial, which is exactly why proxy-tuning can imitate a fine-tune so closely. And some capabilities can even be folded into training so cheaply that they cost nothing at inference: models can learn to evaluate themselves in the unused space after their output (Can models learn to evaluate their own work during training?).
The thing you might not have known you wanted to know: the real dividing line isn't 'decoding vs. weights,' it's *where the user-specific information lives and how often it changes.* Per-user preferences that shift constantly want to live in a cheap, hot-swappable reward signal — that's decoding-time territory, and it wins on knowledge preservation and per-user cost. Stable, shared behaviors that everyone needs are fine to bake into weights once. Reward-guided decoding doesn't replace fine-tuning so much as it relocates personalization out of the weights and into the signal — which, for a system serving many different people, is often the point.
Sources (8 notes)
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. The model can then be personalized to each user via inference-time reward alignment, without weight modification.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Rec-R1 demonstrates that LLMs can be trained directly on rule-based recommendation metrics like NDCG and Recall as RL reward signals, eliminating the need for SFT distillation from proprietary models while remaining model-agnostic across different retriever architectures.
LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.