Can compact reward function representations beat text-based personalization approaches?
This explores whether learning a small set of reward-function parameters (compact numeric representations of what a user wants) can outperform describing a user in natural-language text — and the corpus suggests the honest answer is 'it depends what you're optimizing for,' because the two approaches win on different axes.
The corpus stages this as a genuine contest, and the most direct answer comes down on the *text* side. The PLUS work ("Can text summaries beat embeddings for personalized reward models?") trains the summarizer and the reward model together and finds that text-based preference summaries condition a reward model more effectively than embedding vectors, capturing dimensions that zero-shot summaries miss. As a bonus, those summaries stay human-readable and even transfer to other models like GPT-4. So on raw conditioning quality, the verbose representation wins.
But 'compact' has a second meaning the corpus rewards: not how *expressive* the representation is, but how *cheaply you can acquire it*. Here the reward-coefficient camp shines. PReF ("Can user preferences be learned from just ten questions?") learns a fixed set of base reward functions, then represents any individual as a linear combination of them. It can pin down a new user's coefficients with about ten well-chosen questions, using active learning to ask only what reduces uncertainty most. No weights are retrained; personalization happens at inference time. That's a very different proposition from writing and maintaining a text dossier on every user. The compact representation loses on richness but wins on data efficiency and cold start.
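The linear-combination idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PReF's actual implementation: it assumes a Bradley-Terry (logistic) model of how users answer pairwise questions, approximates the posterior over a user's coefficients with weighted random samples, and picks each question by expected information gain. The names `expected_info_gain`, `infer_coefficients`, and `answer_fn` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def expected_info_gain(diffs, particles, weights):
    """Mutual information between each question's answer and the coefficients.

    diffs:     (n_pairs, k) base-reward score differences phi(a) - phi(b)
    particles: (n_particles, k) sampled coefficient vectors (assumed posterior)
    weights:   (n_particles,) posterior weights summing to 1
    """
    probs = sigmoid(diffs @ particles.T)      # P(a preferred) under each sample
    mean_p = probs @ weights                  # posterior predictive probability
    return binary_entropy(mean_p) - binary_entropy(probs) @ weights

def infer_coefficients(diffs, answer_fn, n_questions=10, n_particles=2000, seed=0):
    """Ask n_questions pairwise questions and return (estimate, asked indices).

    answer_fn(i) must return 1 if the user prefers option a of pair i, else 0.
    """
    rng = np.random.default_rng(seed)
    particles = rng.normal(size=(n_particles, diffs.shape[1]))  # assumed N(0, I) prior
    weights = np.full(n_particles, 1.0 / n_particles)
    remaining = list(range(len(diffs)))
    asked = []
    for _ in range(n_questions):
        gains = expected_info_gain(diffs[remaining], particles, weights)
        idx = remaining.pop(int(np.argmax(gains)))   # most informative unasked pair
        asked.append(idx)
        like = sigmoid(particles @ diffs[idx])       # likelihood of answering "a"
        if answer_fn(idx) == 0:
            like = 1.0 - like
        weights = weights * like
        weights = weights / weights.sum()            # Bayesian reweighting
    return weights @ particles, asked                # posterior-mean coefficients
```

Each row of `diffs` is the difference in base-reward scores between two candidate responses, so the base reward functions are never retrained in this sketch; only the per-user coefficient estimate changes as answers arrive.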
There's a third voice that reframes the whole question. The PRIME framework ("Does abstract preference knowledge outperform specific interaction recall?") finds that *abstract* preference knowledge, whether summaries or parametric encodings, consistently beats *episodic* recall of specific past interactions. The interesting wrinkle is that PRIME treats text summaries and compact parametric encodings as two flavors of the same winning strategy (semantic abstraction), both beating the retrieve-the-raw-history approach. Read alongside PLUS and PReF, this suggests the real divide isn't text vs. coefficients at all; it's *abstraction vs. raw recall*. Both contenders in your question are on the abstraction side; the thing they jointly beat is naively replaying interaction logs.
Worth knowing before you commit to per-user reward functions: there's a sharp downside. Specializing a reward model to each individual removes the quiet averaging that aggregate reward models provide, and the corpus warns this lets systems learn sycophancy and harden echo chambers at scale, mirroring recommender-system failure modes ("Does personalizing reward models amplify user echo chambers?"). The compact, sharply fit reward function is exactly the thing that overfits to telling you what you want to hear. A text summary, being interpretable, at least lets a user *see* and correct the caricature the system has built.
If you want to go wider, the corpus has adjacent moves worth a look: reward models that *reason* before scoring rather than emitting a single number ("Can reward models benefit from reasoning before scoring?"), LLMs that *construct* reward functions from simplified problem abstractions ("Can LLMs design reward functions for reinforcement learning?"), and the recommender literature's parallel discovery that users aren't single latent vectors but mixtures of attention-weighted personas ("Can attention mechanisms reveal which user taste explains each recommendation?"). That last result hints that *neither* a single coefficient vector nor a single text blurb may be the right shape for a person in the first place.
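To make the "mixture of attention-weighted personas" idea concrete, here is a minimal sketch, assuming dot-product attention between a candidate item and a user's persona vectors. It is an illustration of the general idea only; AMP-CF's actual architecture may differ, and `score_item` is a hypothetical name.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def score_item(personas, item):
    """Score one candidate item for a user held as several persona vectors.

    personas: (n_personas, d) latent persona vectors for one user
    item:     (d,) latent vector of the candidate item
    Returns (score, attention), where attention shows which persona
    the item matched and therefore explains the recommendation.
    """
    attention = softmax(personas @ item)   # persona weights depend on the item
    user_vec = attention @ personas        # item-specific user representation
    return float(user_vec @ item), attention

# Two personas with orthogonal tastes; the item lines up with the first one.
personas = np.array([[1.0, 0.0], [0.0, 1.0]])
score, attention = score_item(personas, np.array([2.0, 0.0]))
explaining_persona = int(np.argmax(attention))   # index 0 for this item
```

Because the attention weights are recomputed per candidate item, the same user can look like a different persona for different items, which is the property a single coefficient vector cannot express.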
Sources (7 notes)
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.
PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
MEDIC shows that LLMs can generate effective reward shaping functions by first solving a deterministic, simplified version of the RL problem, then converting the resulting plan into shaping rewards for the original stochastic task. A model-based critic validates LLM outputs before deployment.
AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.