SYNTHESIS NOTE
Recommender Systems Conversational AI and Personalization

Can text summaries beat embeddings for personalized reward models?

When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning
Standard RLHF models the entire user population with a single reward model. Prior pluralistic approaches either condition on embedding vectors (which compress text into single vectors, losing information) or use in-context learning with raw conversation histories (which hurts generalization across topics). PLUS proposes a third path: learn text-based summaries of user preferences via RL, then condition the reward model on these summaries.

The architecture is a co-adaptation loop. A summarizer is trained with PPO to generate user preference summaries from past conversation histories. A reward model is simultaneously trained to make personalized predictions conditioned on these summaries. The summarizer's reward signal is the reward model's predictive accuracy — so the summarizer learns which aspects of past conversations actually matter for predicting future preferences, rather than which topics were discussed.
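As a rough illustration of the co-adaptation loop described above, the toy Python sketch below trains a "summarizer" and a "reward model" together. It is not the PLUS implementation, and it rests on several simplifying assumptions: a plain REINFORCE update stands in for PPO, the summarizer is a single logit that chooses between a topic summary and a preference summary, and the reward model is a count table rather than a neural network. All names (train_coadaptation, theta, counts) are hypothetical.

```python
import math
import random

def train_coadaptation(steps=2000, lr=0.1, seed=0):
    """Toy co-adaptation loop; returns P(summarizer writes a preference summary)."""
    rng = random.Random(seed)
    theta = 0.0      # summarizer logit for "write a preference summary"
    baseline = 0.5   # running mean reward, used to reduce gradient variance
    counts = {}      # toy reward model: label counts per summary, [n_label0, n_label1]

    for _ in range(steps):
        topic = rng.randint(0, 1)   # what the user talked about (uninformative)
        pref = rng.randint(0, 1)    # hidden preference dimension (informative)
        label = pref                # which of two responses the user prefers

        # Summarizer samples what kind of summary to write.
        p_pref = 1.0 / (1.0 + math.exp(-theta))
        wrote_pref = rng.random() < p_pref
        summary = ("pref", pref) if wrote_pref else ("topic", topic)

        # Reward model predicts the preferred response from the summary alone.
        n0, n1 = counts.setdefault(summary, [0, 0])
        prediction = 1 if n1 > n0 else 0

        # Summarizer reward is the reward model's predictive accuracy.
        reward = 1.0 if prediction == label else 0.0

        # Reward model update (simultaneous training on the observed label).
        counts[summary][label] += 1

        # REINFORCE update for the summarizer (stand-in for PPO).
        action = 1.0 if wrote_pref else 0.0
        theta += lr * (reward - baseline) * (action - p_pref)
        baseline += 0.05 * (reward - baseline)

    return 1.0 / (1.0 + math.exp(-theta))

if __name__ == "__main__":
    print("P(preference summary) after training:", train_coadaptation())
```

In this toy setting the topic summary predicts the label no better than chance, while the preference summary predicts it almost perfectly. The policy therefore drifts toward preference summaries, which mirrors the shift from topics to preference dimensions that RL training is reported to produce.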

The critical finding is that untrained summarizers focus on conversation topics ("the user asked about cats") rather than preference dimensions ("the user values concise, factual information"). RL training shifts attention to the dimensions that matter for prediction. Zero-shot summaries fail because they lack this discriminative signal.

The practical implications are significant: the text summaries are portable (transferring to GPT-4 for zero-shot personalization), interpretable (users can read and modify them), and concise. This connects to the broader tension between personalization and alignment. Given the open question of whether chatbot personalization builds trust or exposes privacy risks (see Does chatbot personalization build trust or expose privacy risks?), PLUS's transparent text summaries may offer a less opaque path to personalization than embedding-based approaches.

Complementary approaches form a design space for personalized alignment. PReF (Personalization via Reward Factorization) represents user preferences as weighted sums of base reward functions and infers per-user weights via active learning with only 10-20 preference queries, with no historical data needed. P-RLHF takes a different approach: a lightweight user model captures individual preferences jointly with the LLM, handling both explicit preferences (stated) and implicit preferences (from feedback data) without pre-defined preference dimensions. The curiosity reward approach eliminates pre-conversation calibration entirely: the agent learns about the user during conversation by being rewarded for reducing uncertainty about user type (see Can conversations themselves personalize without user profiles?). Together, these methods span a spectrum: PLUS requires historical data but produces portable summaries; PReF requires 10-20 active queries but no history; curiosity reward requires nothing upfront but learns more slowly. The choice depends on available data and acceptable latency to personalization.
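To make the weighted-sum idea behind PReF concrete, here is a minimal Python sketch. It makes several assumptions that do not come from the source: a Bradley-Terry (logistic) preference model, a noise-free simulated user, and randomly drawn queries in place of PReF's active query selection. The function names, the three base rewards, and the true_w values are all hypothetical.

```python
import math
import random

def personalized_reward(weights, base_rewards):
    """Weighted sum of base reward scores for one response."""
    return sum(w * r for w, r in zip(weights, base_rewards))

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def infer_weights(queries, dims, lr=0.5, epochs=200):
    """Fit per-user weights from pairwise choices by logistic regression.

    Each query is (scores_a, scores_b, chose_a), where scores_* are the
    base reward vectors of the two candidate responses.
    """
    w = [0.0] * dims
    for _ in range(epochs):
        grad = [0.0] * dims
        for a, b, chose_a in queries:
            diff = [ai - bi for ai, bi in zip(a, b)]
            p_a = sigmoid(personalized_reward(w, diff))
            for k in range(dims):
                grad[k] += (float(chose_a) - p_a) * diff[k]
        # Gradient ascent on the mean log-likelihood of the observed choices.
        w = [wk + lr * gk / len(queries) for wk, gk in zip(w, grad)]
    return w

rng = random.Random(0)
true_w = [0.7, 0.2, 0.1]  # hypothetical user who mostly values base reward 0

def random_pair():
    """Simulate one preference query answered by the noise-free user."""
    a = [rng.uniform(-1, 1) for _ in true_w]
    b = [rng.uniform(-1, 1) for _ in true_w]
    chose_a = personalized_reward(true_w, a) > personalized_reward(true_w, b)
    return a, b, chose_a

queries = [random_pair() for _ in range(20)]   # 20 queries, no history
w_hat = infer_weights(queries, dims=3)

held_out = [random_pair() for _ in range(200)]
agreement = sum(
    (personalized_reward(w_hat, a) > personalized_reward(w_hat, b)) == chose_a
    for a, b, chose_a in held_out
) / len(held_out)
print("held-out agreement with the simulated user:", agreement)
```

Only the direction of the inferred weight vector matters for ranking responses, so the sketch measures agreement on held-out pairs instead of comparing w_hat to true_w directly.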

Original note title

learned text-based user preference summaries condition reward models more effectively than embedding vectors for pluralistic alignment