Can user preferences be learned from just ten questions?
Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
Standard RLHF trains a single reward model on aggregated human preferences, assuming a universal preference structure. PReF (Personalization via Reward Factorization) makes a different assumption: user preferences lie in a low-dimensional space and can be represented as weighted sums of a small set of base reward functions.
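The factorization assumption can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `user_reward` and `preference_prob` are hypothetical, the base-reward values are made up for illustration, and the logistic (Bradley-Terry style) link between reward differences and pairwise choices is an assumption, chosen because the note later refers to logistic bandit theory.

```python
import math

def user_reward(coeffs, base_rewards):
    """Personalized reward: weighted sum of base reward values."""
    if len(coeffs) != len(base_rewards):
        raise ValueError("one coefficient per base reward is required")
    return sum(w * r for w, r in zip(coeffs, base_rewards))

def preference_prob(coeffs, base_a, base_b):
    """P(user prefers A over B) under an assumed logistic link."""
    diff = user_reward(coeffs, base_a) - user_reward(coeffs, base_b)
    return 1.0 / (1.0 + math.exp(-diff))

# Illustrative values only: two base dimensions (conciseness, formality).
response_a = [0.9, 0.2]   # concise, casual
response_b = [0.1, 0.8]   # detailed, formal
terse_user = [2.0, 0.0]   # hypothetical user who only cares about conciseness
formal_user = [0.0, 2.0]  # hypothetical user who only cares about formality

p_terse = preference_prob(terse_user, response_a, response_b)
p_formal = preference_prob(formal_user, response_a, response_b)
```

The same pair of responses gets a preference probability above 0.5 for one user and below 0.5 for the other. The base rewards are shared; only the coefficient vector differs per user.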
The three-stage architecture:
Base reward learning: train a set of base reward functions from paired preference data annotated with user identity. Each base function captures one dimension of preference variation (e.g., conciseness vs. detail, formality vs. casualness).
User coefficient inference: present the new user with a sequence of prompts, each with a pair of candidate responses, and ask which response they prefer. The questions are selected adaptively using active learning: each question is chosen to maximally reduce uncertainty about the user's coefficients. Results from logistic bandit theory make this uncertainty cheap to compute.
Inference-time alignment: once the user-specific coefficients are known, use inference-time methods to generate responses aligned with that user's reward, without modifying model weights. This enables scalable per-user adaptation.
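The coefficient-inference stage can be sketched as a small active-learning loop. This is a sketch under stated assumptions, not PReF's algorithm: it swaps the logistic-bandit uncertainty computation for a brute-force posterior over a coarse grid of candidate coefficient vectors, and it picks each question by minimizing expected posterior entropy. All names (`infer_coefficients`, `select_query`, `simulated_user`), the grid, the query set, and the simulated user are hypothetical.

```python
import itertools
import math

def pref_prob(coeffs, diff):
    """P(user prefers response A), where diff = base rewards of A minus B."""
    z = sum(w * d for w, d in zip(coeffs, diff))
    return 1.0 / (1.0 + math.exp(-z))

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def bayes_update(posterior, grid, diff, prefers_a):
    """Reweight every candidate coefficient vector by the answer's likelihood."""
    weighted = []
    for p, coeffs in zip(posterior, grid):
        like = pref_prob(coeffs, diff)
        weighted.append(p * (like if prefers_a else 1.0 - like))
    total = sum(weighted)
    return [w / total for w in weighted]

def expected_entropy(posterior, grid, diff):
    """Posterior entropy averaged over the two possible answers."""
    p_a = sum(p * pref_prob(c, diff) for p, c in zip(posterior, grid))
    h_a = entropy(bayes_update(posterior, grid, diff, True))
    h_b = entropy(bayes_update(posterior, grid, diff, False))
    return p_a * h_a + (1.0 - p_a) * h_b

def select_query(posterior, grid, queries):
    """Index of the query expected to leave the least uncertainty."""
    return min(range(len(queries)),
               key=lambda i: expected_entropy(posterior, grid, queries[i]))

def infer_coefficients(answer_fn, queries, grid, n_queries=10):
    posterior = [1.0 / len(grid)] * len(grid)   # uniform prior over the grid
    for _ in range(n_queries):
        diff = queries[select_query(posterior, grid, queries)]
        posterior = bayes_update(posterior, grid, diff, answer_fn(diff))
    best = max(range(len(grid)), key=lambda j: posterior[j])
    return grid[best], posterior

# Hypothetical setup: 2 base rewards, coefficients on a coarse grid.
levels = [-2.0, -1.0, 0.0, 1.0, 2.0]
grid = list(itertools.product(levels, repeat=2))
queries = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
true_coeffs = (2.0, -1.0)   # simulated user, unknown to the learner

def simulated_user(diff):
    """Answers True when the simulated user prefers response A."""
    return sum(w * d for w, d in zip(true_coeffs, diff)) > 0

estimate, posterior = infer_coefficients(simulated_user, queries, grid)
```

A grid posterior only works for a handful of dimensions, which is where closed-form uncertainty estimates of the kind the note attributes to logistic bandit theory become necessary. The selection rule is the same in spirit: ask the question whose answer is expected to tell you the most.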
The practical significance: 10-20 adaptively chosen questions suffice to estimate a new user's coefficients. This is far cheaper than approaches that require historical interaction data or per-user fine-tuning. The active learning component is critical: random question selection would require far more queries, because most questions do little to distinguish one user from another.
The low-dimensional preference assumption is both the strength and the limitation. If real preferences don't decompose into a small number of base dimensions, the factorization misses important variation. However, the survey evidence in "How do personalization granularity levels trade precision against scalability?" suggests that persona-level personalization (group-based, moderate dimensionality) is often sufficient, and that user-level precision comes at the cost of higher data requirements.
The inference-time alignment component connects to "Can decoding-time tuning preserve knowledge better than weight fine-tuning?". Both avoid per-user weight modification, but PReF applies a user-specific reward function while proxy tuning applies a task-specific distributional shift. Together they suggest a design space: different axes of adaptation (user preferences, task requirements, domain knowledge) can each be applied at inference time through different mechanisms.
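One simple way to apply a user-specific reward at inference time is best-of-N reranking. The note does not say which inference-time method PReF uses, so treat this as an assumed stand-in. The function names are hypothetical, and `brevity` and `formality` are toy heuristics standing in for learned base reward models.

```python
def personalized_score(coeffs, base_scores):
    """Weighted sum of base reward scores for one user."""
    return sum(w * s for w, s in zip(coeffs, base_scores))

def best_of_n(candidates, coeffs, base_reward_fns):
    """Return the candidate with the highest user-specific reward."""
    def score(text):
        return personalized_score(coeffs, [fn(text) for fn in base_reward_fns])
    return max(candidates, key=score)

# Toy stand-ins for learned base reward models (hypothetical heuristics).
def brevity(text):
    return -float(len(text.split()))          # fewer words scores higher

def formality(text):
    looks_formal = text[:1].isupper() and text.endswith(".")
    return 1.0 if looks_formal else 0.0

base_fns = [brevity, formality]

# Pretend these were sampled from the same frozen language model.
candidates = [
    "yep, restart it",
    "Please restart the service and confirm that the error no longer appears.",
]

terse_user = (1.0, 0.0)    # hypothetical coefficients: values brevity only
formal_user = (0.1, 5.0)   # hypothetical coefficients: values formality most

choice_terse = best_of_n(candidates, terse_user, base_fns)
choice_formal = best_of_n(candidates, formal_user, base_fns)
```

The generator and the base reward functions are shared across users. Only the coefficient vector changes, which is what makes per-user adaptation cheap compared with storing per-user weights.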
Inquiring lines that use this note as a source (94)
This note is a source for these synthesized inquiries.
- Does learning community preferences as training rewards operationalize prediction without participation?
- Why does belief-specific tailoring work better than demographic personalization?
- How do attribute-asking strategies depend on current confidence in candidate items?
- How should preference channels from historical sessions inform unified policy learning?
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- Does universal approximation guarantee help with finite recommendation data?
- How much task-relevant persona information is needed for accurate preference prediction?
- Do verbal uncertainty estimates calibrate better than confidence scores for personalization?
- Can curiosity-driven dialogue incrementally discover user interest journeys in real time?
- What makes historical user outputs more effective for personalization than semantic similarity?
- How does personalization create tradeoffs between trust and privacy concerns?
- Can systems guide users adaptively without imposing predetermined dialogue structures?
- Why do one-shot studies fail to capture personalization effects?
- Which personalization techniques expose user data most directly?
- Can personalized questions improve conversation quality in open-domain chat?
- How does asymmetric information shape what to ask users first?
- How do neural networks extend contextual bandits beyond linear reward assumptions?
- How can a single policy handle both asking preferences and recommending items?
- Can curiosity-driven personalization work better than pre-conversation preference elicitation?
- How do intrinsic motivation mechanisms differ between social proactivity and personalization?
- How much user interaction data is needed for effective AI personalization?
- Why do real-world platforms need inductive learning for streaming recommendation systems?
- How should aspect selection adapt across different item categories and users?
- What real-world applications have context distributions that enable exploration-free bandits?
- What makes behavior relevance scoring against candidates more effective than fixed user profiles?
- How should recommendation systems balance individual preference signals with population-level patterns?
- Can side information alone predict preferences without rating history?
- How does personalization differ mechanically from retrieval-augmented generation?
- Can preference dimensions extracted from outputs replace topic-based user summaries?
- How do input length constraints reshape personalization system design choices?
- Why might text-only interfaces underestimate agent preference elicitation capabilities?
- Why do explicit ratings fail to capture uncertainty in user preferences?
- Can curiosity rewards about user type complement general social motivation frameworks?
- Could reward signals incentivize active intent discovery over passive response generation?
- Can attribute-specific preference optimization improve question quality in information-seeking?
- Why does sparsity per user make probabilistic models more effective?
- Why do standard preference alignment methods fail at the individual user level?
- How does textual-only feedback limit what a persona can learn about users?
- Does semantic memory improve AI personalization more than episodic memory?
- How do text-based preference summaries compare to embedding vectors for conditioning?
- Can reward models be personalized if annotators lack stable preferences?
- How can agents detect whether users are willing to follow their topic guidance?
- How can agents learn to estimate user satisfaction in real-time during conversation?
- Can question quality be trained separately from the decision to ask?
- Can agents balance goal-driven proactivity with user preference alignment?
- What role does uncertainty reduction play in personalized agent interaction?
- Can reward-guided decoding replace weight fine-tuning for personalized alignment?
- How do Bayesian models share statistical strength across sparse user datasets?
- Do weight changes in recommender systems produce faster producer adaptation when content is automated?
- How does active learning reduce queries needed for user preference inference?
- When does low-dimensional preference factorization miss important user variation?
- What preference dimensions do base reward functions typically capture?
- How do inference-time reward methods compare to per-user fine-tuning?
- Can linear bandit methods scale beyond their original reward assumptions?
- Can abstract preference summaries substitute for specific user interaction history?
- Can input-only training encode user preferences without task-specific labels?
- How does task-oriented fine-tuning compare to preference tuning methods?
- What distinguishes genuine user preferences from similar-user preferences in sparse data?
- Could AI agents scale the friend-with-different-preferences recommendation mechanism?
- Can hypernetworks generate recommendation parameters more efficiently than retraining full models?
- How can insert-expansion techniques help users discover their own preferences?
- What multi-turn reward structures would encourage active intent discovery?
- Why does persona-level information often fail to predict individual preferences?
- How can recommendation models handle per-user concept drift instead of global drift?
- Can active learning queries personalize reward models with few examples per user?
- How do reward features learned from group data generalize to new users?
- What makes minority preferences disappear in aggregated single-distribution reward models?
- How do personalized reward models avoid excluding minority viewpoints?
- Can reward factorization actually scale personalization to large user bases?
- When does clustering users by preference overcome the aggregation dilemma?
- Can personalized reward models amplify sycophancy without ethical guardrails?
- Can smaller judge models better capture human preferences than larger prompted models?
- Can users modify their preference summaries to steer model behavior?
- How can agents learn user preferences during conversation without pre-calibration?
- How do aggregate reward models fail to capture minority user preferences?
- Can personalized systems reward honest disagreement instead of user confirmation?
- Can user preferences be represented as linear reward combinations?
- Can reward models distinguish between personal preference and community consensus?
- What makes policy discrimination scalable where preference annotation hits bottlenecks?
- Do personalized reward models work better than one-size-fits-all approaches?
- Can we cheaply estimate which samples are currently most informative?
- How do pairwise comparisons convert subjective quality into trainable ranking signals?
- Can variational inference recover user-specific reward models from preference comparisons?
- Can rich environment feedback replace human preference labels entirely?
- How do binary comparisons constrain reward scale in multi-user preference learning?
- Can better prompting techniques overcome weak personalization in recommender systems?
- Why does single-reward RLHF fail to represent diverse human preferences?
- How do static benchmarks fail to capture human preference alignment?
- What validity threats exist in crowdsourced preference signals?
- What makes user-decision rewards better than model-confidence rewards?
- How can models select the optimal question to ask given multiple uncertainties?
- Can compact reward function representations beat text based personalization approaches?
- How do aggregate reward models systematically exclude minority preferences?
- Can latent-variable reward models capture multimodal preference distributions?
Related concepts in this collection (6)
- Can text summaries beat embeddings for personalized reward models?
  When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
  PLUS uses RL-trained text summaries; PReF uses factorized reward functions. Complementary approaches to the same problem.
- Can decoding-time tuning preserve knowledge better than weight fine-tuning?
  Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
  both are inference-time adaptation methods; different mechanisms
- Does chatbot personalization build trust or expose privacy risks?
  Explores whether personalization features that increase user trust and social connection simultaneously heighten privacy concerns and create rising behavioral expectations over time.
  PReF's explicit preference queries may increase privacy concerns vs implicit approaches
- Does preference data need more raters than examples?
  Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?
  theoretical companion: PReF demonstrates that 10-20 queries suffice empirically; the PAC bound provides the formal account of why. When reward features are learned from group data, generalization error decomposes into per-rater example count and per-feature rater count, and feature learning requires rater diversity, not just example depth
- Can aggregate reward models satisfy genuinely disagreeing users?
  When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
  the motivating problem in sharper form: PReF was built to solve the disagreement-dilemma that aggregate RLHF cannot escape
- Does personalizing reward models amplify user echo chambers?
  Personalized reward models solve the minority-preference problem but may introduce new risks by reinforcing existing user beliefs and narrowing exposure to diverse viewpoints.
  productive caveat: the technical solution PReF provides creates new alignment risks; per-user reward specialization can reinforce existing views, amplify sycophancy, and accelerate opinion polarization at population scale
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Language Model Personalization via Reward Factorization
- Capturing Individual Human Preferences with Reward Features
- Personalized Language Modeling from Personalized Human Feedback
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
- Enhancing personalized multi-turn dialogue with curiosity reward
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
- Collaborative Rational Speech Act: Pragmatic Reasoning for Multi-Turn Dialog
Original note title
reward factorization represents user-specific preferences as linear combinations of base reward functions — 10 active-learning queries suffice for personalization