SYNTHESIS NOTE

Can user preferences be learned from just ten questions?

Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.

Synthesis note · 2026-02-23 · sourced from Assistants Personalization

Standard RLHF trains a single reward model on aggregated human preferences, assuming a universal preference structure. PReF (Personalization via Reward Factorization) makes a different assumption: user preferences lie in a low-dimensional space and can be represented as weighted sums of a small set of base reward functions.
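The factorization can be written out explicitly. The notation below is an illustrative assumption, not taken from the source: K base reward functions \phi_k, a coefficient vector \lambda_u for user u, and a Bradley-Terry (logistic) link between reward differences and pairwise choices. The logistic link is assumed here because the note later refers to logistic bandit theory.

```latex
% Personalized reward as a weighted sum of K base rewards (assumed notation)
r_u(x, y) = \sum_{k=1}^{K} \lambda_{u,k}\, \phi_k(x, y) = \lambda_u^\top \phi(x, y)

% Assumed Bradley-Terry preference model for user u choosing y_1 over y_2
P(y_1 \succ y_2 \mid x, u)
  = \sigma\!\big(\lambda_u^\top [\phi(x, y_1) - \phi(x, y_2)]\big),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Under this reading, personalizing to a new user reduces to estimating the K numbers in \lambda_u, since the base functions \phi_k are shared across users.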

The three-stage architecture:

1. Base reward learning: train a set of base reward functions from paired preference data annotated with user identity. Each base function captures one dimension of preference variation (e.g., conciseness vs detail, formality vs casualness).
2. User coefficient inference: present the new user with a sequence of prompts, each paired with candidate responses, and ask which response they prefer. The questions are selected adaptively using active learning: each question is chosen to maximally reduce uncertainty about the user's coefficients. Results from logistic bandit theory enable efficient uncertainty computation.
3. Inference-time alignment: once the user-specific coefficients are known, use inference-time methods to generate reward-aligned responses without modifying model weights. This enables scalable per-user adaptation.
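The coefficient-inference stage can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not PReF's actual implementation: the candidate pool of base-reward differences is random, the user is simulated from a hypothetical `true_w`, preferences follow a logistic model, and the selection rule (pick the question whose feature difference `d` has the largest `d^T H^{-1} d`) is one common uncertainty heuristic from the logistic bandit literature. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    # Clip to keep exp() in a safe numeric range.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

def fit_map(D, y, reg=1.0, steps=25):
    """MAP logistic regression with a Gaussian prior, via Newton steps.
    D: (n, K) base-reward differences; y: (n,) answers in {0, 1}."""
    K = D.shape[1]
    w = np.zeros(K)
    for _ in range(steps):
        p = sigmoid(D @ w)
        grad = D.T @ (p - y) + reg * w
        H = (D.T * (p * (1.0 - p))) @ D + reg * np.eye(K)
        w = w - np.linalg.solve(H, grad)
    # Recompute the regularized Hessian at the final estimate.
    p = sigmoid(D @ w)
    H = (D.T * (p * (1.0 - p))) @ D + reg * np.eye(K)
    return w, H

def infer_coefficients(pool, answer_fn, n_questions=15, reg=1.0):
    """Ask n_questions adaptively; return estimate, Hessian, asked indices."""
    K = pool.shape[1]
    w_hat, H = np.zeros(K), reg * np.eye(K)
    asked, labels = [], []
    for _ in range(n_questions):
        # Score each candidate by d^T H^{-1} d (assumed uncertainty proxy).
        H_inv = np.linalg.inv(H)
        scores = np.einsum("ik,kl,il->i", pool, H_inv, pool)
        if asked:
            scores[asked] = -np.inf  # never repeat a question
        i = int(np.argmax(scores))
        asked.append(i)
        labels.append(answer_fn(i))
        w_hat, H = fit_map(pool[asked], np.array(labels, dtype=float), reg)
    return w_hat, H, asked

# Hypothetical setup: 200 candidate questions, 3 base reward dimensions.
rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 3))       # phi(x, y1) - phi(x, y2) per question
true_w = np.array([2.0, -1.0, 0.5])    # simulated user's hidden coefficients

def simulated_user(i):
    # Returns 1 if the simulated user prefers the first response.
    return int(rng.random() < sigmoid(pool[i] @ true_w))

w_hat, H, asked = infer_coefficients(pool, simulated_user)
print("estimated coefficients:", np.round(w_hat, 2))
```

In a real system the pool rows would come from the learned base reward functions evaluated on actual response pairs, and `answer_fn` would be the user's click rather than a simulation.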

The practical significance: 10-20 adaptively selected questions suffice to estimate a new user's coefficients. This is far more efficient than approaches that require historical interaction data or per-user fine-tuning. The active learning component is critical: random question selection would require far more queries, because most questions do little to distinguish one user from another.
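One way to see why most questions are uninformative is to look at the information a single answer carries under a logistic preference model. The formulas below are a sketch under assumptions not stated in the source: d denotes the difference in base-reward vectors between the two responses in a question, \lambda the user's coefficient vector, and \gamma a regularization constant.

```latex
% Fisher information contributed by one answered question (assumed logistic model)
I(d) = \sigma(\lambda^\top d)\,\big(1 - \sigma(\lambda^\top d)\big)\; d\, d^\top

% Accumulated information after t questions
H_t = \gamma I_K + \sum_{s=1}^{t} I(d_s)

% A common selection heuristic: ask where the current estimate is least certain
d_{t+1} = \arg\max_{d \in \mathcal{D}} \; d^\top H_t^{-1} d
```

A question contributes little when d is close to zero (the two responses score alike on every base reward), when \sigma(\lambda^\top d) is near 0 or 1 (the answer is nearly predetermined), or when d points in a direction H_t already covers well. A randomly drawn question often falls into one of these cases, which is consistent with the claim that random selection needs far more queries.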

The low-dimensional preference assumption is both the strength and the limitation. If real preferences don't decompose into a small number of base dimensions, the factorization misses important variation. However, the survey evidence in the note "How do personalization granularity levels trade precision against scalability?" suggests that persona-level personalization (group-based, moderate dimensionality) is often sufficient, and that user-level precision comes at the cost of higher data requirements.

The inference-time alignment component connects to the note "Can decoding-time tuning preserve knowledge better than weight fine-tuning?". Both avoid per-user weight modification, but PReF applies a user-specific reward function while proxy tuning applies a task-specific distributional shift. The combination suggests a design space: different axes of adaptation (user preferences, task requirements, domain knowledge) can each be applied at inference time through different mechanisms.
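To make the inference-time alignment step concrete, here is a minimal sketch using best-of-N reranking. The source does not say which inference-time method PReF uses, so best-of-N is an assumption chosen for simplicity. The candidate texts, the base score table, and the two user coefficient vectors are all hypothetical.

```python
def personalized_reward(base_scores, user_coeffs):
    # Weighted sum of base reward scores for one response.
    return sum(c * s for c, s in zip(user_coeffs, base_scores))

def best_of_n(candidates, score_fn, user_coeffs):
    # Pick the candidate with the highest user-specific reward.
    # Model weights are untouched; only the reranking depends on the user.
    return max(
        candidates,
        key=lambda text: personalized_reward(score_fn(text), user_coeffs),
    )

# Hypothetical base reward scores per candidate: (detail, formality).
BASE_SCORES = {
    "Short answer.": (0.1, 0.6),
    "A longer, more detailed answer with caveats.": (0.9, 0.8),
    "Hey! Quick casual take.": (0.2, 0.1),
}
candidates = list(BASE_SCORES)

def score_fn(text):
    # Stand-in for evaluating learned base reward functions on a response.
    return BASE_SCORES[text]

detail_lover = (1.0, 0.2)    # hypothetical coefficients for user A
casual_user = (-0.5, -1.0)   # hypothetical coefficients for user B

choice_a = best_of_n(candidates, score_fn, detail_lover)
choice_b = best_of_n(candidates, score_fn, casual_user)
print(choice_a)
print(choice_b)
```

The same candidate set yields different selections for the two users, which illustrates how per-user adaptation can live entirely in the scoring step.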

Related concepts in this collection (6)

Can text summaries beat embeddings for personalized reward models? When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
PLUS uses RL-trained text summaries; PReF uses factorized reward functions. Complementary approaches to the same problem.
Can decoding-time tuning preserve knowledge better than weight fine-tuning? Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
Both are inference-time adaptation methods, but they use different mechanisms.
Does chatbot personalization build trust or expose privacy risks? Explores whether personalization features that increase user trust and social connection simultaneously heighten privacy concerns and create rising behavioral expectations over time.
PReF's explicit preference queries may raise more privacy concerns than implicit approaches.
Does preference data need more raters than examples? Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?
Theoretical companion: PReF demonstrates empirically that 10-20 queries suffice, and the PAC bound provides the formal account of why. When reward features are learned from group data, generalization error decomposes into a term governed by the per-rater example count and a term governed by the per-feature rater count, so feature learning requires rater diversity, not just example depth.
Can aggregate reward models satisfy genuinely disagreeing users? When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
The motivating problem in sharper form: PReF was built to solve the disagreement dilemma that aggregate RLHF cannot escape.
Does personalizing reward models amplify user echo chambers? Personalized reward models solve the minority-preference problem but may introduce new risks by reinforcing existing user beliefs and narrowing exposure to diverse viewpoints.
Productive caveat: the technical solution PReF provides creates new alignment risks. Per-user reward specialization can reinforce existing views, amplify sycophancy, and accelerate opinion polarization at population scale.