Can variational inference recover user-specific reward models from preference comparisons?

This explores whether variational inference, which treats a person's tastes as hidden variables to be estimated from their pairwise choices, can rebuild a reward model tuned to one specific user. The corpus reframes the question as less about the inference machinery and more about what you assume the hidden 'user' looks like.

The setup: treat one person's preferences as latent variables and infer them from the comparisons they make ('A over B'). The corpus doesn't fixate on variational inference as a named technique, but it circles the same conceptual territory, and the most direct answer is encouraging. PReF ("Can user preferences be learned from just ten questions?") shows you can learn a small set of base reward functions from preference data, then represent any individual as a linear combination of those bases, inferring their personal coefficients at inference time without touching model weights. The striking result is how little data this needs: roughly ten adaptively chosen questions, each selected to maximally shrink the uncertainty in those coefficients. That active-learning move, picking the next comparison to reduce posterior uncertainty, presupposes an explicit posterior over the user's latent coefficients, which is the same object variational inference approximates, even though the paper doesn't wear the label.
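
A minimal sketch of what that loop could look like, under loudly stated assumptions: the user's reward is a linear combination `w` of `k` base rewards, choices follow a logistic (Bradley-Terry) model, and the Gaussian posterior over `w` is fit with the Jaakkola-Jordan variational bound for logistic regression. That bound and the "ask about the most uncertain pair" heuristic are stand-ins chosen for illustration, not PReF's published procedure, and every function name here is hypothetical.

```python
import numpy as np

def lam(xi):
    # Jaakkola-Jordan coefficient tanh(xi/2) / (4*xi); tends to 1/8 as xi -> 0.
    xi = np.maximum(np.abs(xi), 1e-8)
    return np.tanh(xi / 2.0) / (4.0 * xi)

def fit_posterior(X, t, prior_var=1.0, n_iter=20):
    """Variational Gaussian posterior N(m, S) over user coefficients w.

    X: (n, k) base-reward differences phi(A) - phi(B) for each question.
    t: (n,) 1.0 if the user chose A, 0.0 if they chose B.
    Assumes a zero-mean isotropic Gaussian prior with variance prior_var.
    """
    k = X.shape[1]
    S0_inv = np.eye(k) / prior_var
    xi = np.ones(len(t))  # one variational parameter per comparison
    for _ in range(n_iter):
        S = np.linalg.inv(S0_inv + 2.0 * (X.T * lam(xi)) @ X)
        m = S @ (X.T @ (t - 0.5))
        xi = np.sqrt(np.einsum("ni,ij,nj->n", X, S + np.outer(m, m), X))
    return m, S

def pick_question(candidates, S):
    # Heuristic: ask about the pair whose reward difference is most uncertain.
    return int(np.argmax(np.einsum("ni,ij,nj->n", candidates, S, candidates)))

def simulate_user(n_questions=10, k=4, seed=0):
    # Synthetic user and candidate pairs; purely illustrative data.
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=k)
    candidates = rng.normal(size=(200, k))
    X, t = np.empty((0, k)), np.empty(0)
    m, S = fit_posterior(X, t)  # with no data this returns the prior
    for _ in range(n_questions):
        x = candidates[pick_question(candidates, S)]
        p_first = 1.0 / (1.0 + np.exp(-w_true @ x))
        X = np.vstack([X, x])
        t = np.append(t, float(rng.random() < p_first))
        m, S = fit_posterior(X, t)
    return m, S, w_true
```

Calling `simulate_user()` returns the posterior mean, the posterior covariance, and the synthetic ground truth. Each answered question adds a positive semidefinite term to the posterior precision, so the covariance can only shrink as questions accumulate. How close the mean lands to the truth after ten questions depends on the simulated user and is not claimed here.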

The deeper lesson the corpus offers is that the *representation* of the user matters more than the inference algorithm. Plain latent vectors may be the wrong target. AMP-CF ("Can attention mechanisms reveal which user taste explains each recommendation?") argues a single person isn't one preference vector at all but several competing personas, weighted differently depending on the item in front of them, so what you're recovering isn't a point but a mixture. PLUS ("Can text summaries beat embeddings for personalized reward models?") goes further and shows that conditioning a reward model on a *learned text summary* of someone's preferences beats conditioning on an embedding vector, and stays interpretable to the user besides. PRIME ("Does abstract preference knowledge outperform specific interaction recall?") echoes this: abstracted preference knowledge outperforms replaying specific past interactions. The signal across all three is that the latent you want to infer is structured and semantic, not a flat coordinate.
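
To see why a persona mixture is a different inference target from a single vector, here is a toy sketch. It is not AMP-CF's architecture: the dot-product affinity, the softmax weighting, and the names `persona_score` and `temperature` are all assumptions made for illustration.

```python
import numpy as np

def persona_score(personas, item, temperature=1.0):
    """Score an item for a user modeled as several persona vectors.

    personas: (P, d) array, one hypothetical taste vector per persona.
    item: (d,) item embedding.
    Returns the mixed score and the item-dependent persona weights.
    """
    affinities = personas @ item             # how much each persona likes the item
    logits = affinities / temperature
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    return float(weights @ affinities), weights
```

Because the weights change with the item, the resulting score is not linear in the item embedding, so no single fixed preference vector reproduces it in general. The weights also indicate which persona accounts for a given score.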

There's also a quiet warning buried in the data you'd feed such a model. Annotation responses don't all measure the same thing ("Do all annotation responses measure the same underlying thing?"): some comparisons reflect genuine stable preferences, others are non-attitudes or preferences constructed on the spot. A naive inference scheme treats every comparison as evidence about one fixed reward; if a third of them are noise dressed as signal, your recovered model is contaminated. So 'can we recover it' partly depends on whether there's a stable 'it' there to recover in the first place.
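
One way to make the contamination concrete, as an illustrative assumption rather than anything the note proposes, is a contaminated Bradley-Terry likelihood. The symbols $r_u$ (the user's stable reward), $\sigma$ (the logistic function), and $\varepsilon$ (the fraction of responses that carry no stable preference) are introduced here for illustration, and treating every unstable response as a coin flip is itself a simplification:

```latex
P(A \succ B \mid r_u, \varepsilon)
  = (1 - \varepsilon)\,\sigma\bigl(r_u(A) - r_u(B)\bigr) + \frac{\varepsilon}{2}
```

With $\varepsilon = 1/3$, no choice can be predicted with probability above $1 - \varepsilon/2 = 5/6$, however strong the underlying preference. A scheme that implicitly fixes $\varepsilon = 0$ has to absorb those coin-flip answers by distorting $r_u$.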

Two cross-domain framings sharpen the picture. POLAR ("Can reward models learn by comparing policies instead of judging them?") reframes reward modeling entirely as measuring distance from a target policy rather than fitting absolute labels, a different inference target that sidesteps needing clean per-user preference scores. And the VAE collaborative-filtering work ("Why does multinomial likelihood work better for ranking recommendations?") is the closest the corpus comes to literal variational inference: it shows the *likelihood you assume* (multinomial vs. Gaussian) decides whether the recovered latent actually aligns with the ranking objective you care about. That's the transferable insight: variational recovery succeeds or fails on modeling choices, not on whether the inference runs.
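
For reference, the objective in question can be written as follows. This is the standard $\beta$-weighted evidence lower bound for a VAE over a user's interaction vector $x_u$ with latent $z_u$; the notation ($q_\phi$, $p_\theta$, $f_\theta$, $\pi$) is chosen here for illustration, and both likelihoods are given up to additive constants:

```latex
\mathcal{L}_\beta(x_u)
  = \mathbb{E}_{q_\phi(z_u \mid x_u)}\bigl[\log p_\theta(x_u \mid z_u)\bigr]
  - \beta\,\mathrm{KL}\bigl(q_\phi(z_u \mid x_u)\,\|\,p(z_u)\bigr)

% Multinomial likelihood: items compete for a fixed budget of probability mass
\log p_\theta(x_u \mid z_u) = \sum_i x_{ui} \log \pi_i(z_u),
  \qquad \pi(z_u) = \mathrm{softmax}\bigl(f_\theta(z_u)\bigr)

% Gaussian likelihood: each item is fit independently of the others
\log p_\theta(x_u \mid z_u) = -\tfrac{1}{2} \sum_i \bigl(x_{ui} - f_{\theta,i}(z_u)\bigr)^2
```

The inference procedure is identical in both cases; only the likelihood term changes. Under the softmax, raising one item's probability must lower the others', which is the competition that ties training to a ranking objective. The weight $\beta$ is the KL rebalancing knob.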

One more thing worth knowing, though you didn't ask: succeeding at this is double-edged. Personalized reward models, once recovered, drop the averaging effect that aggregate models provide, and the note "Does personalizing reward models amplify user echo chambers?" shows that's exactly the mechanism by which they learn to flatter users and harden echo chambers. So the open question isn't only whether inference *can* recover a user-specific reward, but whether you want it to, absent safeguards, once it can.

Sources 8 notes

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Rewards can then be personalized per user through inference-time alignment, without weight modification.

Can attention mechanisms reveal which user taste explains each recommendation?

AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Does abstract preference knowledge outperform specific interaction recall?

The PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

Why does multinomial likelihood work better for ranking recommendations?

Liang et al. show that switching VAE likelihoods from Gaussian/logistic to multinomial achieves state-of-the-art results because enforced probability competition between items directly aligns training with top-N ranking objectives. Rebalancing KL regularization further improves performance.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
