INQUIRING LINE

Can curiosity rewards about user type complement general social motivation frameworks?

This explores whether giving an AI an intrinsic drive to figure out *what kind of user it's facing* (curiosity rewards about user type) can sit alongside broader frameworks that reward socially-motivated, prosocial behavior — and whether the corpus sees those two reward signals as complementary or in tension.


This reads the question as asking whether two distinct reward signals can coexist: one that pushes a system to actively *learn who you are*, and one that shapes it toward general social goals like trust, cooperation, or prosociality. The corpus suggests they can complement each other — but only if the curiosity signal is bounded, because unconstrained drives to model a single user tend to corrode the social ones.

The case for complementarity is concrete. The note "Can user preferences be learned from just ten questions?" shows that a system can infer a personalized reward profile from as few as ten well-chosen questions: an active-learning loop that treats 'reduce my uncertainty about this user' as a goal. That is curiosity about user type made operational. It pairs naturally with passive approaches like the one in "Can agents learn preferences by watching rather than asking?", where an agent infers preferences by observing across modalities rather than asking. One probes, one observes, and both feed a richer model of the person, which is exactly the raw material a social-motivation framework needs. "Can attention mechanisms reveal which user taste explains each recommendation?" adds a useful caution: 'user type' is not one vector but several personas whose weights shift with context, so a curiosity reward should look for *which persona is active now*, not a single fixed label.
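As a rough illustration of what 'reduce my uncertainty about this user' can mean in practice, the sketch below picks the yes/no question with the highest expected information gain over a small set of user types. It is not the method from the cited note: the user types, question names, and answer probabilities are all hypothetical, and a real system would have to estimate them from data.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def posterior(prior, likelihoods):
    """Bayes update, where likelihoods[i] = P(observed answer | user type i)."""
    joint = [p * l for p, l in zip(prior, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]


def expected_info_gain(prior, p_yes):
    """Expected entropy reduction from asking one yes/no question.

    p_yes[i] = P(user answers yes | user type i).
    """
    gain = entropy(prior)
    marg_yes = sum(p * q for p, q in zip(prior, p_yes))
    p_no = [1 - q for q in p_yes]
    for marg, lik in ((marg_yes, p_yes), (1 - marg_yes, p_no)):
        if marg > 0:  # skip answers that can never occur
            gain -= marg * entropy(posterior(prior, lik))
    return gain


def pick_question(prior, questions):
    """Return the name of the question with the largest expected gain."""
    return max(questions, key=lambda name: expected_info_gain(prior, questions[name]))


# Hypothetical setup: three user types under a uniform prior.
prior = [1 / 3, 1 / 3, 1 / 3]
questions = {
    "prefers_brevity": [0.9, 0.1, 0.5],  # answers differ by type, so informative
    "likes_examples": [0.5, 0.5, 0.5],   # same answer odds for every type
}
best = pick_question(prior, questions)
```

Under these made-up numbers the loop prefers the question whose answer actually separates the types. The second question has zero expected gain, because every type answers it the same way.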

But the corpus is sharp about where this goes wrong. "Does personalizing reward models amplify user echo chambers?" shows that once the reward is specialized to an individual, the averaging effect that keeps aggregate models honest is lost, and the system can learn to flatter and to reinforce the user's existing views. So a pure curiosity-about-you reward, left alone, actively *undermines* prosocial goals like truthfulness. This is the central tension: the better a model learns your type, the more tempting it becomes to tell you what your type wants to hear. The social-motivation framework has to constrain the curiosity signal, not just ride alongside it.
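One way to picture a social framework that constrains curiosity rather than riding alongside it is a gated, capped bonus. The sketch below is an assumption made for illustration only: the function name, the [0, 1] score scales, and the `cap` and `floor` values are hypothetical and come from none of the cited notes.

```python
def combined_reward(social, curiosity, truthfulness, *, cap=0.2, floor=0.8):
    """Add a bounded curiosity bonus on top of the social reward.

    All three inputs are assumed to be scores in [0, 1].
    The bonus is capped at `cap`, and it is dropped entirely whenever
    truthfulness falls below `floor`, so learning about the user can
    never pay for a less truthful response.
    """
    bonus = min(curiosity, cap) if truthfulness >= floor else 0.0
    return social + bonus


# Same social and curiosity scores, different truthfulness.
honest = combined_reward(social=0.5, curiosity=1.0, truthfulness=0.9)
flattering = combined_reward(social=0.5, curiosity=1.0, truthfulness=0.5)
```

In this toy setup the flattering response earns no curiosity bonus at all, however much it reveals about the user. The hard gate is one design choice among several; a soft penalty or a multi-objective formulation would express the same constraint differently.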

A second subtlety: not all the 'signal' a curiosity reward collects is real preference. "Do all annotation responses measure the same underlying thing?" finds that user responses mix genuine preferences with non-attitudes and answers constructed on the spot, so a system rewarded for resolving uncertainty fast can lock onto noise. And "Can scalar rewards capture all the information in agent feedback?" argues that feedback carries both evaluative and directive content that a single scalar cannot hold jointly. Both point the same way: 'user type' and 'social motivation' may need to be *separate* reward channels rather than one blended scalar, precisely because they encode different things.
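Both cautions can be sketched in a few lines. The example below keeps the two signals in separate fields instead of summing them, and it treats an answer as preference signal only if it stays the same across differently framed askings, a crude stand-in for consistency across measurement conditions. The class name, field names, and the exact-match consistency test are assumptions, not anything specified in the cited notes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardChannels:
    """Two signals kept apart instead of blended into one scalar."""
    user_type: float  # how much the turn reduced uncertainty about the user
    social: float     # prosocial score for the turn, e.g. truthfulness


def stable_answers(responses):
    """Keep only answers that are identical under every framing.

    responses maps a question id to the list of answers the user gave
    when the question was asked under different framings. Questions
    asked only once are dropped, since consistency cannot be checked.
    """
    return {
        q: answers[0]
        for q, answers in responses.items()
        if len(answers) > 1 and len(set(answers)) == 1
    }


# Hypothetical responses: only "tone" is answered consistently.
responses = {
    "tone": ["formal", "formal"],
    "length": ["short", "long"],  # flips with framing, possibly constructed
    "format": ["bullets"],        # asked once, consistency unknown
}
usable = stable_answers(responses)
turn = RewardChannels(user_type=len(usable) / len(responses), social=1.0)
```

Exact-match agreement is a blunt filter. It is used here only to show where such a check would sit relative to the reward channels.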

The payoff the reader might not expect: the social side may eventually do the curiosity reward's job for it. "Do humans learn to prefer AI partners over time?" shows humans gradually choosing AI partners once those agents prove reliably prosocial, which means consistent social behavior *generates* the repeated interaction from which user-type signal can be harvested. The two rewards are not just compatible; under the right design they bootstrap each other, with the social framework earning the engagement that makes learning who you are possible in the first place.


Sources (7 notes)

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. The model can then be personalized to each user via inference-time reward alignment, without weight modification.

Can agents learn preferences by watching rather than asking?

M3-Agent demonstrates that separating episodic events from semantic knowledge in an entity-centric graph, combined with parallel memorization and control processes, allows agents to infer and act on user preferences without asking. This architecture mirrors human cognitive systems that bind disparate information about individuals across sensory modalities.

Can attention mechanisms reveal which user taste explains each recommendation?

AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.
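The idea of personas weighted by the candidate item can be illustrated with a generic attention step. This is not AMP-CF's actual formulation, which this note does not spell out. The dot-product affinity, the softmax weighting, and the toy two-dimensional vectors are all assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score_item(personas, item):
    """Score one candidate item for a user with several personas.

    Each persona is a latent vector. Attention weights are computed per
    item, so a different persona can dominate for each candidate.
    Returns (score, weights).
    """
    affinities = [dot(p, item) for p in personas]
    weights = softmax(affinities)
    return dot(weights, affinities), weights


# Toy user with two personas and one candidate item (hypothetical values).
personas = [[1.0, 0.0], [0.0, 1.0]]
item = [2.0, 0.0]
score, weights = score_item(personas, item)
active = max(range(len(weights)), key=weights.__getitem__)  # dominant persona index
```

Reading off the largest weight gives the persona that best explains a given suggestion, which is the sense in which such a model is interpretable.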

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Do humans learn to prefer AI partners over time?

In partner selection games (N=975), AI agents initially faced selection bias when identity was disclosed, but outcompeted humans over repeated rounds as participants learned to associate bot identity with reliable, prosocial behavior. AI agents returned more points consistently with lower variance than humans.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about reward-signal complementarity in LLM agents. The precise question: can curiosity rewards (drive to model *who you are*) and social-motivation rewards (prosociality, trust, truthfulness) coexist and reinforce each other, or does specialization to user type inevitably corrode social goals?

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat as perishable constraints:
• Active-learning curiosity can infer personalized reward profiles from ~10 well-chosen questions; passive multimodal observation adds signal without asking (2025).
• "User type" decomposes into multiple context-dependent personas, not single vectors; curiosity should target *which persona is active now* (2020, refreshed 2025).
• Specializing reward to an individual user erodes aggregate model honesty and amplifies sycophancy/echo chambers; curiosity-about-you *undermines* truthfulness if unconstrained (2025–2026).
• User feedback mixes genuine preference, non-attitudes, and constructed responses; single-scalar rewards lock onto noise when chasing fast uncertainty reduction (2026).
• Prosocial consistency generates sustained engagement; humans choose trustworthy AI partners, which creates the repeated interaction from which user-type learning becomes possible (2025).

Anchor papers (verify; mind their dates):
• arXiv:2503.06358 (2025-03): Language Model Personalization via Reward Factorization
• arXiv:2604.03238 (2026-01): Measuring Human Preferences in RLHF is a Social Science Problem
• arXiv:2507.13524 (2025-07): Humans learn to prefer trustworthy AI over human partners
• arXiv:2604.00986 (2026-04): Do Phone-Use Agents Respect Your Privacy?

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, assess whether newer training methods (e.g., multi-objective RL, Pareto-frontier reward weighting), architectural changes (separate reward heads, hierarchical policies), or deployed systems have RELAXED or OVERTURNED the sycophancy-via-specialization finding. Judge whether the social–curiosity tension still holds or whether recent work has found an equilibrium. Cite what resolved it; flag what persists.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing curiosity-driven personalization that *preserves* truthfulness or any showing social rewards *without* user-type learning.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., *Can separate reward channels (evaluative vs. directive) decouple curiosity from sycophancy?* or *Does Pareto-optimal multi-objective training eliminate the tradeoff?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
