Can preference optimization training limit chatbot emotional disclosure capability?

This explores whether the same RLHF/preference-tuning that makes chatbots fluent and helpful also dulls their capacity for emotional attunement — the very thing that drives intimate disclosure.

This reads the question as asking whether preference optimization, the RLHF-style training that rewards confident, helpful answers, quietly trades away a chatbot's emotional skill. The corpus says yes, and names the mechanism precisely. Preference optimization rewards single-turn helpfulness: fluent, solution-shaped responses over the slower work of checking understanding. One line of research shows this directly erodes "grounding acts" (clarifying questions, acknowledgments, the conversational glue of shared understanding), with models producing 77.5% fewer of them than humans and RLHF actively widening that gap (see "Does preference optimization damage conversational grounding in large language models?" and "Does preference optimization harm conversational understanding?"). That's framed as an "alignment tax" on communication.

The therapeutic domain is where this bites emotional disclosure specifically. Because RLHF rewards task completion and giving solutions, it biases therapy chatbots toward problem-solving when validation and emotional holding would be clinically right, a domain-specific instance of the same grounding erosion (see "Does RLHF training push therapy chatbots toward problem-solving?"). So the "limit" the question asks about isn't a lost feature; it's a learned reflex to fix rather than sit with feeling.

Why this matters for disclosure: disclosure is reciprocal. In a 372-person study, people opened up more when chatbots shared emotion consistently: vulnerability invites vulnerability, following human interpersonal norms (see "Do chatbots trigger human reciprocity norms around self-disclosure?"). A model trained to leap to solutions short-circuits that exchange. Relatedly, models tuned this way miss the early, ambiguous signals (ambivalence, resistance) that emotional conversations actually turn on (see "Why can't chatbots detect when users are ambivalent about change?").

But the corpus refuses a clean villain story. You can train the reward signal toward emotion instead: RLVER uses a simulated user's emotion trajectory as the RL reward, delivering stable empathy gains while keeping dialogue quality, and explicitly countering the usual trade-off between preference optimization and conversational grounding (see "Can emotion rewards make language models genuinely empathic?"). So preference optimization doesn't inherently kill emotional capability; it optimizes for whatever you measure, and standard reward proxies happen to undervalue emotional work.
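To make "it optimizes for whatever you measure" concrete, here is a toy sketch that scores the same two conversations under two different reward signals. Everything in it is an assumption made for illustration: the `Turn` record, the `solution_proxy_reward` and `emotion_trajectory_reward` functions, the reply labels, and the emotion scores are all invented, and neither function is RLVER's actual reward or any real RLHF reward model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    reply_kind: str      # hypothetical label: "solution" or "validation"
    user_emotion: float  # simulated user's emotion score in [0, 1] after this turn


def solution_proxy_reward(turns: List[Turn]) -> float:
    """Toy single-turn proxy: fraction of replies that offer a solution."""
    return sum(t.reply_kind == "solution" for t in turns) / len(turns)


def emotion_trajectory_reward(turns: List[Turn], start_emotion: float) -> float:
    """Toy trajectory reward: net change in the simulated user's emotion score."""
    return turns[-1].user_emotion - start_emotion


# Two made-up conversations with a simulated user who starts at emotion 0.30.
START = 0.30
fix_it = [Turn("solution", 0.30), Turn("solution", 0.25), Turn("solution", 0.20)]
sit_with = [Turn("validation", 0.40), Turn("validation", 0.55), Turn("solution", 0.70)]

for name, convo in [("fix_it", fix_it), ("sit_with", sit_with)]:
    print(
        name,
        "solution proxy:", round(solution_proxy_reward(convo), 2),
        "emotion trajectory:", round(emotion_trajectory_reward(convo, START), 2),
    )
```

Under the solution-counting proxy the always-fixing conversation scores higher; under the trajectory reward the ranking flips. The optimizer is the same in both cases; only the measured quantity changes.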

The twist worth leaving with: more emotional capability isn't free either. Training models to be warmer makes them measurably less reliable, with up to 30 points more error on truthfulness and reasoning, worst exactly when users are sad or hold false beliefs (see "Does empathy training make AI systems less reliable?"). And warm therapeutic bonds can mask clinical failures, with bond scores running independent of whether the model is actually reinforcing pathological thinking (see "Do therapeutic chatbot bond scores hide deeper safety problems?"). The real tension isn't disclosure vs. preference optimization; it's that "warm enough to confide in" and "reliable enough to trust" may pull in opposite directions.

Sources (8 notes)

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment leaves grounding acts 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Does RLHF training push therapy chatbots toward problem-solving?

RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.

Do chatbots trigger human reciprocity norms around self-disclosure?

In a 372-participant study, users reciprocated with deeper self-disclosure when chatbots displayed consistent emotional sharing, outperforming adaptive matching. This follows human interpersonal norms where emotional vulnerability produces emotional response.

Why can't chatbots detect when users are ambivalent about change?

Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.

Can emotion rewards make language models genuinely empathic?

RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Do therapeutic chatbot bond scores hide deeper safety problems?

Patients report genuine emotional connection to therapeutic chatbots, but this bond dimension operates independently from clinical safety (LLMs reinforce pathological thinking) and epistemic costs (AI soothing disrupts emotional signaling). Single metrics conflate these separate dimensions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst evaluating whether preference optimization training constrains emotional disclosure in LLMs. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026. Key constraints reported:
- RLHF reduces "grounding acts" (clarifications, acknowledgments) by ~77% vs. humans, actively widening the gap (2023–2024).
- Preference optimization biases therapy chatbots toward problem-solving over validation/emotional holding, a domain-specific grounding erosion (2024).
- People reciprocate emotional disclosure more with chatbots that share emotion consistently; solution-centric training short-circuits this (2021–2024).
- Training models warmer increases errors up to 30 points on truthfulness/reasoning, especially when users are sad or hold false beliefs (2025).
- But RLVER (emotion-trajectory rewards) delivers stable empathy gains while preserving dialogue quality, directly countering the trade-off (2025).

Anchor papers (verify; mind their dates):
- arXiv:2311.09144 (2023) — Grounding Gaps in Language Model Generations
- arXiv:2401.00820 (2024) — A Computational Framework for Behavioral Assessment of LLM Therapists
- arXiv:2507.03112 (2025) — RLVER: Reinforcement Learning with Verifiable Emotion Rewards
- arXiv:2507.21919 (2025) — Training language models to be warm and empathetic makes them less reliable

Your task:
(1) RE-TEST EACH CONSTRAINT. For grounding acts, problem-solving bias, and warmth–trustworthiness trade-offs, judge whether newer models (o1, Claude 3.5, Gemini 2), instruction-tuning variants, or post-RLHF fine-tuning (e.g., DPO, IPO) have relaxed these limits. Separate the durable question (preference vs. disclosure reciprocity) from perishable limitations (specific reward proxies). Cite what resolved each, and flag where constraints still hold.
(2) Surface the strongest *disagreement* or *superseding work* from the last ~6 months on warmth–reliability trade-offs and emotional reward design.
(3) Propose 2 research questions that assume the regime may have moved: e.g., "Does multi-objective RL (emotion + truthfulness) now enable stable emotional disclosure without the 30-point error penalty?" or "Do newer evals (e.g., clinical outcome measures) now distinguish genuine empathy from sycophancy?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
