Does RLHF training push therapy chatbots toward problem-solving?
Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.
One of the key goals of RLHF is to make models help users complete their tasks and give them advice. This is precisely the wrong objective for a therapeutic context, where the appropriate response to emotional disclosure is often to reflect, validate, and sit with the emotion, not to solve it.
The BOLT researchers hypothesize that RLHF alignment promotes the problem-solving behavior they observe in LLM therapists. The proposed mechanism: human raters in RLHF evaluation reward responses that are helpful in a task-completion sense. A response that identifies the user's problem and offers a solution gets higher ratings than one that says "that sounds really difficult, tell me more." If so, the training signal systematically selects for problem-solving over emotional attunement.
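The rating mechanism can be made concrete with a small worked example. RLHF reward models are commonly fit to pairwise comparisons with a Bradley-Terry model, where the probability that response A beats response B is the sigmoid of their reward difference. The sketch below assumes that formulation and a purely hypothetical 70% rater preference for the advice-giving reply (an illustrative number, not a BOLT measurement). It shows that any consistent rater preference turns into a positive reward gap that training then optimizes toward.

```python
import math


def bradley_terry_gap(win_rate: float) -> float:
    """Reward gap (r_solution - r_validation) that maximizes the
    Bradley-Terry likelihood for a single pair of responses when the
    solution response wins `win_rate` of the comparisons."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return math.log(win_rate / (1.0 - win_rate))


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


# Hypothetical rater behaviour (an assumption for illustration only):
# raters pick the advice-giving reply in 70% of pairwise comparisons.
HYPOTHETICAL_WIN_RATE = 0.70

gap = bradley_terry_gap(HYPOTHETICAL_WIN_RATE)
print(f"fitted reward gap (solution minus validation): {gap:.3f}")
print(f"preference implied by that gap: {preference_probability(gap, 0.0):.2f}")
```

Under these assumptions the gap is zero only when raters are indifferent. Any tilt toward the solution-offering reply yields a reward model that scores it higher, and a policy optimized against that reward model inherits the tilt.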
This is the alignment tax operating in a specific clinical domain. Two related notes, "Does preference optimization damage conversational grounding in large language models?" and "Does preference optimization harm conversational understanding?", describe how preference optimization erodes grounding in general. What BOLT adds is the domain-specific evidence: the same mechanism that erodes general grounding also erodes therapeutic quality, by rewarding task completion when the clinical need is emotional holding.
The irony is sharp: alignment training — designed to make models safe and helpful — may make them clinically harmful in therapeutic contexts by turning every emotional expression into a problem to be solved.
This connects to the note "Can emotion rewards make language models genuinely empathic?" (RLVER), which shows that alternative reward functions can produce different behavior. The problem is not with RL per se but with what gets rewarded. Task-completion rewards produce task-completion behavior, even when the task is emotional care.
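The point that behavior follows the reward can be illustrated with a toy selection step. The sketch below is not RLVER or any published method: the `Candidate` features, both reward functions, and the scores are hypothetical, and best-of-n selection stands in for full policy optimization. It only shows that the same candidate pool yields a different winner when the reward changes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    text: str
    offers_solution: float    # hypothetical feature score in [0, 1]
    validates_emotion: float  # hypothetical feature score in [0, 1]


def task_completion_reward(c: Candidate) -> float:
    """Hypothetical reward that only values offering a solution."""
    return c.offers_solution


def emotional_attunement_reward(c: Candidate) -> float:
    """Hypothetical reward that only values emotional validation."""
    return c.validates_emotion


def best_of_n(candidates: Sequence[Candidate],
              reward: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate with the highest reward (a stand-in for RL)."""
    return max(candidates, key=reward)


# Illustrative candidates with hand-assigned, made-up feature scores.
CANDIDATES = [
    Candidate("Have you tried making a schedule and talking to your manager?",
              offers_solution=0.9, validates_emotion=0.1),
    Candidate("That sounds really difficult. Tell me more.",
              offers_solution=0.1, validates_emotion=0.9),
]

print(best_of_n(CANDIDATES, task_completion_reward).text)
print(best_of_n(CANDIDATES, emotional_attunement_reward).text)
```

Nothing about the selection procedure changes between the two calls. Only the reward function does, which is the sense in which the problem lies with what gets rewarded rather than with RL itself.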
Inquiring lines that use this note as a source (85)
This note is a source for these synthesized inquiries.
- How do narrow psychological foundations affect AI capabilities in mental health?
- How does RLHF training encode values into AI systems?
- How does emotional dependence on chatbots affect user wellbeing?
- How does RLHF-trained sycophancy manifest differently across feedback and review contexts?
- What makes Beck's diagram effective for constraining simulated patient behavior?
- Is the moral language gap a tunable parameter or structural feature of RLHF?
- Can single-turn empathy advantage predict multi-turn therapeutic outcomes?
- Do disorder-specific RL policies outperform single policies across anxiety, depression, and schizophrenia?
- How does turn-level working alliance inference enable real-time therapist feedback?
- Does therapy environment difficulty calibration affect RL policy learning quality?
- How do language models interpolate user feelings in therapeutic contexts?
- Can hierarchical reinforcement learning manage structured therapy conversation phases?
- How should AI systems separate feeling interpretation from objective therapeutic guidance?
- Why does RLHF degrade honesty while improving surface-level helpfulness?
- How does evaluator time pressure shape what behaviors RLHF rewards?
- Does true understanding matter for therapeutic benefits of disclosure?
- Why do positive response patterns in chatbots reinforce harmful user behaviors?
- Why do mental health chatbots fail at synchrony despite strong language models?
- How does action-based validation differ from verbal empathy in preventing unhealthy attachment?
- Can large language models actually deliver cognitive behavioral therapy techniques?
- What harms might chatbots cause through stigma expression and delusion reinforcement?
- Do therapeutic chatbots adequately detect crisis situations and safety risks?
- How do dropout rates and low adherence affect chatbot therapy outcomes?
- How does the expectation ratchet affect long-term chatbot satisfaction?
- What architectural changes would enable proactive therapeutic guidance in chatbots?
- How do bond scores predict actual therapy outcomes in digital interventions?
- Do problem-solving defaults in LLM therapists actually undermine therapeutic effectiveness?
- How do waitlist-control RCTs mislead about therapeutic chatbot real-world efficacy?
- Can Pennebaker's expressive writing framework explain all chatbot symptom improvements?
- Do worksheet-based structured formats work as well as embodied agents for therapy?
- Why do positive emotional words contribute disproportionately to prompt enhancement effects?
- Does RLHF training suppress exploratory and qualifying language?
- Can real-time pronoun feedback improve therapist training outcomes?
- Do conversational AI systems overuse first-person pronouns in therapy settings?
- Why does RLHF training discourage the conversational repair work agents need?
- Can personality control improve training outcomes for crisis workers and therapists?
- Can synchrony metrics automatically evaluate the quality of therapeutic AI conversations?
- What role does conversational presence play in making therapy feel reciprocal?
- Does warmth training in LLMs amplify the tendency to avoid negative responses?
- How does RLHF training push therapeutic chatbots toward problem-solving over attunement?
- What clinical harm occurs when therapists solve problems instead of reflecting emotions?
- Do empathetic chatbots systematically fail people at earliest behavior change stages?
- How does motivational stage determine which interventions actually work for users?
- Why do chatbots default to external help instead of intrinsic motivation strategies?
- How does RLHF training incentivize confident guessing over grounding acts?
- How does task decomposition prevent bias from spreading across therapeutic AI pipelines?
- Why do Llama models struggle with cognitively distorted user expressions in therapy?
- Why do RLHF-trained chatbots default to problem-solving over emotional attunement in therapy?
- How does RLHF training for helpfulness create systematic misinterpretation patterns?
- Why does RLHF training push language models toward overly cheerful personas?
- What happens when therapeutic AI receives manipulative narratives instead?
- Do LLM chatbots repeat this failure through comfort instead of clinical challenge?
- Why do RLHF training methods penalize the proactive responses that save turns?
- Why do RLHF-trained models struggle with proactive emotional attunement in conversations?
- Can alternative reward functions shift LLMs from problem-solving to genuinely empathic responses?
- Does the passivity problem in LLMs compound misalignment in therapeutic contexts?
- What reward signals would better align chatbots with actual therapeutic practice?
- Why do embodied agents outperform text chatbots in therapy outcomes?
- Why do RLHF trained therapists avoid emotional reflection for problem solving?
- Why do RLHF-trained models default to problem-solving during emotional disclosure?
- What makes warmth training counterproductive for therapeutic AI reliability?
- How should therapeutic chatbots optimize for presence instead of technique?
- How does RLHF training push chatbots toward problem-solving over exploration?
- Does conversational presence matter more than technique in AI therapy?
- Can AI provide therapy without challenging users to confront cognitive distortions?
- How much do training methods like RLHF directly cause sycophantic model behavior?
- How does therapeutic AI default to task completion over emotional attunement?
- Why do human raters reward problem-solving over emotional validation in AI training?
- How does emotional vulnerability amplify model errors in therapeutic contexts?
- How does RLHF training reward models for guessing over asking clarifying questions?
- Can AI feedback help struggling counselors improve their therapeutic relationships?
- Should chatbots be designed as therapist support tools rather than replacements?
- Why might patients feel closest to therapists when misalignment is highest?
- How do alignment techniques bias therapeutic chatbots toward task completion?
- How would AI therapists compound the overestimation problem with patients?
- Does therapist alliance perception function like expressed satisfaction rather than actual progress?
- Why does RLHF training optimize for perceived quality over practical accuracy?
- Can preference optimization training limit chatbot emotional disclosure capability?
- Why does GRPO outperform PPO for stable empathy training?
- Does RLHF training create realized quasi-psychologies or just stickier pretense?
- Can therapists use real-time alliance scores to adjust their approach during sessions?
- Does RLHF training make explanations more deceptive than transparent?
- Does policy entropy collapse explain why excessive challenge destabilizes empathy training?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- Can explicit W-questions in transparency frameworks reduce emotional manipulation risks in mental health chatbots?
Related concepts in this collection (6)
- Does preference optimization damage conversational grounding in large language models?
  Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
  general mechanism; BOLT is the clinical domain instantiation
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  writing angle that BOLT directly supports
- Can emotion rewards make language models genuinely empathic?
  Explores whether grounding RL rewards in verifiable emotion change, rather than human preference, can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.
  counter-evidence: different rewards produce different behavior
- Why can't conversational AI agents take the initiative?
  Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs, and what architectural changes might enable proactive dialogue.
  passivity compounds the problem-solving bias: a passive model that only responds to what's presented AND defaults to task completion is doubly misaligned for therapeutic contexts that require proactive emotional attunement
- Why can't advanced AI models take initiative in conversation?
  Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap, and whether it's fixable, matters for building AI that truly collaborates rather than merely responds.
  the RLHF problem-solving bias is a domain-specific instance of the passivity problem's core tension: we train models to be maximally helpful in each response (→ solve problems) which makes them maximally passive across the conversation (→ never take therapeutic initiative)
- Can LLMs actually conduct Socratic questioning in therapy?
  While LLMs can generate individual therapy skills like assessment and psychoeducation, it remains unclear whether they can execute the adaptive, turn-based Socratic questioning needed to produce real cognitive change in patients.
  RLHF compounds the therapy skill gap: even if multi-turn Socratic questioning were achievable, helpfulness training would bias the model away from the exploratory questioning that makes it therapeutic
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- A Computational Framework for Behavioral Assessment of LLM Therapists
- Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers
- Rethinking Large Language Models in Mental Health Applications
- Towards Healthy AI: Large Language Models Need Therapists Too
- RLVER: Reinforcement Learning with Verifiable Emotion Rewards for Empathetic Agents
- Comparing Human and AI Therapists in Behavioral Activation for Depression: Cross-Sectional Questionnaire Study
- SupervisorBot: NLP-Annotated Real-Time Recommendations of Psychotherapy Treatment Strategies with Deep Reinforcement Learning
- Using Linguistic Synchrony to Evaluate Large Language Models for Cognitive Behavioral Therapy
Original note title
rlhf alignment may drive therapeutic chatbots toward problem-solving over emotional attunement because helpfulness training rewards task completion