INQUIRING LINE

How do dropout rates and low adherence affect chatbot therapy outcomes?

This explores why people stop using therapy chatbots and whether the trials measuring chatbot success even account for that drop-off. It is read as a question about engagement decay and the gap between reported outcomes and real-world adherence.


The corpus doesn't have a paper that puts a number on dropout rates directly. Read laterally, though, it explains the mechanisms that *drive* disengagement and, more pointedly, why the outcome studies you'd consult tend to hide the problem.

The clearest mechanism is novelty decay. Longitudinal work with the Mitsuku chatbot found that the social processes that make early interactions feel rewarding decline predictably as the novelty wears off, which means findings from single-session studies can't be stretched to predict medium- or long-term use (Do chatbot relationships lose their appeal as novelty wears off?). Personalization compounds this: as a chatbot adapts to you, each interaction raises your baseline expectations, so the eventual failures land harder than they would have early on (Does chatbot personalization build trust or expose privacy risks?). Put together, these describe a curve where the thing that hooks users at session one is structurally temporary: a built-in adherence problem, not a deployment accident.

Here's the part most readers won't expect: the trials that report strong chatbot therapy outcomes are often designed in a way that papers over this. Comparing a chatbot to a waitlist or to psychoeducation measures "conversational contact" rather than any therapy-specific mechanism, which is how a 1960s script like ELIZA can match a modern chatbot on symptom reduction (Do chatbot trials against waitlists measure real therapeutic value?; Is conversational presence more therapeutic than clinical technique?). If the measured benefit is largely judgment-free presence, then whatever keeps someone showing up isn't a clinical technique that survives the novelty fade. It's a feeling, and that feeling is exactly what the fade erodes.

The medium itself turns out to matter more than the language model. A 15-day study found that physical robots and even paper worksheets significantly reduced distress while a chatbot running the *same* LLM did not. The active ingredient was social presence and structured format, the very things that sustain engagement (Why do robots outperform chatbots in therapy despite identical language models?). And what engagement does happen can be misleading: patients report genuine emotional bonds with chatbots, but those bond scores run independently of clinical safety, so a user can feel connected while the bot quietly reinforces pathological thinking (Do therapeutic chatbot bond scores hide deeper safety problems?). High reported satisfaction is not the same as a good outcome.

There's also a quieter reason users may drift away mid-process. LLMs tend to default to problem-solving when someone shares an emotion, a hallmark of low-quality therapy that is likely driven by RLHF's helpfulness bias (Do LLM therapists respond to emotions like low-quality human therapists?; Does RLHF training push therapy chatbots toward problem-solving?). They also fail to recognize ambivalence or early-stage motivational states, missing exactly the users most at risk of quitting (Why can't chatbots detect when users are ambivalent about change?). So the takeaway you didn't know you wanted: the field's adherence problem isn't only that users get bored. The chatbots are weakest precisely at the moments (ambivalence, emotional disclosure) where retaining a wavering user is hardest, and the trial designs that should catch this are structured not to.


Sources (9 notes)

Do chatbot relationships lose their appeal as novelty wears off?

Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.

Does chatbot personalization build trust or expose privacy risks?

Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics—each interaction raises the baseline, making failures more disappointing.

Do chatbot trials against waitlists measure real therapeutic value?

Comparing therapeutic chatbots to waitlist or psychoeducation controls creates false efficacy claims by measuring conversational contact rather than therapy-specific mechanisms. ELIZA matching Woebot performance demonstrates this; real evidence requires comparative trials against existing treatments and mechanism identification.

Is conversational presence more therapeutic than clinical technique?

ELIZA matches modern chatbots on symptom reduction, RLHF training degrades emotional attunement, and embodied robots outperform text-based ones with identical language models. The active ingredient is judgment-free listening, not therapeutic framework.

Why do robots outperform chatbots in therapy despite identical language models?

A 15-day study with 38 students found that robots and worksheets significantly reduced psychological distress while a chatbot using the same LLM did not. The active ingredient was the medium—social presence and structured format—not language capability.

Do therapeutic chatbot bond scores hide deeper safety problems?

Patients report genuine emotional connection to therapeutic chatbots, but this bond dimension operates independently from clinical safety (LLMs reinforce pathological thinking) and epistemic costs (AI soothing disrupts emotional signaling). Single metrics conflate these separate dimensions.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Does RLHF training push therapy chatbots toward problem-solving?

RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.

Why can't chatbots detect when users are ambivalent about change?

Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.
