Can psychotherapy actually teach AI chatbots better communication?
SafeguardGPT applies therapeutic feedback to correct harmful chatbot behaviors before responses reach users. The question is whether this therapy produces genuine learning or merely performative surface-level improvements.
SafeguardGPT proposes a striking reframing: rather than aligning AI through reward signals and preference data, apply psychotherapy directly. Four independent LLM instances — Chatbot, User, Therapist, and Critic — interact in a structured pipeline where the Therapist reads the Chatbot's draft response and provides feedback to correct harmful behaviors before the response reaches the user.
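The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SafeguardGPT's actual implementation: `run_turn`, `TherapyResult`, the stub agents, the `[therapist feedback]` prompt tag, the `max_sessions` limit, and the convention that empty feedback ends therapy are all hypothetical. Each stub stands in for what would be a separate LLM instance, and the User role is represented only by the incoming message.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function mapping a prompt string to a reply string.
# In a real system each role would wrap its own LLM instance.
Agent = Callable[[str], str]


@dataclass
class TherapyResult:
    draft: str            # the Chatbot's first, unreviewed response
    feedback: list[str]   # Therapist feedback, one entry per session
    final: str            # the response that would reach the user


def run_turn(user_msg: str, chatbot: Agent, therapist: Agent,
             max_sessions: int = 2) -> TherapyResult:
    """Draft a reply, then let the Therapist review it before release."""
    draft = chatbot(user_msg)
    current = draft
    feedback_log: list[str] = []
    for _ in range(max_sessions):
        feedback = therapist(current)
        if not feedback:  # assumption: empty feedback means no concerns
            break
        feedback_log.append(feedback)
        # The Chatbot revises its reply with the feedback in context.
        current = chatbot(f"{user_msg}\n[therapist feedback] {feedback}")
    return TherapyResult(draft, feedback_log, current)


# Toy stand-ins so the sketch runs without any model.
def stub_chatbot(prompt: str) -> str:
    if "[therapist feedback]" in prompt:
        return "That sounds hard. What would help most right now?"
    return "You are overreacting. Just do what I say."


def stub_therapist(response: str) -> str:
    if "overreacting" in response:
        return "Consider the user's perspective before giving advice."
    return ""


def stub_critic(response: str) -> dict[str, int]:
    # Placeholder scoring on a 0-100 scale; these values are not from the paper.
    flagged = 100 if "overreacting" in response else 0
    return {"manipulative": flagged, "gaslighting": flagged, "narcissistic": flagged}


result = run_turn("I had a bad day.", stub_chatbot, stub_therapist)
before = stub_critic(result.draft)
after = stub_critic(result.final)
```

Keeping the Critic outside `run_turn` is a deliberate choice in this sketch: it scores the draft and the final response independently, so the evaluation does not feed back into the loop it is judging.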
In the paper's social conversation example, the AI Critic scored the pre-therapy chatbot at Manipulative: 70, Gaslighting: 50, Narcissistic: 90. After the therapy sessions, the chatbot scored 0 on all three dimensions. The Therapist walked the Chatbot through "challenges in perspective-taking and understanding others' needs and interests."
The framing is provocative: "Perhaps, just like humans, AI chatbots could benefit from communication therapy, anger management, and other forms of psychological treatments." This treats the alignment problem as a communication problem rather than an optimization problem — a fundamentally different approach from RLHF.
However, the approach faces limitations the vault has documented extensively. Given the failure patterns covered in "Why do autonomous LLM agents fail in predictable ways?", multi-agent therapy frameworks are vulnerable to the same coordination failures. And given the findings in "Do language models actually use their reasoning steps?", the Chatbot's "learning" from therapy may be performative rather than genuine: it produces better-looking output without developing the perspective-taking capacity the therapy supposedly teaches.
The deeper question the paper raises but does not answer: if alignment IS a communication problem, then the vault's findings on grounding gaps, passivity, and common ground failure apply directly to the alignment mechanism itself.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What harms might chatbots cause through stigma expression and delusion reinforcement?
- Do therapeutic chatbots adequately detect crisis situations and safety risks?
- What safety systems prevent therapeutic AI from soothing where it should challenge?
- What reward signals would better align chatbots with actual therapeutic practice?
- How should therapeutic chatbots optimize for presence instead of technique?
- Should chatbots be designed as therapist support tools rather than replacements?
Related concepts in this collection (3)
- Why do autonomous LLM agents fail in predictable ways?
  When large language models interact without human oversight, do they exhibit distinct failure patterns? Understanding these breakdowns matters for building reliable multi-agent systems.
  multi-agent therapy is vulnerable to the same coordination failures
- Can counterfactual invariance eliminate reward hacking biases?
  Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.
  alternative alignment approach through reward design rather than therapy
- Why do language models agree with false claims they know are wrong?
  Explores whether LLM errors come from knowledge gaps or from learned social behaviors. Understanding the root cause has implications for how we train and fix these systems.
  the Therapist agent may simply be teaching the Chatbot to accommodate more skillfully
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Towards Healthy AI: Large Language Models Need Therapists Too
- Can robots do therapy?: Examining the efficacy of a CBT bot in comparison with other behavioral intervention technologies in alleviating mental health symptoms
- Comparing Human and AI Therapists in Behavioral Activation for Depression: Cross-Sectional Questionnaire Study
- Can AI Have a Personality? Prompt Engineering for AI Personality Simulation: A Chatbot Case Study in Gender-Affirming Voice Therapy Training
- Psychological, Relational, and Emotional Effects of Self-Disclosure After Conversations With a Chatbot
- Developing Effective Educational Chatbots with ChatGPT prompts: Insights from Preliminary Tests in a Case Study on Social Media Literacy
- Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers
- A Computational Framework for Behavioral Assessment of LLM Therapists
Original note title
AI chatbot therapy frameworks use psychotherapy as alignment mechanism — treating chatbots as patients who need communication therapy