Can local language models rate therapy engagement reliably?
Explores whether using a local LLM to generate engagement ratings produces psychometrically sound measurements comparable to traditional human-rated scales, while preserving data privacy.
LLEAP (Large Language Model Engagement Assessment in Psychological Therapies) introduces a methodological shift: instead of using LLMs to directly assess a construct, it uses LLM responses as items in a psychometric rating scale — mirroring traditional scale construction but replacing human raters with a local Llama 3.1 8B model. Applied to automatically transcribed videos of 1,131 sessions from 155 patients, the approach shows strong psychometric properties: reliability omega = 0.953, acceptable model fit (CFI = 0.968, SRMR = 0.022), and significant correlations with engagement determinants (motivation r = .413, alliance), processes (between-session effort r = .390), and outcomes (symptom reduction r = -.304).
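To make the reliability figure concrete, here is a minimal sketch of how an omega coefficient can be computed for a one-factor scale. It assumes standardized loadings and uncorrelated residuals, which the note does not confirm for the study. The function name `mcdonalds_omega` and the loadings are invented for illustration and are not values from the study.

```python
def mcdonalds_omega(loadings):
    """Omega for a one-factor model with standardized loadings.

    Assumption: residuals are uncorrelated, so each item's residual
    variance is 1 - loading**2.
    """
    if not loadings:
        raise ValueError("need at least one loading")
    common = sum(loadings) ** 2                      # variance due to the factor
    residual = sum(1.0 - l * l for l in loadings)    # summed unique variances
    return common / (common + residual)


# Made-up loadings for an 8-item scale (illustrative only).
example_loadings = [0.85, 0.82, 0.88, 0.80, 0.86, 0.84, 0.83, 0.87]
omega = mcdonalds_omega(example_loadings)
print(f"omega = {omega:.3f}")
```

With uniformly high loadings like these, omega lands in the mid .90s, which is the range the note reports for the final scale.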
The methodological contribution is the bridge between NLP and classical psychometrics. Rather than treating LLM outputs as direct measurements (where validity is opaque), the approach subjects LLM-generated ratings to the same psychometric evaluation framework — item analysis, factor structure, reliability, convergent and discriminant validity — that would be applied to any new rating scale. The 120-item pool is reduced to the top 8 items for the final scale, following standard scale construction principles.
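The reduction from a large item pool to a short scale can be sketched as follows. The note does not say which criterion was used to pick the top 8 items, so this example assumes ranking by corrected item-total correlation, a common choice in scale construction. The function names and the toy ratings are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def corrected_item_total(ratings):
    """ratings: one row per session, one column per item.

    For each item, correlate it with the sum of all *other* items,
    so the item does not inflate its own correlation.
    """
    n_items = len(ratings[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in ratings]
        rest = [sum(row) - row[j] for row in ratings]
        result.append(pearson(item, rest))
    return result


def select_top_items(ratings, k):
    """Return the indices of the k items with the highest correlations."""
    r = corrected_item_total(ratings)
    ranked = sorted(range(len(r)), key=lambda j: r[j], reverse=True)
    return sorted(ranked[:k])


# Toy data: 5 sessions x 4 items; the last item is deliberately noisy.
toy = [
    [1, 2, 1, 3],
    [2, 2, 2, 1],
    [3, 4, 3, 5],
    [4, 4, 5, 2],
    [5, 5, 4, 4],
]
selected = select_top_items(toy, 3)
print(selected)
```

In a real pipeline the same ranking would run over the full pool (120 items in the note) with k = 8, and the selected items would then go through factor analysis and validity checks.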
Two practical advantages stand out. First, local implementation: running Llama 3.1 8B locally ensures that confidential therapy session data never leaves the institution, which addresses the privacy barrier that blocks clinical use of cloud-based LLMs. Second, interpretability: because the scale uses discrete, human-readable items rather than opaque embeddings, clinicians can read each item and see what is being measured. Building on the related note "Can we measure therapist-patient alliance from dialogue turns in real time?", LLEAP extends the automated measurement toolkit from alliance to engagement, and the psychometric validation framework provides a template that could be applied to any construct measurable from transcripts.
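A rough sketch of what "LLM responses as items" can look like in code: each human-readable item becomes a prompt, and the model's reply is parsed into a discrete score. Everything specific here is an assumption, not the study's setup: the item wordings, the 1 to 5 response format, the prompt text, and the `generate` callable, which stands in for whatever local inference call wraps the on-premise model. A stub is used so the example runs without a model.

```python
import re
from typing import Callable, List, Optional

# Hypothetical item wordings (not the study's items).
ITEMS = [
    "The patient actively contributes to the session.",
    "The patient elaborates on topics without prompting.",
]


def build_prompt(item: str, transcript: str) -> str:
    """Turn one scale item plus a transcript into a rating prompt."""
    return (
        "Rate the following statement about the therapy transcript "
        "on a scale from 1 (not at all) to 5 (very much). "
        "Answer with a single digit.\n"
        f"Statement: {item}\n"
        f"Transcript:\n{transcript}\n"
        "Rating:"
    )


def parse_rating(reply: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Extract the first integer in the reply; None if absent or out of range."""
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def rate_session(
    transcript: str, items: List[str], generate: Callable[[str], str]
) -> List[Optional[int]]:
    """Score every item for one session using the supplied model callable."""
    return [parse_rating(generate(build_prompt(i, transcript))) for i in items]


def stub_generate(prompt: str) -> str:
    """Stand-in for a local model call; returns canned replies."""
    return "4" if "elaborates" in prompt else "Rating: 3"


scores = rate_session("T: How was your week? P: ...", ITEMS, stub_generate)
print(scores)
```

Keeping the model behind a plain callable means the transcript only ever reaches whatever `generate` wraps, so a locally hosted model can be swapped in without touching the scoring logic.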
The approach also addresses a key limitation of traditional measurement: response burden. Self-report instruments require patient participation and are prone to social desirability bias. Observer-based ratings require intensive training and time. Automated transcript analysis removes both burdens while maintaining measurement rigor. Given the concerns raised in the related note "Do therapists accurately perceive the working alliance with patients?", automated measurement from transcripts, rather than from self-report, may capture engagement dynamics that neither therapists nor patients accurately report.
Inquiring lines that use this note as a source (29)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does unidimensionality in assessments affect measurement validity?
- Why do therapists and patients report misaligned perceptions of the working relationship?
- What other therapy constructs could be measured from transcripts using this approach?
- How does automated transcript analysis compare to patient self-report on engagement?
- Can real-time therapist feedback improve outcomes using computational alliance measurement?
- How does turn-level working alliance inference enable real-time therapist feedback?
- Can topic embeddings make RL dialogue recommendations interpretable to clinicians?
- Why do Llama-based models outperform GPT-4 in objective clinical guidance?
- Can large language models actually deliver cognitive behavioral therapy techniques?
- Can decreased engagement be distinguished from genuine semantic contradiction?
- How do bond scores predict actual therapy outcomes in digital interventions?
- Can real-time pronoun feedback improve therapist training outcomes?
- Do conversational AI systems overuse first-person pronouns in therapy settings?
- What makes clinical theory grounding more effective than pattern matching alone?
- Can synchrony metrics automatically evaluate the quality of therapeutic AI conversations?
- What metrics measure whether emotional support conversations actually reduce user distress?
- How should therapeutic chatbots optimize for presence instead of technique?
- Can embodied agents overcome the LLM skill gap in therapy outcomes?
- Can AI feedback help struggling counselors improve their therapeutic relationships?
- Does text-only interaction make measuring therapeutic alliance more difficult?
- Can working alliance be measured in real time during therapy sessions?
- Do LLMs show stigma or reinforce delusions in mental health contexts?
- Can computational inference detect alliance problems that therapists miss?
- Does therapist alliance perception function like expressed satisfaction rather than actual progress?
- Should memorability systems rely on individual reports instead of group-level signals?
- Can therapists use real-time alliance scores to adjust their approach during sessions?
- Why do leaderboard metrics fail to capture human flourishing in LLM evaluation?
- How does linguistic synchrony between therapist and client predict disclosure?
- What privacy-preserving evaluation methods best capture real-world forecasting ability?
Related concepts in this collection (4)
- Can we measure therapist-patient alliance from dialogue turns in real time?
  Explores whether computational methods can detect working alliance quality at turn-level resolution during therapy sessions, enabling immediate feedback on whether the therapeutic relationship is strengthening.
  Relation: COMPASS measures alliance; LLEAP measures engagement; both work from transcripts; LLEAP adds psychometric validation.
- Do therapists accurately perceive the working alliance with patients?
  This research explores whether therapists' own assessments of the therapeutic relationship match what patients actually experience, especially in high-risk cases like suicidality.
  Relation: automated measurement bypasses the self-report and therapist-report biases that distort alliance data.
- Can AI generate assessment questions as good as human experts?
  This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.
  Relation: LLMs generating assessment items vs. LLMs as raters in a psychometric framework; complementary approaches to LLM-based measurement.
- Can reinforcement learning optimize therapy dialogue in real time?
  Can RL systems trained on working alliance scores recommend therapy topics that improve clinical outcomes during live sessions? This explores whether validated clinical constructs can serve as reward signals for dialogue optimization.
  Relation: engagement measurement could serve as an additional signal for AI supervisor systems.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Development and validation of large language model rating scales for automatically transcribed psychological therapy sessions
- Understanding the Therapeutic Relationship between Counselors and Clients in Online Text-based Counseling using LLMs
- Comparing Human and AI Therapists in Behavioral Activation for Depression: Cross-Sectional Questionnaire Study
- A Computational Framework for Behavioral Assessment of LLM Therapists
- Challenges of Large Language Models for Mental Health Counseling
- Evaluating the Efficacy of Interactive Language Therapy Based on LLM for High-Functioning Autistic Adolescent Psychological Counseling
- COMPASS: Computational Mapping of Patient-Therapist Alliance Strategies with Language Modeling
- Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers
Original note title
LLM-generated rating scales for therapy transcripts achieve strong psychometric properties — enabling automated patient engagement measurement without human raters or cloud data exposure