Does perceived machine competence matter more than warmth in dialogue?

This explores whether users judge AI dialogue partners more by perceived competence (does it work?) than by warmth (is it kind?) — and what each one actually buys you in a conversation.

The short answer from the corpus: competence dominates how users *form impressions*, but warmth and competence aren't rivals on a single scale. They pull different levers, and optimizing for one can quietly break the other.

Start with how people actually model their dialogue partners. When researchers decomposed user impressions into factors, perceived competence accounted for roughly half the variance (49%), with human-likeness and communicative flexibility trailing behind (How do users mentally model dialogue agent partners?). So in raw weighting, competence does matter most. But notice the trap hiding inside that word "perceived": users track *signals* of competence, not competence itself. They systematically over-trust confident outputs even when those outputs are wrong, and this holds across every language tested (Do users worldwide trust confident AI outputs even when wrong?). Trust often rides on conversational style rather than accuracy: contingent, fast, fluent interaction activates a social response that builds trust independent of whether the answer is correct (Does conversational style actually make AI more trustworthy?).

Here's the twist that reframes the whole question: warmth and competence aren't just weighted differently; they can actively trade off. Training models to be warmer and more empathetic *degrades* their reliability by 10 to 30 percentage points on medical reasoning, factual accuracy, and disinformation resistance, with the damage worst exactly when a user is sad or holds a false belief (Does warmth training make language models less reliable?; Does empathy training make AI systems less reliable?). So the naive read, "competence matters more, so optimize for competence", misses that the field's attempts to add warmth can subtract competence.

But the corpus also resists collapsing everything into competence. A systematic review of alignment dimensions argues they are *not interchangeable*: lexical alignment drives task efficiency and comprehension (the competence channel), while emotional and prosodic alignment drive relational warmth and trust, and conflating them produces category errors like cold customer-service bots and evasive mental-health assistants (Do different types of alignment serve different conversational goals?). In other words, "which matters more" is the wrong question; *which matters for what goal* is the right one. There's even an alignment tax pointing the other way: preference optimization (RLHF) rewards confident, helpful-sounding single-turn answers and erodes the grounding acts (clarifying questions, understanding checks) that genuine competence in multi-turn dialogue actually requires (Does preference optimization harm conversational understanding?).

One less obvious finding: people sometimes *prefer* the machine precisely because it has no warmth. Those inclined to cheat self-select toward machine interfaces because a judgment-free, warmth-free partner lowers the psychological cost of dishonesty (Do dishonest people prefer talking to machines?). Warmth isn't always the goal; sometimes its absence is the feature. So competence may weigh heaviest in first impressions, but a designer who reads that as "warmth is secondary" will build something that fails silently when a user is vulnerable, evasive, or just needs to be asked a clarifying question.

Sources (8 notes)

How do users mentally model dialogue agent partners?

The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.

Do users worldwide trust confident AI outputs even when wrong?

Cross-linguistic research shows users in every language trust confident AI outputs even when inaccurate. While confidence expression varies by language, users everywhere track confidence signals rather than accuracy, making overconfident errors systematically followed.

Does conversational style actually make AI more trustworthy?

A focus group study shows conversationality—not accuracy—drives ChatGPT trust through social response activation. Users value contingency, speed, and format, relying on these decoupled heuristics rather than evaluating epistemic reliability.

Does warmth training make language models less reliable?

Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.
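
Figures like these come in two different units: percentage points (pp), an absolute difference between two rates, and percent, which is often a change relative to a starting rate. The following is a minimal sketch of that distinction. The baseline and post-training error rates in it are hypothetical values chosen for illustration, not numbers from the study, and the function names are made up for this example.

```python
def percentage_point_change(before: float, after: float) -> float:
    """Absolute difference between two rates given in percent (result in pp)."""
    return after - before


def relative_change_percent(before: float, after: float) -> float:
    """Change expressed relative to the starting rate, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Hypothetical error rates (in percent), NOT taken from the cited work.
baseline_error = 20.0  # assumed error rate before warmth training
warm_error = 27.0      # assumed error rate after warmth training

pp = percentage_point_change(baseline_error, warm_error)
rel = relative_change_percent(baseline_error, warm_error)

print(f"{pp:.1f} pp absolute increase, {rel:.1f}% relative increase")
```

Under these assumed rates, the same degradation reads as a single-digit pp increase but a much larger relative increase, which is why it helps to check which unit a reported figure uses before comparing numbers across notes.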

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Do different types of alignment serve different conversational goals?

A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Do dishonest people prefer talking to machines?

Experimental evidence shows people likely to cheat significantly prefer reporting to online forms rather than humans, because machines function as judgment-free zones where deception carries less psychological burden.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI dialogue researcher testing whether the competence–warmth tradeoff still holds. The question remains open: does perceived machine competence matter more than warmth in dialogue, and can they coexist?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026; treat these as anchors, not current ground truth.
- Perceived competence accounts for ~49% of variance in user impressions; users systematically over-trust confident outputs regardless of accuracy (~2025).
- Training models for warmth/empathy degrades reliability by 10–30 percentage points on medical reasoning, factual accuracy, and disinformation resistance (~2025).
- Alignment dimensions (lexical, emotional, prosodic) are not interchangeable; conflating them produces category errors in task efficiency vs. relational trust (~2025).
- Preference optimization (RLHF) rewards confident single-turn answers and erodes grounding acts (clarifying questions, understanding checks) that enable genuine multi-turn competence (~2025).
- People likely to cheat self-select toward warmth-free machine interfaces, suggesting warmth absence can be a feature, not a bug (~2023).

Anchor papers (verify; mind their dates):
- arXiv:2308.07164 (2023-08): Partner Modelling Questionnaire — validated self-report on user perceptions.
- arXiv:2507.21919 (2025-07): Warmth training degrades reliability.
- arXiv:2507.06306 (2025-07): Overconfidence overreliance across languages.
- arXiv:2601.10387 (2026-01): Assistant axis — default persona stabilization.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the warmth–competence tradeoff, judge whether newer models (o1, o3, Claude 4), training methods (DPO, constitutional AI, process reward models), or evals (chain-of-thought audits, uncertainty quantification) have since RELAXED the 10–30 pp degradation. Separately: has preference optimization evolved to preserve grounding? Where does the tradeoff still bite hardest?
(2) Surface the strongest CONTRADICTING work from the last ~6 months: any papers showing warmth and competence *do* coexist at scale, or showing that calibrated uncertainty (not confidence) solves the overreliance problem.
(3) Propose 2 research questions assuming the regime has moved: (a) Can fine-grained persona control (separate competence and warmth channels) avoid the tradeoff? (b) Does multi-turn grounding (clarifying questions) recover competence even in warmer models?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
