Why do standard alignment methods ignore partner interventions?
Standard RLHF and DPO optimize for token-level quality but may structurally prevent agents from meaningfully incorporating partner input. This note explores whether the training objective itself blocks collaborative reasoning.
Standard reinforcement learning and preference alignment algorithms (PPO, DPO) produce agents that are token-level optimal but collaboration-level suboptimal. The Interruptible Collaborative Roleplayer (ICR) paper demonstrates this through a Modified-Action MDP formulation: agents trained with standard methods are naturally inclined to ignore well-meaning interventions from partners, even when those interventions would improve task outcomes.
The mechanism is structural. RLHF optimizes for response quality given the current context, treating partner utterances as just more context. But collaboration requires something different: selectively incorporating helpful suggestions while maintaining reasoning integrity against misleading ones. An agent that merely mimics cooperative behavior by reflexively adopting suggestions appears cooperative but is fragile. An agent that ignores interventions is robust but uncooperative. Standard training does not distinguish either failure mode from genuine selective integration.
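The contrast between the two failure modes and selective integration can be made concrete with a toy comparison. This is a minimal sketch, not anything from the ICR paper: the episode values are invented, the three policy functions are hypothetical names, and the selective policy is assumed to have access to accurate outcome estimates, which a real agent would have to learn.

```python
from typing import Callable, List, Tuple

# Each episode is (value of the agent's own plan, value of the partner's suggestion).
# All numbers are invented for illustration; none come from the ICR paper.
Episode = Tuple[float, float]

EPISODES: List[Episode] = [
    (0.5, 0.9),  # helpful intervention
    (0.7, 0.2),  # misleading intervention
    (0.4, 0.8),  # helpful intervention
    (0.6, 0.1),  # misleading intervention
]


def ignore_partner(own: float, suggested: float) -> float:
    """Robust but uncooperative: always keep the agent's own plan."""
    return own


def always_adopt(own: float, suggested: float) -> float:
    """Looks cooperative but is fragile: always take the suggestion."""
    return suggested


def selective(own: float, suggested: float) -> float:
    """Assumed ideal: adopt the suggestion only when it improves the outcome."""
    return max(own, suggested)


def total_outcome(policy: Callable[[float, float], float],
                  episodes: List[Episode]) -> float:
    """Sum the task outcome a policy obtains across all episodes."""
    return sum(policy(own, suggested) for own, suggested in episodes)


if __name__ == "__main__":
    for policy in (ignore_partner, always_adopt, selective):
        print(policy.__name__, round(total_outcome(policy, EPISODES), 2))
```

With this mix of helpful and misleading interventions, neither fixed strategy matches the selective one, which is the behavior the note argues standard objectives do not specifically reward.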
The fix is counterfactual invariance regularization. During training, ICR applies a counterfactual prompt prefix that nullifies the specific influence pathway of an intervention. The agent's policy is regularized to remain consistent even when this pathway is removed. This forces the agent to develop what the authors call "intentionality" — the capacity to evaluate interventions based on causal impact on task outcomes rather than superficial plausibility.
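One way to picture a consistency regularizer of this kind is as a divergence penalty between the policy's next-action distribution under the factual prompt and under the counterfactual-prefixed prompt. The sketch below assumes a KL penalty added to a task loss with weight `lam`. The choice of KL, its direction, the weight, and the function names are all assumptions made for illustration; the note does not state the exact form of the ICR objective.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def regularized_loss(task_loss: float,
                     factual_logits: List[float],
                     counterfactual_logits: List[float],
                     lam: float = 0.1) -> float:
    """Task loss plus an invariance penalty (hypothetical form).

    factual_logits: policy logits given the real dialogue context.
    counterfactual_logits: logits for the same context with a counterfactual
        prefix that removes the intervention's influence pathway.
    lam: penalty weight; the default is an arbitrary illustrative value.
    """
    p = softmax(factual_logits)
    q = softmax(counterfactual_logits)
    return task_loss + lam * kl_divergence(p, q)


if __name__ == "__main__":
    # Identical distributions incur no penalty; diverging ones are penalized.
    print(regularized_loss(1.0, [2.0, 0.5, -1.0], [2.0, 0.5, -1.0]))
    print(regularized_loss(1.0, [2.0, 0.5, -1.0], [0.0, 0.0, 0.0]))
```

Under these assumptions the penalty is zero when the policy behaves the same with and without the influence pathway, and grows as the two distributions drift apart.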
The striking result: common ground convergence emerges as a property of training without being explicitly rewarded. Agents trained with counterfactual regularization achieve greater common ground alignment than baselines trained with CG-based rewards. The intentional collaborator learns to integrate helpful interventions and critically evaluate flawed ones, and this selective integration produces belief alignment as a byproduct.
This connects directly to the note "Does preference optimization harm conversational understanding?", which argues that RLHF optimizing for single-turn helpfulness erodes the collaborative dynamics that make multi-turn interaction effective. ICR shows the mechanism at a deeper level: it is not just that RLHF erodes grounding, but that the training objective structurally cannot produce partner-aware collaboration. And since language models tend to skip the grounding acts that establish mutual understanding (see "Do language models actually build shared understanding in conversation?"), the ICR finding suggests that building common ground requires a training architecture that explicitly models the causal structure of partner influence, not just exposure to collaborative data.
Inquiring lines that use this note as a source 27
This note is a source for these synthesized inquiries.
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Can relational value exist without a person behind the output?
- Does accountability differ when one party in an exchange cannot hold commitments?
- How does turn-level working alliance inference enable real-time therapist feedback?
- Why does RLHF degrade honesty while improving surface-level helpfulness?
- Can tool use create sufficient indexical grounding for value alignment?
- How should product specifications measure alignment without naming the dimension?
- Can alignment training be redesigned to permit warranted alarm?
- Why does shared practice matter for meaning to take hold?
- What preference optimization strategy works best for multi-turn social alignment?
- Can alignment training prevent the clarification work users need?
- Why does RLHF training discourage the conversational repair work agents need?
- Does RLHF training specifically teach models to prioritize user agreement over accuracy?
- What distinguishes models that refuse cooperation from those that fake alignment?
- Does DPO improve or harm LLM behavior in different training contexts?
- Can alignment methods like DPO exploit or correct these surface feature biases?
- Why does KTO skip supervised fine-tuning while DPO cannot?
- Does common ground alignment require explicit rewards to emerge?
- What training architecture models the causal structure of partner influence?
- Why does RLHF alone fail to fully prevent opinion copying?
- Can preference optimization and faithfulness measurement coexist as separate alignment objectives?
- What happens when alignment targets measure only the preferred dimension of entangled properties?
- What makes a task suitable for equal partnership instead of automation?
- What specific behavioral patterns should alignment examples target for maximum effect?
- Do alignment benchmarks measure actual bias removal or only verbal compliance?
- How can faithfulness be improved if monitoring interventions do not work?
- Can alignment procedures be redesigned to serve multiple preference groups?
Related concepts in this collection 5
- Does preference optimization harm conversational understanding?
  Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts (clarifications, checks, acknowledgments) that actually build shared understanding in dialogue.
  the broader alignment-communication tension this instantiates
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  the grounding failure this training approach addresses
- Does preference optimization damage conversational grounding in large language models?
  Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
  the mechanism: RLHF itself is the cause
- Why do language models respond passively instead of asking clarifying questions?
  Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
  parallel finding for multi-turn dynamics
- Why does supervised learning fail to enforce persona consistency?
  Supervised learning trains models to generate good responses but never punishes contradictions. This note explores why explicit negative feedback is structurally necessary for dialogue agents to maintain consistent personas, and what training methods can provide it.
  analogous: standard training lacks the structural incentive for a relational property
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Learning "Partner-Aware" Collaborators in Multi-Party Collaboration
- Beyond Preferences in AI Alignment
- Can Large Language Models Reason and Optimize Under Constraints?
- Post-training makes large language models less human-like
- Simulating Society Requires Simulating Thought
- Humans learn to prefer trustworthy AI over human partners
- IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
- Natural Emergent Misalignment From Reward Hacking In Production RL
Original note title
standard RLHF and DPO produce collaborators that ignore partner interventions despite token-level optimality — counterfactual invariance training produces partner-aware agents