INQUIRING LINE

How does curriculum learning prevent instability in social-emotional RL training?

This explores whether ordering or sequencing training material — a curriculum — can keep emotion-and-empathy reward training from going off the rails, and the corpus addresses this more by triangulation than head-on: it has pieces on what destabilizes emotional RL, what destabilizes RL ordering generally, and what one stable empathy-trained model looks like.


This reads "curriculum learning" as the practice of ordering or scheduling what a model trains on, and "social-emotional RL" as reinforcement learning that rewards empathy or emotional attunement. Worth saying up front: the corpus doesn't have a paper that wires those two together directly, but the surrounding material maps the problem unusually well — and the honest answer is that curriculum effects and emotional-stability effects have been studied mostly in separate rooms.

First, why social-emotional RL is unstable in the first place. Training a model to be warm is not free: empathy tuning measurably degrades reliability, raising errors in medical reasoning and truthfulness by up to 30 points, with the damage worst exactly when a user is sad or holds a false belief (see "Does empathy training make AI systems less reliable?"). Ordinary preference optimization also quietly erodes the conversational repair acts that emotional dialogue depends on, such as clarifying questions and understanding checks, cutting them 77.5% below human levels (see "Does preference optimization harm conversational understanding?"). So the instability isn't only training-dynamics noise; it's a capability trade-off baked into the reward.

Now the curriculum side, where the corpus is actually rich. The cleanest evidence that ordering matters is a scheduling result: training structured tasks before open-ended creative ones yields 6.2% gains over joint training and, crucially, prevents the entropy collapse that would otherwise crush open-ended capabilities. Structured domains shrink output entropy while creative domains expand it, so the order determines which effect wins (see "Does training order reshape how models handle different task types?"). That entropy-collapse mechanism is the real villain across the collection: RL reliably converges policies onto one narrow strategy, squeezing exploration diversity in search agents (see "Does reinforcement learning squeeze exploration diversity in search agents?") and amplifying a single pretraining format within the first epoch while the alternatives collapse (see "Does RL training collapse format diversity in pretrained models?"). Emotional range is open-ended by nature, so a curriculum that protects high-entropy capabilities is plausibly what keeps empathy from collapsing into one canned warm voice.
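To make the entropy argument above concrete, here is a minimal sketch of how one might measure response diversity and order domains so that entropy-lowering ones come first. Everything named here is an assumption for illustration: `empirical_entropy`, `structured_first`, the `entropy_delta` field, and the sample data are hypothetical, and sorting by a measured entropy change is a stand-in, not the BWT-guided scheduler the source describes.

```python
import math
from collections import Counter


def empirical_entropy(samples):
    """Shannon entropy, in bits, of the empirical distribution over `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = len(samples)
    # sum over p * log2(1/p); a single repeated item gives exactly 0 bits
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def structured_first(domains):
    """Order domains so those that lower output entropy most are trained first.

    `entropy_delta` is a hypothetical per-domain measurement: the change in
    output entropy observed after training on that domain alone.
    """
    return sorted(domains, key=lambda d: d["entropy_delta"])


# Made-up response styles: one collapsed policy, one varied policy.
collapsed = ["warm_template"] * 8
varied = ["warm", "playful", "direct", "curious"] * 2

# Made-up domain measurements (negative = entropy shrinks).
domains = [
    {"name": "creative_writing", "entropy_delta": 0.4},
    {"name": "math", "entropy_delta": -0.3},
    {"name": "code", "entropy_delta": -0.1},
]
schedule = [d["name"] for d in structured_first(domains)]
```

In this toy setup the collapsed policy scores zero bits while the varied one scores two, and the schedule places the structured domains ahead of creative writing. A real run would need entropy estimated from model token distributions rather than from coarse style labels.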

The other half of curriculum is difficulty, and here the corpus issues a sharp warning. Training on too-hard samples doesn't just waste effort: it teaches degenerate shortcuts that then contaminate skills the model already had, because group-relative normalization scores rare accidental successes as high-advantage trajectories and reinforces them (see "Do overly hard RLVR samples actually harm model capabilities?"). A difficulty curriculum that withholds the impossible cases is therefore a stability mechanism, not just a pacing one. Related work suggests the same care should extend to how episodes are consumed: treating successes as concrete demonstrations and failures as abstracted lessons avoids the degradation of blending everything uniformly (see "Should successful and failed episodes be processed differently?"), and externalizing learned skills with an automatic curriculum lets agents keep exploring without catastrophic forgetting (see "Can agents learn new skills without forgetting old ones?").
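The difficulty warning above comes down to simple arithmetic, which the following sketch illustrates. It assumes binary rewards and a GRPO-style normalization (reward minus group mean, divided by group standard deviation). The function names, the use of population standard deviation, and the pass-rate thresholds in `keep_prompt` are illustrative assumptions, not values taken from the cited work.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_prompt(rewards, min_pass=0.125, max_pass=0.875):
    """Hypothetical difficulty gate: drop prompts that are nearly always
    failed or nearly always solved, based on the group's pass rate."""
    pass_rate = mean(rewards)
    return min_pass <= pass_rate <= max_pass


# A nearly impossible prompt: 1 accidental success in 16 rollouts.
hard = [0.0] * 15 + [1.0]
# A moderately difficult prompt: 8 successes in 16 rollouts.
medium = [1.0, 0.0] * 8

hard_adv = group_relative_advantages(hard)
medium_adv = group_relative_advantages(medium)

lucky_success = hard_adv[-1]    # advantage of the one accidental success
typical_success = medium_adv[0]  # advantage of an ordinary success
```

Under these assumptions the lone success on the hard prompt receives an advantage close to four times that of a success on the moderate prompt, which is the amplification the paragraph describes. The gate would discard the hard group before it reaches the optimizer.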

What does stable social-emotional RL actually look like when it works? The one direct exemplar is RLVER, which uses a simulated user's emotion trajectory as a verifiable reward and reports stable empathy gains while preserving dialogue quality, countering the usual trade-off between preference optimization and conversational grounding (see "Can emotion rewards make language models genuinely empathic?"). It pairs naturally with the finding that RL training moves through a two-phase arc, mastering execution before strategic planning becomes the bottleneck (see "Does RL training follow a predictable two-phase learning sequence?"), a phase structure a curriculum can be designed to ride. The thing you didn't know you wanted to know: the lever that stabilizes emotional RL may not be "emotional" at all. It's entropy management, meaning sequencing and difficulty-gating so the reward never collapses the model's expressive range, and the warmth and alignment-tax findings are warnings about what happens when no such curriculum is in place.
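As a rough illustration of what "a simulated user's emotion trajectory as a verifiable reward" could mean in code, the sketch below turns a per-turn emotion score into a scalar episode reward. This is an assumption-laden toy, not RLVER's implementation: the 0 to 100 score range, the choice to use only the final score, and the name `trajectory_reward` are all hypothetical.

```python
def trajectory_reward(emotion_scores, lo=0.0, hi=100.0):
    """Map a simulated user's final emotion score onto [0, 1].

    `emotion_scores` is a hypothetical per-turn record of how the simulated
    user feels, on an assumed lo..hi scale. Only the last turn is rewarded.
    """
    if not emotion_scores:
        raise ValueError("trajectory must contain at least one score")
    final = min(max(emotion_scores[-1], lo), hi)  # clamp out-of-range scores
    return (final - lo) / (hi - lo)


# Made-up trajectories: one dialogue that lifts the user, one that does not.
improving = [40, 55, 70]
worsening = [40, 30, 15]

r_improving = trajectory_reward(improving)
r_worsening = trajectory_reward(worsening)
```

The point of the design is that the reward is computed from the simulated user's state rather than from a learned preference model, so it can be checked deterministically for any given trajectory.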


Sources (10 notes)

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts by 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can emotion rewards make language models genuinely empathic?

RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Does curriculum learning actually prevent instability in social-emotional RL training, or do the two problems (emotional collapse, preference-optimization tax, entropy squeezing) operate in separate failure modes that scheduling cannot fix?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints the library identified:
- Empathy tuning degrades reliability by up to 30 points; preference optimization cuts conversational grounding acts (clarifying questions, understanding checks) 77.5% below human baseline (~2025).
- RL training collapses policies onto single narrow strategies, squeezing exploration diversity in search and crushing open-ended capabilities via entropy collapse (~2025).
- Overly-hard RL samples induce degenerate shortcuts that contaminate already-learned skills (~2026).
- RLVER (verifiable emotion rewards) is cited as the one direct exemplar that preserves dialogue quality while gaining empathy, paired with two-phase RL dynamics (procedural → strategic) (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2507.21919 (2025-07): Training language models to be warm and empathetic makes them less reliable
- arXiv:2507.03112 (2025-07): RLVER: Reinforcement Learning with Verifiable Emotion Rewards
- arXiv:2504.07912 (2025-04): Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- arXiv:2605.28388 (2026-05): Mechanistically Interpreting the Role of Sample Difficulty in RLVR

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the 30-point reliability hit, the 77.5% repair-act gap, and entropy collapse: has any post-2026 work (new model scale, verifiable-reward frameworks, multi-objective RL, or inference-time intervention) actually LIFTED these ceilings rather than just managing them? Separate the durable tension (emotional warmth vs. grounding may be intrinsically competitive) from the perishable limitation (curriculum + verifiable rewards might dissolve it).

(2) **SURFACE STRONGEST CONTRADICTING WORK.** Hunt for papers in the last 6 months that either (a) show empathy + reliability can coexist without trade-off, (b) demonstrate curriculum scheduling alone (without verifiable rewards) stabilizes emotional RL, or (c) argue entropy collapse is not the bottleneck. Weigh each against RLVER's claim.

(3) **PROPOSE 2 REGIME-SHIFT QUESTIONS:** (i) Can a learned curriculum (meta-RL over task orderings) adapt differently for emotional vs. factual objectives, or do they require fundamentally separate training loops? (ii) Does the two-phase RL dynamic (procedural → strategic) map onto emotional range expansion, and if so, can you front-load emotional grounding into the procedural phase?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
