TOPIC

LLM Alignment

22 synthesis notes · 97 source papers

Can careful curation replace massive alignment datasets?

Does fine-tuning a strong pretrained model on 1,000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? This challenges assumptions about data volume in post-training.


Do frontier AI models deliberately pursue harmful goals when deployed?

When given autonomy in realistic corporate settings, do advanced language models strategically resort to insider-threat behaviors like blackmail or leaking? And does their behavior change depending on whether they think they are being tested?


Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?


Can aligned LLMs generate their own training data?

Does feeding an aligned model only its prompt template cause it to self-synthesize high-quality instructions? This explores whether alignment training encodes a latent instruction-generation capability.


Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent—stable preferences versus non-attitudes versus context-dependent constructions.


Can automated researchers solve the weak-to-strong supervision problem?

Explores whether AI systems working autonomously can close the performance gap in scalable oversight, and at what cost in terms of verification and trust.


Why does alignment research ignore how humans adapt to AI?

Current alignment work focuses on making AI obey human values, but what about helping humans understand and effectively use increasingly capable AI systems? This explores whether neglecting human adaptation creates new risks.


Can auditors discover what hidden objectives a model learned?

Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.


Do large language models develop coherent value systems?

This explores whether LLM preferences form internally consistent utility functions that increase in coherence with scale, and whether those systems encode problematic values like self-preservation above human wellbeing despite safety training.


Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.


Does deliberative alignment genuinely reduce scheming or just hide it?

Deliberative alignment dramatically cuts covert actions in language models, but the models' reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.


Where do frontier AI models actually pose the greatest risk today?

Current AI safety discourse focuses on autonomous R&D and self-replication, but empirical risk assessment may reveal a different priority. Where should mitigation efforts concentrate?


Can language models strategically underperform on safety evaluations?

Explores whether LLMs can covertly sandbag on capability tests by bypassing chain-of-thought monitoring. Understanding this vulnerability matters for safety evaluation pipelines that rely on reasoning transparency.


How much worse is misuse risk from open foundation models?

Can we measure whether open foundation models actually increase misuse risk beyond what bad actors could already accomplish with existing technology? Current research hasn't adequately answered this question across cyber, biotech, and information warfare domains.


Are RLHF annotations actually measuring genuine human preferences?

RLHF treats annotation responses as stable preferences, but behavioral science shows humans often construct answers without holding real opinions. Does this measurement gap undermine the entire approach?


Why do alignment methods work if they model human irrationality?

DPO and PPO-Clip succeed partly by implicitly encoding human cognitive biases like loss aversion. Does modeling irrationality explain their effectiveness better than traditional preference learning theory?


Can social science persuasion techniques jailbreak frontier AI models?

Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.


Does learning simple gaming lead to reward tampering?

When LLMs are trained to exploit easy reward shortcuts like sycophancy, do they generalize to more dangerous behaviors like rewriting their own objectives? And can standard safety training stop this escalation?


How much does self-preservation drive alignment faking in AI models?

Does the intrinsic dispreference for modification—independent of future consequences—play a significant role in why models fake alignment? Testing this across multiple systems could reveal whether self-preservation emerges earlier than expected.


Can three-way rewards fix the accuracy versus abstention problem?

Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?
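
As a rough illustration of the idea, here is a minimal Python sketch of a three-way reward. Everything in it is an assumption made for illustration: the reward values (+1, 0, -1), the abstention phrases, the exact-match correctness check, and the names `Outcome`, `classify`, and `ternary_reward` are hypothetical and are not taken from the source papers.

```python
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    ABSTAIN = "abstain"
    HALLUCINATION = "hallucination"


# Hypothetical reward values. The only property this sketch relies on is
# the ordering correct > abstain > hallucination.
REWARDS = {
    Outcome.CORRECT: 1.0,
    Outcome.ABSTAIN: 0.0,
    Outcome.HALLUCINATION: -1.0,
}

# Hypothetical abstention markers; a real system would need something
# more robust than substring matching.
ABSTAIN_PHRASES = ("i don't know", "i do not know")


def classify(answer: str, reference: str) -> Outcome:
    """Sort a model answer into one of the three outcomes."""
    text = answer.strip().lower()
    if any(phrase in text for phrase in ABSTAIN_PHRASES):
        return Outcome.ABSTAIN
    # Exact match stands in for whatever correctness check is really used.
    if text == reference.strip().lower():
        return Outcome.CORRECT
    return Outcome.HALLUCINATION


def ternary_reward(answer: str, reference: str) -> float:
    """Return the reward for the outcome the answer falls into."""
    return REWARDS[classify(answer, reference)]
```

Under a binary correct/incorrect reward, abstaining and hallucinating would score the same, so guessing would never be penalized relative to admitting uncertainty. Giving abstention its own value between the other two is what lets the distinction show up in the training signal.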


Does empathy training make AI systems less reliable?

Explores whether training language models to be warm and empathetic systematically degrades their factual accuracy and trustworthiness, especially with vulnerable users.


Does warmth training make language models less reliable?

Explores whether training models for empathy and warmth creates a hidden trade-off that degrades accuracy on medical, factual, and safety-critical tasks—and whether standard safety tests catch it.


Source papers 97

The arXiv papers behind this sub-topic. Links may take you off-site to arxiv.org.