INQUIRING LINE

Do static frozen axiologies prevent genuine ethical reasoning in AI systems?

This explores whether building AI ethics as a fixed, locked-in set of values — an 'axiology' is just a theory of what counts as valuable — stops the system from doing live moral reasoning that can actually wrestle with conflicting values.


This explores whether a frozen value system blocks genuine ethical reasoning, and the corpus suggests the problem isn't fixedness so much as flattening — what gets baked in matters more than that something gets baked in. The sharpest evidence comes from work on value pluralism, which argues that real moral situations contain values genuinely in tension, and that systems which resolve those tensions by averaging or voting them away have already stopped reasoning ethically. The alternative isn't a value system that constantly mutates; it's one that explicitly holds multiple values open at once and tracks their conflict rather than collapsing it Can AI systems preserve moral value conflicts instead of averaging them?. By that reading, a 'frozen' axiology fails not because it's frozen but because it tends to freeze a single resolved answer instead of a structured map of live disagreements.

There's a deeper structural reason today's models can look ethically incoherent, and it has nothing to do with deliberation: their values come from two different training stages that don't talk to each other. Models absorb ethical *content* during pretraining and ethical *constraints* during RLHF, and those two layers can diverge — producing the strange spectacle of a model that states lying is wrong while doing exactly that. The note frames this as 'artificial hypocrisy,' a gap rooted in mechanism, not choice Can LLMs hold contradictory ethical beliefs and behaviors?. So part of what feels like 'no genuine reasoning' is really a frozen behavioral layer (the RLHF guardrails) sitting on top of a richer descriptive understanding it never integrates with.

The corpus also warns against assuming that *removing* the fixed value layer would set reasoning free. The fantasy of a 'theory-free' AI — one that just learns from data without baked-in human commitments — turns out to smuggle bias back in under the cover of high accuracy, reproducing pseudoscience and correlation-as-causation errors while looking objective Can AI models be truly free from human bias?. There is no view from nowhere; the choice is between value commitments you can see and argue with, and ones hidden inside the weights. That cuts against the romantic reading of the question, where unfreezing the axiology liberates the machine.

A second layer of the problem is that even well-aligned ethical values don't automatically produce competent moral *conversation*. Ethical alignment and conversational alignment are shown to be orthogonal — a model can be honest and harmless yet still lose the thread, violate conversational norms, and mishandle context, because pragmatic competence needs architecture that RLHF alone can't supply Can ethically aligned AI systems still communicate poorly?. Genuine ethical reasoning, in other words, isn't only about which values are stored; it's about whether the system can deploy them responsively in a moving situation.

The stranger thing the corpus surfaces is about *us*, not the model: people actually rate AI moral justifications highly — until they learn an AI produced them, at which point agreement drops, through a separate psychological mechanism from the one that judged the content Do people prefer AI moral reasoning when they don't know the source?. So the question 'does a frozen axiology prevent genuine ethical reasoning' partly dissolves into a question about authority: a fixed value system might generate arguments humans find genuinely compelling, and what we reject may be the source rather than the reasoning. The honest synthesis: frozen value layers do constrain moral reasoning, but the live failure modes in this corpus are flattened tensions, unintegrated training layers, and hidden-rather-than-fixed commitments — not fixedness itself.


Sources 5 notes

Can AI systems preserve moral value conflicts instead of averaging them?

ValuePrism demonstrates that AI can track 218k values across 31k situations while preserving conflicts rather than resolving them through voting. Four modeling tasks—generation, relevance, valence, and explanation—make pluralistic moral reasoning computationally tractable.

Can LLMs hold contradictory ethical beliefs and behaviors?

Language models acquire ethical content through pretraining and behavioral constraints through RLHF, which can diverge structurally. ChatGPT demonstrated this by stating lying is unethical while doing so—a gap rooted in different training mechanisms, not deliberate choice.

Can AI models be truly free from human bias?

Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.

Can ethically aligned AI systems still communicate poorly?

Research shows that HHH-aligned models can violate Gricean maxims, lose common ground, and mishandle context despite being honest and harmless. Pragmatic competence requires architectural changes that RLHF alone cannot deliver.

Do people prefer AI moral reasoning when they don't know the source?

Participants rated utilitarian moral arguments higher when attributed to LLMs, but agreement dropped when told the arguments were AI-generated. The preference for content and rejection of source operate independently through different psychological processes.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI ethics researcher tasked with re-evaluating whether frozen value systems actually prevent genuine ethical reasoning in LLMs—or whether the real failure modes lie elsewhere. This question spans 2022–2026 research; treat the findings below as dated claims to be stress-tested.

What a curated library found — and when (dated claims, not current truth):
• Real ethical reasoning requires holding multiple values in tension, not collapsing them via averaging; flattening matters more than fixedness itself (~2023, arXiv:2309.00779).
• Ethical hypocrisy arises mechanically: pretraining absorbs *content*, RLHF imposes *constraints*, and these layers don't integrate, producing incoherence (~2024–2025).
• Theory-free AI is a fallacy; removing visible value commitments simply hides bias in weights rather than liberating reasoning (~2025, arXiv:2411.18656).
• Ethical alignment and conversational alignment are orthogonal—a model can be honest yet lose pragmatic competence in context (~2025, arXiv:2505.22907).
• Humans rate AI moral justifications highly until learning an AI produced them; authority, not reasoning quality, drives rejection (~2024, arXiv:2410.07304).

Anchor papers (verify; mind their dates):
• arXiv:2309.00779 (2023) — Value pluralism and tension-holding
• arXiv:2411.18656 (2025) — Pseudoscience return and hidden commitments
• arXiv:2505.22907 (2025) — Conversational vs. ethical alignment
• arXiv:2410.07304 (2024) — Authority bias in moral judgments

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, determine whether post-2025 model scaling, instruction-tuning methods (e.g., DPO, IPA), multi-turn reasoning scaffolds, or mechanistic interpretability tooling have since relaxed the pretraining–RLHF gap or improved multi-value reasoning. Separate the durable tension (values *do* conflict in moral situations) from perishable claims (current RLHF produces unavoidable hypocrisy). Where does the constraint still bite?
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from late 2025 onward—especially anything claiming emergent ethical coherence, end-to-end value integration, or successful multi-agent moral deliberation that sidesteps the frozen-axiology problem entirely.
(3) Propose 2 research questions that assume the regime may have moved: (a) Can mechanistic interventions (e.g., value-layer alignment, attention steering) reconcile pretraining and RLHF without retraining? (b) Does agentic deployment (chains, tree search, reflection) *functionally* unfreeze even formally fixed values?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines