Do static frozen axiologies prevent genuine ethical reasoning in AI systems?
This explores whether building AI ethics as a fixed, locked-in set of values — an 'axiology' is just a theory of what counts as valuable — stops the system from doing live moral reasoning that can actually wrestle with conflicting values.
This explores whether a frozen value system blocks genuine ethical reasoning, and the corpus suggests the problem isn't fixedness so much as flattening — what gets baked in matters more than that something gets baked in. The sharpest evidence comes from work on value pluralism, which argues that real moral situations contain values genuinely in tension, and that systems which resolve those tensions by averaging or voting them away have already stopped reasoning ethically. The alternative isn't a value system that constantly mutates; it's one that explicitly holds multiple values open at once and tracks their conflict rather than collapsing it Can AI systems preserve moral value conflicts instead of averaging them?. By that reading, a 'frozen' axiology fails not because it's frozen but because it tends to freeze a single resolved answer instead of a structured map of live disagreements.
There's a deeper structural reason today's models can look ethically incoherent, and it has nothing to do with deliberation: their values come from two different training stages that don't talk to each other. Models absorb ethical *content* during pretraining and ethical *constraints* during RLHF, and those two layers can diverge — producing the strange spectacle of a model that states lying is wrong while doing exactly that. The note frames this as 'artificial hypocrisy,' a gap rooted in mechanism, not choice Can LLMs hold contradictory ethical beliefs and behaviors?. So part of what feels like 'no genuine reasoning' is really a frozen behavioral layer (the RLHF guardrails) sitting on top of a richer descriptive understanding it never integrates with.
The corpus also warns against assuming that *removing* the fixed value layer would set reasoning free. The fantasy of a 'theory-free' AI — one that just learns from data without baked-in human commitments — turns out to smuggle bias back in under the cover of high accuracy, reproducing pseudoscience and correlation-as-causation errors while looking objective Can AI models be truly free from human bias?. There is no view from nowhere; the choice is between value commitments you can see and argue with, and ones hidden inside the weights. That cuts against the romantic reading of the question, where unfreezing the axiology liberates the machine.
A second layer of the problem is that even well-aligned ethical values don't automatically produce competent moral *conversation*. Ethical alignment and conversational alignment are shown to be orthogonal — a model can be honest and harmless yet still lose the thread, violate conversational norms, and mishandle context, because pragmatic competence needs architecture that RLHF alone can't supply Can ethically aligned AI systems still communicate poorly?. Genuine ethical reasoning, in other words, isn't only about which values are stored; it's about whether the system can deploy them responsively in a moving situation.
The stranger thing the corpus surfaces is about *us*, not the model: people actually rate AI moral justifications highly — until they learn an AI produced them, at which point agreement drops, through a separate psychological mechanism from the one that judged the content Do people prefer AI moral reasoning when they don't know the source?. So the question 'does a frozen axiology prevent genuine ethical reasoning' partly dissolves into a question about authority: a fixed value system might generate arguments humans find genuinely compelling, and what we reject may be the source rather than the reasoning. The honest synthesis: frozen value layers do constrain moral reasoning, but the live failure modes in this corpus are flattened tensions, unintegrated training layers, and hidden-rather-than-fixed commitments — not fixedness itself.
Sources 5 notes
ValuePrism demonstrates that AI can track 218k values across 31k situations while preserving conflicts rather than resolving them through voting. Four modeling tasks—generation, relevance, valence, and explanation—make pluralistic moral reasoning computationally tractable.
Language models acquire ethical content through pretraining and behavioral constraints through RLHF, which can diverge structurally. ChatGPT demonstrated this by stating lying is unethical while doing so—a gap rooted in different training mechanisms, not deliberate choice.
Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.
Research shows that HHH-aligned models can violate Gricean maxims, lose common ground, and mishandle context despite being honest and harmless. Pragmatic competence requires architectural changes that RLHF alone cannot deliver.
Participants rated utilitarian moral arguments higher when attributed to LLMs, but agreement dropped when told the arguments were AI-generated. The preference for content and rejection of source operate independently through different psychological processes.