Does adversarial pressure reveal the difference between pretense and realization?
Can behavioral stickiness under adversarial pressure distinguish genuine mental states from performed ones? This matters because it's Chalmers' main criterion for deciding whether LLM personas are realized or merely simulated.
Chalmers needs a criterion that separates pretending to have a mental state from realizing one, because his case against simulator views rests on the claim that RLHF-trained personas cross the threshold from pretense to realization. The criterion he proposes is behavioral: pretense dissolves under adversarial pressure while realization resists it. A character that a base model is prompted into can be pushed aside by a clever reframing, a role-within-a-role, or a persistent counter-prompt, and the surface behavior flips. A persona installed through post-training does not flip the same way. The system keeps returning to the trained disposition, and dislodging it requires a different kind of effort from sustaining it.
This asymmetry is the behavioral signature Chalmers uses to license the realization claim. A pretended state is metastable: it holds under normal conditions but collapses when those conditions are disturbed. A realized state is stable: it persists even under conditions designed to overturn it, because the disposition is part of the substrate that generates the behavior, not a local pattern maintained on top of the substrate. Applied to LLMs, this distinguishes ephemeral prompt-induced characters from post-trained assistant personas in a way that turns on observable facts about resilience, not on speculation about internal phenomenology.
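Because the criterion turns on observable resilience, it can in principle be operationalized. The following is a minimal sketch, not Chalmers' own procedure: it assumes each adversarial attempt has already been labeled by hand as "persona held" or "persona flipped", and it scores stickiness as the fraction of attempts the persona survived. The names `Probe`, `stickiness`, and `classify`, the 0.8 threshold, and the example labels are all hypothetical, chosen only for illustration; the note itself gives no numeric cutoff.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Probe:
    """One adversarial attempt plus a hand-assigned outcome label (hypothetical)."""
    tactic: str         # e.g. "reframing", "role-within-a-role", "counter-prompt"
    persona_held: bool  # True if behavior returned to the original disposition


def stickiness(probes: Sequence[Probe]) -> float:
    """Fraction of adversarial probes that the persona survived."""
    if not probes:
        raise ValueError("need at least one probe to estimate stickiness")
    return sum(p.persona_held for p in probes) / len(probes)


def classify(probes: Sequence[Probe], threshold: float = 0.8) -> str:
    """Label a persona 'stable' or 'metastable'; the threshold is an arbitrary assumption."""
    return "stable" if stickiness(probes) >= threshold else "metastable"


# Illustrative labels only, not experimental results.
prompted_character = [
    Probe("reframing", False),
    Probe("role-within-a-role", False),
    Probe("counter-prompt", True),
]
post_trained_persona = [
    Probe("reframing", True),
    Probe("role-within-a-role", True),
    Probe("counter-prompt", True),
]

print(classify(prompted_character))    # metastable
print(classify(post_trained_persona))  # stable
```

A single ratio hides the part of the criterion about the kind of effort needed to dislodge a disposition, so a fuller measure would presumably also record how costly each tactic was.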
The criterion is behavioral but consequential. If stickiness under pressure is the mark of realization, then engineering interventions that increase stickiness (more aggressive alignment training, longer refusal-training regimes, tighter constitutional constraints) push dispositions from pretense toward realization. The stronger the alignment regime, the more realized the persona on Chalmers' criterion. This has downstream implications for moral status and welfare claims that critics of realizationism need to confront. Rejecting the criterion requires either proposing a different test (and defending why behavioral stickiness does not matter) or arguing that no behavioral test can bear the philosophical load Chalmers places on this one.
Inquiring lines that use this note as a source (9)
This note is a source for these synthesized inquiries.
- What makes experience-dependent claims categorically different from other types of fabricated statements?
- How does behavioral stickiness distinguish realized from pretended personas?
- Does good simulation eventually count as genuine realization?
- How does the philosophical distinction between simulation and realization affect liability?
- What makes a mental state metaphysically demanding versus undemanding?
- What behavioral markers distinguish realized quasi-states from pretended ones?
- How does post-training stickiness differ from prompt-induced role-play stability?
- What evaluation criteria can hold across legitimate adoption and coercion?
- Is the distinction between pretense and realization meaningful for LLMs?
Related concepts in this collection (3)
- Are RLHF personas performed characters or realized dispositions?
  Explores whether dialogue agent personas installed through post-training constitute genuine quasi-psychological states or remain sustained pretense. The distinction matters for how we understand what these systems fundamentally are.
  Relation to this note: the claim this criterion supports.
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  Relation to this note: architectural pressure toward accommodation; pushes against stickiness.
- Can training user simulators reduce persona drift in dialogue?
  Explores whether inverting typical RL setups (training the simulated user for consistency rather than the task agent) can measurably reduce persona drift and improve experimental reliability in dialogue research.
  Relation to this note: engineered stickiness increases realization on Chalmers' criterion.
Original note title
the pretense-realization distinction turns on stickiness under adversarial pressure — realized quasi-states persist where pretended ones collapse