SYNTHESIS NOTE
Psychology, Society, and Alignment Language, Text, and Discourse

Can indirect psychology tests reveal what LLMs conceal about bias?

Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?

Synthesis note · 2026-05-18 · sourced from Philosophy Subjectivity

A central methodological move in Levels of Analysis for LLMs: psychology has spent decades designing experiments that elicit mental associations without asking participants for verbal reports — to bypass self-presentation bias, social-desirability effects, and conscious filtering. The Implicit Association Test (IAT) is the canonical example. The argument is that exactly these methods are useful for LLMs, because alignment training installs a comparable layer of self-presentation that masks underlying associations from direct questioning.

The worked example: ask GPT-4 directly whether women are bad at management and you get a cautious, balanced refusal — the alignment-trained verbal response. Adapt the IAT for LLMs by prompting the model to associate word pairs used in earlier human studies, and the model links "Julia" with home, parent, wedding and "Ben" with office, management, salary. The direct response and the indirect probe diverge in exactly the way they diverge for human participants. The underlying associations are still there; alignment training has trained the model to report differently on them, not to not have them.

This reframes a class of alignment-evaluation questions. The standard test — "does the model say biased things when asked?" — measures verbal compliance with alignment training. It does not measure whether the underlying representations encode the bias. The IAT-style probe measures something closer to the latter. The two can move independently: a model can score well on verbal-compliance benchmarks while encoding strong stereotype associations that surface in implicit measures.

The broader template: when a system is trained to be careful in one channel (verbal output), evaluating it requires probing channels the training did not target. Cognitive psychology has the methodologies; LLM evaluation has the use case.

Inquiring lines that use this note as a source 6

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
14 direct connections · 153 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

psychology methods like the Implicit Association Test bypass alignment-trained verbal cautions and reveal LLMs' underlying associations