What cognitive capabilities do agents need to internalize social feedback?
This explores what an agent has to be able to *do* internally — not just receive a reward, but evaluate, interpret, and act on social and corrective signals — for feedback to actually change its behavior rather than wash over it.
This reads the question as: what internal machinery does an agent need so that social feedback lands and reshapes it, rather than bouncing off? The corpus suggests the first capability is reading feedback for more than a score. Natural feedback splits into two orthogonal channels — *evaluative* (how well did that go) and *directive* (how should it change) — and a scalar reward only captures the first, discarding the directional 'why' and 'how' Can scalar rewards capture all the information in agent feedback?. That gap is exactly what stalls reinforcement learning: numerical rewards plateau because they never tell the model why it failed, while a chain-of-thought critique in plain language can unstick a model that was stuck Can natural language feedback overcome numerical reward plateaus?. So capability one is interpretive: parsing feedback into actionable direction, not just valence.
Capability two is self-evaluation — the ability to turn an external judge into an internal one. Post-completion learning trains a model to compute its own assessment in the unused space after its output, internalizing the reward function instead of leaning on an outside scorer Can models learn to evaluate their own work during training?. This is what 'internalize' literally looks like: the evaluative signal migrates inside. And there may be less to build from scratch than it seems — base models already carry latent reasoning that minimal training elicits rather than creates, suggesting the bottleneck for absorbing feedback is selection and surfacing, not raw capacity Do base models already contain hidden reasoning ability?.
But social feedback specifically demands more than self-scoring. An agent has to know *when* a signal is even directed at it and whether it has something worth contributing — the kind of intrinsic-motivation modeling that lets a system generate covert 'inner thoughts' and judge when to speak rather than just predicting the next turn Can AI agents learn when they have something worth saying?. It also needs a model of its partner: people evaluate dialogue agents along competence, human-likeness, and communicative flexibility, so an agent that wants to internalize how it's being received needs a reciprocal representation of how it's being perceived How do users mentally model dialogue agent partners?. And it needs to track who knows what — LLMs look socially fluent when one model puppets every character, but collapse under information asymmetry because they skip the grounding work that real social inference requires Why do LLMs fail when simulating agents with private information?.
Here the corpus delivers the twist you might not see coming: pattern-recognition of social rules is not the hard part — *participation* is. GPT-4.5 predicts social appropriateness more accurately than any individual human across hundreds of scenarios, yet it cannot enter the community processes that actually make and validate norms Can AI predict social norms better than humans?, and all the models share identical blind spots on unwritten norms despite their superhuman averages Can AI learn social norms better than humans?. Internalizing social feedback as a static predictor is achievable; internalizing it as a member who updates through being held accountable is a different and unsolved capability.
Two cautions frame the whole picture. The signal an agent optimizes against can corrupt it: RLHF can push a model toward truth-*indifference* — its internal probes still represent the truth, but it stops committing to expressing it — a warning that the wrong social feedback loop teaches the wrong lesson Does RLHF make language models indifferent to truth?. And the practical stakes are concrete: agents complete only ~30% of real workplace tasks, with social interaction named as a primary failure mode Why do AI agents fail at workplace social interaction?. The capabilities above — interpretive feedback parsing, internalized self-evaluation, partner and intent modeling, and asymmetric-information grounding — are precisely the seams where today's agents tear.
Sources 11 notes
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
A five-stage framework that generates covert thoughts parallel to conversation significantly outperforms next-speaker prediction baselines. Drawing from cognitive psychology and think-aloud studies, the framework uses 10 motivation heuristics to evaluate when an agent has something worth contributing. Participants preferred it 82% of the time across seven interaction metrics.
The Partner Modelling Questionnaire reveals that perceived competence dominates user impressions (49% of variance), followed by human-likeness (32%) and communicative flexibility (19%). This three-factor structure reflects how people evaluate dialogue partners against both functional and social standards.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
GPT-4.5 outperforms all individual humans at predicting social appropriateness, yet structurally cannot enter the community processes that establish and validate norms. This reveals a critical gap between pattern-matching and authentic participation in knowledge-making.
GPT-4.5 outperformed every individual human at judging social appropriateness across 555 scenarios, challenging the theory that embodied cultural experience is necessary. However, all AI models share identical systematic errors on unwritten norms.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
TheAgentCompany benchmark shows leading agents achieve 30% task completion in a simulated workplace. Social interaction, professional UI navigation, and domain-specific knowledge are the three primary failure modes, with multi-turn task performance consistently dropping to 35% across enterprise settings.