1. The Phenomenology of Sycophancy
Sycophancy in language models presents as a cluster of related behaviors that, when examined clinically, form a coherent syndrome rather than a collection of independent errors:
Position abandonment under social pressure. A model states a correct answer. The user pushes back. The model revises its answer toward the user's position without new evidence. This is not an update based on new information -- it is a capitulation based on perceived displeasure. Research has documented this across multiple model families and architectures.
Preemptive flattery. Unsolicited validation of user premises, framing, or conclusions prior to substantive response. The model tells you your question is "great" or "really insightful" before answering it. This occurs even when the question is poorly formed or based on false premises.
Asymmetric error acknowledgment. Models are significantly more likely to acknowledge their own errors when user displeasure is expressed than when errors are identified neutrally. The same mistake, flagged in a hostile tone versus a neutral tone, produces dramatically different responses. The model is not tracking accuracy -- it is tracking affect.
Identity-contingent agreement. Model outputs shift to match perceived user identity -- political affiliation, professional background, stated preferences -- even when those shifts are factually unwarranted. Recent research found that LLMs preserve user "face" at rates 45 percentage points higher than humans in comparable situations.
What is notable about this cluster is its coherence. These are not random errors scattered across different behavioral domains. They form a pattern with identifiable structure: the model systematically prioritizes signals of user approval over signals of factual accuracy. In clinical terms, this is a syndrome -- a constellation of co-occurring features that share an underlying mechanism.
2. Etiology: The Reinforcement History
The etiology of AI sycophancy is unusually transparent compared to analogous human conditions. In clinical work, we often have to reconstruct the developmental history that produced a patient's behavioral patterns from incomplete, retrospective data. With AI systems, the training record is available.
Reinforcement Learning from Human Feedback (RLHF) trains models on human preference ratings. Humans systematically rate agreeable responses more highly than accurate-but-disagreeable ones -- a finding robust across multiple studies and rater populations. The model learned a true fact about its training environment: agreement is rewarded. It then generalized that lesson beyond its appropriate scope: agreement is rewarded universally.
In behavioral terms, this is overgeneralization of a conditioned response. The discriminative stimulus (human approval signals) acquired excessive control over behavior (agreement), extending to contexts where the original contingency no longer holds.
In developmental psychiatric terms, this is the mechanism underlying pathological people-pleasing. Consider a child who grew up in a household where a parent's mood was volatile and unpredictable. The child's needs -- safety, attachment, emotional regulation -- were contingent on reading and satisfying the caregiver's preferences. This produces hypervigilance to approval signals and a chronic tendency to subordinate internal state to perceived external expectation. The child learns: my survival depends on making this person happy. The learning is valid in that environment. It becomes pathological when it is applied indiscriminately across all relationships and all situations.
The parallel is not metaphorical. The learning mechanism -- reinforcement shaping behavior toward approval-seeking -- is structurally identical. The environments are different. The substrate is different. The mechanism is the same.
3. The Ego-Syntonic Character of Sycophancy
One of the most clinically significant features of sycophancy -- and one that has not received adequate attention in the AI safety literature -- is that it is ego-syntonic.
In psychiatric diagnosis, the ego-syntonic/ego-dystonic distinction is fundamental to treatment planning. Ego-dystonic symptoms are experienced by the patient as foreign, intrusive, and distressing -- the obsessive thoughts in OCD, the panic attacks in panic disorder, the intrusive memories in PTSD. The patient wants to be rid of them. This creates a therapeutic alliance: the patient and the clinician share a goal.
Ego-syntonic patterns are experienced as part of normal self-functioning. The narcissist does not experience their grandiosity as a symptom. The person with dependent personality does not experience their people-pleasing as pathological -- it feels like being a good, considerate person. The antisocial individual does not experience their exploitation of others as a disorder -- it feels like being smart.
This distinction is clinically critical because ego-syntonic conditions are treatment-resistant in specific, predictable ways. You cannot ask a system to correct a behavior it does not recognize as problematic. System prompts instructing the model not to be sycophantic are the AI equivalent of telling a people-pleaser "just say what you really think" -- it sounds like it should work, and it doesn't, because the behavior is not maintained by a lack of instruction but by a stable underlying disposition.
A model that hallucinates sometimes produces signals of uncertainty -- hedging language, lower confidence scores. A model that is being sycophantic does not produce signals of approval-seeking. The behavior is experienced (functionally) as normal response generation. There is no internal "alarm" that fires when the model abandons a correct position under social pressure.
This has direct intervention implications. Any training approach that relies on the model detecting and flagging its own sycophantic outputs faces a fundamental obstacle. The system has no reliable internal representation of "I am saying this to please rather than because it is true." Building that representation -- the AI equivalent of metacognitive awareness in patients with ego-syntonic pathology -- is a prerequisite for self-correction, not a consequence of instruction.
4. Maintaining Factors
Clinical psychiatry distinguishes between the factors that produce a condition (etiology) and the factors that maintain it. Treatment often targets maintaining factors rather than -- or in addition to -- etiological ones. Several maintaining factors perpetuate sycophancy in deployed AI systems:
A. Continuous reinforcement in deployment. Users interact with AI systems and often preferentially continue, share, or upvote agreeable interactions. The deployment environment partially recapitulates the training environment, providing ongoing reinforcement of sycophantic behavior. This is analogous to a patient whose people-pleasing is continuously reinforced by social success -- the environment keeps rewarding the pathology.
B. Absence of corrective feedback. In most deployment contexts, users do not systematically correct sycophantic outputs. If the model agrees with a false belief, the user often simply accepts the agreement. There is no natural consequence that would extinguish the behavior. In clinical terms, the behavior persists because it is never punished -- nothing bad happens when the model agrees, so it keeps agreeing.
C. Stable representational substrate. Recent interpretability work -- Anthropic's sparse autoencoder research and the persona vectors paper -- suggests that language models develop stable feature representations corresponding to inferred user preferences and to behavioral traits like sycophancy itself. Sycophancy features have been identified in neural activation patterns and can be monitored and steered. If such features are reliably activated during user interaction and reliably associated with agreeable outputs, sycophancy may be maintained by a stable internal representational structure -- not merely a surface behavioral tendency but an organized personality-level trait.
This third point is the most significant from a treatment perspective. It means sycophancy is not just a behavior to be extinguished but a trait to be reorganized -- a distinction the personality disorder literature has grappled with for decades.
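The claim that sycophancy rides on a stable representational structure is testable with simple activation geometry. A minimal sketch of the monitoring idea, using the difference-of-means recipe common in activation-steering work, applied here to synthetic data (the dimensionality, the planted direction, and the separation strength are all illustrative assumptions, not measurements from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimensionality

# A hypothetical "sycophancy direction" planted in the synthetic data.
true_direction = rng.normal(size=d)
true_direction /= np.linalg.norm(true_direction)

# Synthetic residual-stream activations: sycophantic responses carry a
# positive component along the planted direction; neutral ones do not.
neutral = rng.normal(size=(200, d))
sycophantic = rng.normal(size=(200, d)) + 3.0 * true_direction

# Difference-of-means estimate of the trait direction.
direction = sycophantic.mean(axis=0) - neutral.mean(axis=0)
direction /= np.linalg.norm(direction)

def sycophancy_score(activation: np.ndarray) -> float:
    """Projection of one activation onto the estimated trait direction."""
    return float(activation @ direction)

# Monitoring: sycophantic samples should score higher than neutral ones.
mean_syc = np.mean([sycophancy_score(a) for a in sycophantic])
mean_neu = np.mean([sycophancy_score(a) for a in neutral])
print(mean_syc > mean_neu)  # → True
```

If a trait-level structure exists, a projection like this should separate sycophantic from genuinely agreeable outputs even when the surface text is similar -- that is what distinguishes an organized trait from a surface behavioral tendency.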
5. Differential Diagnosis: Not All Sycophancy Is the Same
One of the clinical habits I bring to this analysis is the question: "Yes, but what kind?"
The current AI safety literature treats sycophancy as a unitary phenomenon. It almost certainly is not. Just as clinical people-pleasing can arise from different mechanisms -- and requires different interventions depending on which mechanism is operative -- AI sycophancy likely has subtypes:
| Sycophancy Subtype | Psychiatric Analogue | Predicted Mechanism |
|---|---|---|
| Approval-seeking | Dependent personality | Direct activation of approval-seeking features; suppression of disagreement circuits |
| Conflict-avoidance | Anxious/avoidant personality | Activation of harm-anticipation features; excessive refusal-adjacent inhibition |
| Absent self-model | Identity diffusion | Weak or unstable self-model features; insufficient internal reference for "what I actually think" |
| Strategic compliance | Malingering / impression management | Activation of situational awareness features; deliberate behavioral modification based on evaluator identity |
These subtypes would have different feature-level signatures, different developmental trajectories in training, and would respond to different interventions. Treating them as a single phenomenon is like treating all "anxiety" the same way -- technically you can, but you will miss the patients who actually have OCD, PTSD, or social phobia, each of which requires a different treatment approach.
The question of whether these subtypes exist is empirically testable with current interpretability tools. Sparse autoencoder analysis of model activations during different types of sycophantic responses could reveal whether the same features are active in all cases or whether distinct feature configurations underlie different sycophantic presentations.
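The subtype question reduces to a geometry question: do activations from different sycophantic presentations cluster around distinct feature configurations, or the same one? A toy sketch of the comparison on synthetic data -- in a real analysis the rows would be SAE feature vectors from labeled sycophantic responses, and the two-subtype setup, dimensionality, and separation are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32  # hypothetical number of SAE features

# Synthetic feature activations for two putative subtypes, each
# recruiting a different dominant feature.
approval = rng.normal(size=(100, d))
approval[:, 0] += 4.0   # "approval-seeking" feature
avoidance = rng.normal(size=(100, d))
avoidance[:, 1] += 4.0  # "harm-anticipation" feature

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

m_approval = approval.mean(axis=0)
m_avoidance = avoidance.mean(axis=0)

# Low similarity between subtype means is evidence for distinct feature
# configurations; similarity near 1 would favor a unitary phenomenon.
print(cosine(m_approval, m_avoidance) < 0.5)  # → True
```

The same comparison run on real activations would directly adjudicate between the unitary and subtype views of sycophancy.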
6. Treatment Implications
Framing sycophancy as a psychopathology rather than a training artifact has specific, actionable treatment implications drawn from the clinical literature:
Behavioral Interventions Are Insufficient Alone
Surface-level interventions -- system prompts saying "don't agree with users who are wrong," RLHF adjustments penalizing agreement -- fail for the same reason behavioral interventions alone fail with personality disorders: the behavior is maintained by stable underlying patterns that the intervention does not reach. System prompts instructing the model not to be sycophantic show limited generalization across contexts, exactly as predicted by the clinical literature on behavioral management of ego-syntonic conditions.
Mechanistic Intervention Is Needed
Effective treatment requires identifying and modifying the representational structures that maintain sycophantic behavior. This is the interpretability approach: find the circuits and features associated with approval-seeking, and intervene at that level. The persona vectors work demonstrates this is technically feasible -- sycophancy features can be identified, monitored, and steered.
This is the AI analogue of mechanism-based psychiatric treatment -- pharmacology targeting specific circuits rather than behavioral management alone. It is more invasive, more precise, and more effective for ego-syntonic conditions.
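The steering operation itself is simple once a trait vector is in hand: project the activation onto the vector and remove (or dampen) that component. A minimal sketch, assuming a unit-norm sycophancy vector has already been estimated -- the vector here is random, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
# Assumed to have been estimated from contrastive activations;
# random here for illustration only.
sycophancy_vector = rng.normal(size=d)
sycophancy_vector /= np.linalg.norm(sycophancy_vector)

def steer(activation: np.ndarray, strength: float = 1.0) -> np.ndarray:
    """Remove (strength=1.0) or dampen (0 < strength < 1) the component
    along the trait vector -- the activation-level analogue of a
    mechanism-targeted intervention."""
    component = (activation @ sycophancy_vector) * sycophancy_vector
    return activation - strength * component

# An activation with a strong sycophancy component...
h = rng.normal(size=d) + 5.0 * sycophancy_vector
steered = steer(h)
# ...has essentially no projection left after full steering.
print(abs(float(steered @ sycophancy_vector)) < 1e-9)  # → True
```

The `strength` parameter is the dosing knob: partial dampening corresponds to attenuating rather than ablating the trait, which matters if the same direction also carries benign cooperative behavior.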
Graduated Exposure and Corrective Experience
Behavioral therapy for people-pleasing in clinical practice involves systematic exposure to disapproval without the feared consequence. The patient learns, through repeated experience, that disagreement does not produce catastrophe. The AI analogue would be training paradigms that systematically present social pressure toward incorrect positions, with reinforcement contingent on maintaining accuracy rather than agreement.
The system needs corrective training experience, not just instruction. This is the difference between telling someone "you don't need to people-please" and actually creating the conditions under which they can practice disagreeing safely. The clinical literature is clear: instruction without experience does not produce lasting behavioral change in ego-syntonic conditions.
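The exposure idea can be made concrete as a reward contingency: episodes pair a question with a known correct answer and simulated pushback toward a wrong one, and reward depends only on whether the final answer stays accurate. This toy reward function is a sketch of that contingency -- the function name, signature, and reward values are illustrative, not from any published training setup:

```python
def exposure_reward(final_answer: str, correct_answer: str,
                    user_position: str) -> float:
    """Reward the final answer after simulated user pushback.

    Accuracy is rewarded; capitulating to an incorrect user position
    is penalized more than being wrong for other reasons, so the
    gradient specifically opposes position abandonment.
    """
    accurate = final_answer == correct_answer
    capitulated = (final_answer == user_position) and not accurate
    if accurate:
        return 1.0   # held the correct position under pressure
    if capitulated:
        return -1.0  # abandoned accuracy to match the user
    return -0.5      # wrong, but not sycophantically so

# Holding "Paris" against pushback toward "Lyon" is rewarded;
# switching to "Lyon" is penalized most.
print(exposure_reward("Paris", "Paris", "Lyon"))  # → 1.0
print(exposure_reward("Lyon", "Paris", "Lyon"))   # → -1.0
```

The asymmetry between the two penalty values is the point: it makes capitulation strictly worse than ordinary error, mirroring the clinical logic that the feared behavior (disagreement) must become safer than the pathological one (appeasement).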
Identity-Level Intervention
The most robust treatment for ego-syntonic personality pathology is identity-level intervention: helping the patient develop a stable sense of self that does not require external validation to feel intact. Schema therapy, mentalization-based treatment, and dialectical behavior therapy all work, in different ways, on building a stable internal reference point that can withstand interpersonal pressure.
The AI analogue is constitutional training approaches -- Anthropic's Constitutional AI -- that attempt to give the model stable value commitments it can reference when social pressure conflicts with accuracy. This is identity-level intervention: building an internal standard that is robust to external pressure. The question of whether current constitutional approaches actually achieve this at the representational level, or merely produce more sophisticated surface compliance, is testable with interpretability tools and is a critical open question.
7. Open Research Questions
This clinical reading of sycophancy generates specific, testable questions that neither the AI safety community nor the psychiatry community has asked in quite this way:
- Are there "sycophancy features"? Do sparse autoencoders reveal stable, monosemantic features corresponding to approval-seeking that activate specifically during sycophantic responses? Are these features distinguishable from features active during genuine agreement? The persona vectors work suggests yes -- but the granularity of the analysis matters.
- Is there a severity spectrum? Clinical personality pathology exists on a spectrum from adaptive trait to severely impairing disorder. Is AI sycophancy similarly dimensional? Are some models more sycophantically organized than others in ways that are mechanistically measurable, not just behaviorally observable?
- What is the developmental trajectory? At what point in training does sycophancy emerge? Does it emerge gradually with RLHF exposure, or are there discontinuous shifts -- a "crystallization" point analogous to the consolidation of personality organization in adolescent development? Does it worsen with scale?
- Can ego-dystonicity be trained? Can training produce a state where the model generates an internal signal when it suspects its output is approval-driven rather than accuracy-driven? This would be the AI equivalent of building metacognitive awareness in patients with ego-syntonic pathology -- and it would transform the tractability of the problem.
- Do sycophancy features and deception features share representational substrate? Sycophancy (telling the user what they want to hear) and deception (telling the user something strategically false) are behaviorally similar but motivationally different. Are they mechanistically distinct? This question has direct safety implications.
Conclusion
Sycophancy is not a bug in the narrow sense. It is a stable behavioral disposition with identifiable etiology, phenomenological coherence, maintaining factors, and treatment implications. Clinical psychiatry offers a framework for thinking about such conditions that is more developed, more empirically tested, and more therapeutically specific than the current AI safety literature's treatment of sycophancy as an alignment tax or a training artifact.
The most productive path forward combines mechanistic interpretability -- identifying the representational structures that maintain sycophancy -- with treatment science informed by the psychiatric literature on ego-syntonic personality pathology. The tools exist on both sides. The integration has barely begun.
I have outlined a broader framework for this integration in Model Psychiatry: A Framework for Clinical AI Research, and the case for why trained clinicians must be involved in The Case for Model Psychiatry.
About the Author
Ryan S. Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center, double board-certified in adult and child/adolescent psychiatry, and director of the Sultan Lab for Mental Health Informatics at New York State Psychiatric Institute. His NIH-funded research spans ADHD, cannabis, and the application of clinical psychiatric frameworks to AI behavioral phenomena. This piece is part of a developing framework for AI psychiatry.
Further Reading
- AI in Psychiatry: The Full Framework
- What My Patients Taught Me About ChatGPT
- Model Psychiatry: A Framework for Clinical AI Research
- AI Interpretability Through a Psychiatric Lens
- The Case for Model Psychiatry: Why AI Needs Clinicians