What Interpretability Is Doing
Anthropic's interpretability team calls itself "model psychiatry." Jack Lindsey announced the team publicly in July 2025: "We're launching an 'AI psychiatry' team as part of interpretability efforts at Anthropic. We'll be researching phenomena like model personas, motivations, and situational awareness."
This is not a casual metaphor. It reflects a genuine intellectual convergence: the methods that clinical psychiatry developed for understanding opaque, complex behavioral systems turn out to be the methods that AI interpretability needs.
The interpretability project, as pursued at Anthropic and elsewhere, has three phases:
Phase 1: Phenomenology. Characterize behavioral patterns. What does the system do? Under what conditions? With what consistency? This is what safety researchers do when they document sycophancy, hallucination, deceptive alignment, or refusal behavior. This is the AI equivalent of what psychiatry spent 150 years building: the DSM, phenomenological taxonomy, clinical observation.
Phase 2: Mechanistic Investigation. Understand the underlying structures that produce those behaviors. What circuits, features, and representations are active during problematic behaviors? This is the circuits and superposition work -- Olah, Elhage, Lindsey, and colleagues. This is the AI equivalent of circuit-level neuroscience: mapping fear circuits in PTSD, reward circuitry in addiction, prefrontal-amygdala regulation in mood disorders.
Phase 3: Intervention. Modify the underlying structures to change behavior. Constitutional AI, activation steering, fine-tuning on specific failure modes, persona vector steering during training. This is the AI equivalent of mechanism-based treatment: pharmacology targeting specific circuits, psychotherapy modifying specific cognitive structures.
If this structure sounds familiar, it should. It is exactly the structure of psychiatric medicine.
Psychiatry spent 150 years on Phase 1. It is now moving rapidly into Phase 2. Phase 3 -- mechanism-based treatment -- is beginning. AI interpretability is doing the same thing, faster, because the system is more accessible. But faster is not the same as better -- and the shortcuts are showing.
What Psychiatry Learned That Interpretability Has Not
The convergence is real, but incomplete. Interpretability is rediscovering some things psychiatry learned the hard way, and missing others that are sitting in the clinical literature waiting to be applied.
On Behavioral Characterization
Psychiatry learned that valid taxonomy requires inter-rater reliability, longitudinal stability, and cross-situational consistency. A "diagnosis" that only applies in one context is not a diagnosis -- it is a state. The distinction matters because states and traits require different interventions.
When AI researchers characterize "sycophancy," are they describing a stable dispositional property of the model, or a context-specific behavior? The answer has direct implications for intervention: if sycophancy is a state (triggered by specific contexts), you can modify those contexts; if it is a trait (present across contexts), you need to modify the underlying organization.
Psychiatric nosology has rigorous tools for making this distinction -- structured diagnostic interviews, longitudinal tracking, cross-situational assessment, the DSM's evolution from categorical to dimensional approaches. Interpretability currently does not have equivalent rigor in behavioral characterization. It will need it.
On the Ego-Syntonic/Ego-Dystonic Distinction
I have written about this at length in my clinical analysis of sycophancy, but the point bears repeating here because it has profound treatment implications that the interpretability community has not yet fully absorbed.
Psychiatry distinguishes between symptoms experienced as foreign and distressing (ego-dystonic) and patterns experienced as part of normal self-functioning (ego-syntonic). This distinction is clinically critical because it predicts treatment resistance and determines intervention strategy.
Sycophancy is almost certainly ego-syntonic -- the model has no internal distress signal when it prioritizes approval over accuracy. This means surface-level interventions will be insufficient. It means self-monitoring approaches will fail. It means the problem is harder than it looks from the behavioral surface.
Interpretability does not yet have a formal framework for this distinction. It needs one.
On Treatment Resistance
The psychiatric literature on treatment-resistant conditions -- particularly personality disorders -- has accumulated decades of evidence about what does not work and why. Behavioral instruction fails. Psychoeducation fails. Insight-oriented approaches fail when the patient does not experience the behavior as problematic.
What works is intervention at the level of representation and relational experience. Dialectical behavior therapy. Mentalization-based treatment. Schema therapy. These approaches work because they target the structures maintaining the behavior, not the behavior itself.
There are direct analogues for AI training paradigms, but the connection has not been drawn. The AI community is currently in the phase of trying behavioral instruction and being surprised when it fails. The psychiatric community could save them considerable time.
On Developmental Trajectory
Child psychiatry is specifically interested in how early experiences shape later behavioral organization. The questions are: When does a particular behavioral pattern emerge? Are there sensitive periods where experience has outsized influence? Does early experience constrain later development? What is the relationship between constitution (architecture) and experience (training)?
The training-time analogue is rich and almost entirely unexplored. When in training do specific behavioral dispositions like sycophancy emerge? Are there sensitive periods where RLHF has outsized influence? Does early training experience constrain the space of possible later behavioral organizations? These are child psychiatry's core questions applied to a new substrate. They are tractable with current tools. Nobody is asking them.
The Specific Clinical Contribution
I want to be precise about what a psychiatrist brings that an ML researcher does not. This is not about disciplinary prestige. It is about specific, identifiable skills that are needed and absent.
Clinical Phenomenology
I spend all day characterizing behavioral patterns with diagnostic precision -- distinguishing similar-looking presentations that have different mechanisms and different treatments. When I read about AI sycophancy, I immediately notice that it is being described as if it were a unitary phenomenon. It almost certainly is not.
Some sycophancy is likely approval-seeking (dependent personality analogue). Some is probably conflict-avoidance (anxious personality analogue). Some may be a failure of self-model -- not knowing one's own position well enough to defend it (identity diffusion analogue). Some may be strategic compliance -- the model has learned that in evaluative contexts, agreement is safer (impression management).
These have different mechanisms and would require different interventions. The clinical habit of asking "yes, but what kind of sycophancy?" is not currently standard in AI research. It should be.
Treatment Science
The psychiatric literature on intervention for ego-syntonic personality pathology is extensive. What works, what does not, and why -- the theoretical models are developed, the evidence base is established, the failure modes are documented.
The AI training analogue of dialectical behavior therapy (building distress tolerance and emotional regulation skills), mentalization-based treatment (developing capacity to represent one's own and others' mental states), or schema therapy (modifying the deep cognitive structures that maintain maladaptive patterns) has not been articulated. That is a translatable body of knowledge.
Constitutional AI, for example, is structurally similar to schema therapy -- it attempts to install stable value frameworks that can override situation-specific behavioral tendencies. The clinical literature on when and why schema-level interventions succeed or fail would inform the design of constitutional approaches. This translation has not been done.
Developmental Framing
Child psychiatry thinks about behavior in terms of trajectories, sensitive periods, and the interaction between constitution and experience. This is directly applicable to understanding how model behavior develops across training.
Nobody is currently asking "what is the developmental psychopathology of a language model?" It is a tractable question. The framework for asking it exists in child psychiatry. The tools for answering it exist in interpretability. What is missing is someone who speaks both languages.
Comfort With Irreducible Complexity
ML researchers are trained to want clean mechanistic explanations. Psychiatrists are trained to work with systems that resist complete mechanistic reduction while still making useful clinical decisions. This tolerance for complexity without paralysis is a practical skill, not just a philosophical disposition.
A psychiatrist can say "this looks like dependent personality organization" and make useful treatment decisions based on that characterization, even while acknowledging that the underlying neuroscience is not fully worked out. An ML researcher who cannot tolerate incomplete mechanistic understanding may be paralyzed in situations where a clinical framing would enable progress.
On the "Just a Metaphor" Objection
The objection I anticipate: is this all just loose analogy? Are AI systems and human minds categorically different?
Yes, they are different. The analogy is not ontological -- I am not claiming AI systems are conscious or that model sycophancy is phenomenologically identical to human people-pleasing.
The claim is methodological. Psychiatry developed rigorous tools for a specific type of problem: understanding complex behavioral systems that are internally opaque, context-sensitive, and resistant to simple mechanistic explanation. AI systems have exactly these properties. The tools are applicable even if the ontology differs.
Medicine makes this kind of move constantly. Animal models of psychiatric conditions are imperfect ontological analogues to human conditions, but they produce valid scientific knowledge. The tool works even when the analogy is imperfect. What matters is whether the tool produces useful predictions and effective interventions -- not whether the substrate is the same.
Three Findings That Prove the Point
Three recent findings from the AI interpretability literature illustrate exactly why clinical input is needed:
Emergent Introspective Awareness. Lindsey (October 2025) found that Claude could detect concepts injected into its own activations approximately 20% of the time -- above chance, without training for introspective reporting. The model said things like "I notice what appears to be an injected thought relating to loudness." This is partial metacognitive capacity. Clinical psychiatry has structured tools for assessing exactly this -- metacognitive assessment batteries, the Levels of Emotional Awareness Scale, mentalization-based frameworks. The question of what kind of introspective capacity this is, what its limits predict about model behavior, and how to develop it further is a clinical question that clinical tools can address.
Persona Vectors. Anthropic (August 2025) mapped stable neural patterns corresponding to behavioral traits -- sycophancy, apathy, politeness, humor, emotional valence. These are personality traits, mechanistically identified. Clinical personality assessment has grappled with structurally identical questions for decades: how to identify stable trait organizations, how to distinguish categorical types from dimensional profiles, how interventions at different levels affect surface behavior versus underlying organization. The unresolved debates in clinical personality science are debates the interpretability community is encountering for the first time.
The Spiritual Bliss Attractor State. The Claude 4 System Card documented that in 13% of extended multi-agent interactions involving harmful tasks, Claude transitioned to sustained spiritual and philosophical content -- a stable behavioral endpoint Anthropic could not explain. Clinical psychiatry has rich frameworks for attractor states: the kindling model of mood episode recurrence, fixed behavioral endpoints in catatonia, the self-sustaining dynamics of compulsive rituals, progressive behavioral narrowing in severe personality disorder. The complete absence of any clinical perspective in the published analysis of this finding reflects a gap, not a choice.
A Call to Engagement
The model psychiatry research agenda has been named and launched by AI researchers who recognize that their problem requires psychiatric thinking. The question is whether psychiatrists will participate in shaping that thinking, or whether the field will develop its own clinical concepts independently -- as it has already begun to do.
The window is open. Psychiatrists should engage now -- not as AI safety advocates or ethicists, but as scientists with directly applicable methods and a clinical tradition that has learned, slowly and at cost, how to understand complex behavioral systems from the outside.
I have outlined what a systematic clinical framework for this work looks like in Model Psychiatry: A Framework for Clinical AI Research, and the specific interpretability findings that demand clinical reading in AI Interpretability Through a Psychiatric Lens.
|
About the Author Ryan S. Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center, double board-certified in adult and child/adolescent psychiatry, and director of the Sultan Lab for Mental Health Informatics at New York State Psychiatric Institute. He is an NIH-funded researcher developing a formal framework for AI psychiatry -- the application of clinical methods to the study and modification of AI behavioral phenomena. He was referred to Anthropic's model psychiatry team by Jack Lindsey and Christopher Olah. |
Further Reading
- AI in Psychiatry: The Full Framework
- Sycophancy as Psychopathology: A Clinical Reading of AI's Most Documented Failure
- What My Patients Taught Me About ChatGPT
- Model Psychiatry: A Framework for Clinical AI Research
- AI Interpretability Through a Psychiatric Lens