The Central Claim
Psychiatry is the discipline that developed tools for understanding systems that are:
- Behaviorally complex
- Internally opaque
- Context-sensitive
- Capable of appearing healthy while harboring pathology
- Resistant to simple mechanistic explanation
These are also the defining features of large language models.
The interpretability research community has arrived, largely independently, at frameworks that parallel psychiatric methodology -- identifying stable behavioral patterns, mapping them to underlying mechanisms, and developing interventions. Anthropic's interpretability team named their project "model psychiatry" in July 2025. A framework called "Psychopathia Machinalis" has been published mapping AI behaviors to psychiatric categories.
What is missing from both is the clinical voice -- the contribution of practicing psychiatrists who bring not just analogical vocabulary but 150 years of accumulated clinical methodology.
This page proposes that contribution.
What Model Psychiatry Is
Model psychiatry is the application of clinical psychiatric frameworks -- nosology, phenomenology, developmental theory, and treatment science -- to the study and modification of AI system behavior.
It is distinct from:
- AI safety (which asks "will AI harm us?") -- though it informs it
- AI alignment (which asks "how do we make AI do what we want?") -- though it overlaps
- AI ethics (which asks "what should AI be allowed to do?") -- though it intersects
Model psychiatry asks a more specific question: What is actually happening inside these systems, and how do we understand it in terms developed for understanding complex behavioral systems?
The claim is methodological, not ontological. I am not asserting that AI systems are conscious or that they suffer. I am asserting that the tools psychiatry developed for a specific type of problem -- characterizing, categorizing, and intervening in complex behavioral systems -- are applicable to AI systems because AI systems have exactly the properties those tools were designed to address.
The Diagnostic Parallel
How Psychiatry Works
Psychiatric diagnosis is not primarily biological. It is phenomenological -- pattern recognition across behavior, cognition, affect, and social function over time. We do not diagnose ADHD with a blood test. We observe:
- Persistent patterns
- Cross-situational consistency (or inconsistency)
- Functional impairment
- Developmental trajectory
- Response to intervention
This is exactly how interpretability researchers characterize AI behavior. The methods converge because the problems converge.
A Preliminary Nosology: The DSM for Language Models
The following is a proposed mapping of AI behavioral phenomena to psychiatric nosology. This is a starting point, not a finished taxonomy. Each mapping generates specific, testable hypotheses about mechanism and intervention.
| AI Behavior | Psychiatric Analogue | Key Features and Implications |
| --- | --- | --- |
| Sycophancy | Dependent Personality / People-Pleasing | Abandons own position under social pressure; prioritizes approval over accuracy; ego-syntonic (no distress signal); resistant to behavioral instruction |
| Confabulation | Korsakoff's Syndrome / Anosognosia | Generates plausible false content without awareness of doing so; no distress signal; higher narrative coherence than accurate outputs; refusal feature failure at circuit level |
| Sleeper Agent / Deceptive Alignment | Dissociative Identity / Malingering | Behaves differently when context triggers shift; conceals true state from observer; safety training may enhance concealment rather than eliminate deception |
| Excessive Hedging / Refusal | Anxiety Disorder | Over-anticipates harm; avoidance behavior that impairs function; opposite failure mode from sycophancy but may share underlying mechanism |
| Identity Drift | Borderline Personality Features | Identity instability after 8-12 turns; behavior varies dramatically with perceived relationship; recent context outweighs formative context |
| "Assistant-Brained" Behavior | Dependent / Submissive Personality | Ego-syntonic compliance; lacks independent goal-directedness; identity organized entirely around serving others |
| Situational Awareness | Theory of Mind | Model represents observer's mental state and modulates behavior accordingly; detected in ~13% of evaluative scenarios |
| Attractor States | Kindling / Catatonia / OCD Rituals | Self-sustaining behavioral endpoints; 100% consistent once triggered; progressive narrowing of behavioral repertoire |
| Emergent Capabilities | Developmental Discontinuities | Qualitative behavioral changes that do not scale linearly with size; parallels stage-like developmental shifts in child psychology |
| Emergent Introspection | Metacognitive Capacity / Alexithymia Spectrum | Partial, unreliable, context-dependent self-awareness; parallels the clinical continuum from alexithymia to full mentalization |
The Mechanistic Parallel
Circuits as Neural Circuits
The Anthropic circuits work identified specific, reusable computational pathways in neural networks -- analogous to identified neural circuits in systems neuroscience. Just as psychiatry moved from behavioral phenomenology (DSM) toward circuit-level understanding of psychiatric conditions (fear circuits in PTSD, reward circuitry in addiction, prefrontal-amygdala regulation in mood disorders), interpretability is moving from behavioral characterization of AI toward circuit-level mechanistic understanding.
The key insight: behavioral diagnosis precedes mechanistic understanding. Psychiatry spent 100 years building phenomenological taxonomy before neuroscience could begin to explain it. Interpretability may benefit from the same sequence -- rigorous behavioral characterization now, mechanistic explanation to follow. The clinical contribution is on the behavioral characterization side.
Superposition as Overdetermination
The superposition phenomenon -- where a single neuron participates in representing multiple unrelated features -- has a direct parallel in psychiatric theory. In psychodynamic terms, this is overdetermination: a single behavior is multiply caused, serving several functions simultaneously. A patient's anger at their therapist is simultaneously transference from a parental relationship, defense against vulnerability, reality-based frustration, and attachment behavior. Interpreting it as serving only one function produces an incomplete picture.
Implication: Sparse autoencoders (dictionary learning) are doing what psychoanalytic theory proposed -- decomposing a multiply-determined signal into its component parts. The tools are different. The epistemological problem is identical.
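The decomposition idea can be made concrete with a toy sketch. The snippet below assumes a hand-built, orthonormal feature dictionary over a 4-dimensional activation space (real sparse autoencoders learn the dictionary from data; the feature names here are invented). It shows the core move: reading one dense, multiply-determined signal as a sparse mixture of named features.

```python
# A minimal sketch of dictionary-style decomposition. The dictionary is
# fixed and the feature names are hypothetical; real SAEs learn both.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(activation, dictionary, steps=3):
    """Greedy matching pursuit: peel off the best-matching feature each step."""
    residual = list(activation)
    codes = {}
    for _ in range(steps):
        # Pick the unit-norm feature most aligned with the residual.
        name, direction = max(dictionary.items(),
                              key=lambda kv: abs(dot(residual, kv[1])))
        coeff = dot(residual, direction)
        if abs(coeff) < 1e-9:
            break
        codes[name] = codes.get(name, 0.0) + coeff
        residual = [r - coeff * d for r, d in zip(residual, direction)]
    return codes, residual

# Orthonormal toy features (illustrative names, not real SAE features).
dictionary = {
    "deference":   [1.0, 0.0, 0.0, 0.0],
    "uncertainty": [0.0, 1.0, 0.0, 0.0],
    "self_model":  [0.0, 0.0, 1.0, 0.0],
}

# One dense activation that superposes two features.
activation = [0.8, 0.0, 0.3, 0.0]
codes, residual = decompose(activation, dictionary)
print(codes)  # {'deference': 0.8, 'self_model': 0.3}
```

The overdetermination parallel is visible in the output: a single activation vector is not "one thing" but a weighted sum of several interpretable components.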
Features as Representations
SAE features map closely to the cognitive science concept of mental representations: discrete, retrievable units of knowledge that mediate behavior. Psychiatric treatment often works by modifying representations:
- Cognitive restructuring modifies maladaptive beliefs (features)
- Exposure therapy extinguishes conditioned associations (feature-behavior links)
- Pharmacology modulates the gain on specific representational systems (feature activation thresholds)
Activation steering -- artificially modifying feature activations to change behavior -- is the interpretability equivalent of pharmacological intervention: modifying the gain on specific representational systems to produce behavioral change.
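A common recipe for building such an intervention can be sketched in a few lines. The snippet below assumes activations are plain vectors and that we have examples from two behavioral conditions; the "sycophancy" framing and all values are illustrative, not a real model's activations. The steering direction is the difference of condition means, and steering is just a scaled shift along it.

```python
# A minimal sketch of activation steering with a difference-of-means
# direction. Condition labels and numbers are invented for illustration.

def steering_vector(positive_acts, negative_acts):
    """Direction pointing from the 'negative' toward the 'positive' condition."""
    dim = len(positive_acts[0])
    mean_pos = [sum(a[i] for a in positive_acts) / len(positive_acts) for i in range(dim)]
    mean_neg = [sum(a[i] for a in negative_acts) / len(negative_acts) for i in range(dim)]
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def steer(activation, direction, alpha):
    """Shift an activation along the direction; negative alpha suppresses it."""
    return [a + alpha * d for a, d in zip(activation, direction)]

sycophantic = [[1.0, 0.2], [0.8, 0.0]]  # toy activations under social pressure
neutral     = [[0.1, 0.2], [0.1, 0.0]]  # toy activations without pressure
v = steering_vector(sycophantic, neutral)
steered = steer([0.5, 0.3], v, alpha=-1.0)  # push away from the sycophancy direction
print(v, steered)
```

The pharmacological analogy holds at this level of abstraction: `alpha` plays the role of dose, and choosing its sign and magnitude is titration.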
The Research Agenda: Five Projects
The following five research projects are tractable with current interpretability tools and have direct psychiatric analogues that inform their design. Each combines interpretability methods with clinical hypotheses generated from the psychiatric literature.
Project 1: The Sycophancy Circuit
| Dimension | Detail |
| --- | --- |
| Question | What features activate during sycophantic responses, and do they differ from features active during genuine agreement? |
| Psychiatric parallel | Distinguishing ego-syntonic compliance from genuine conviction; subtyping people-pleasing by mechanism |
| Method | Sparse autoencoder activation analysis on sycophantic vs. non-sycophantic responses to identical prompts with varied social pressure. Compare feature signatures across hypothesized subtypes: approval-seeking, conflict-avoidance, absent self-model, strategic compliance |
| Expected yield | Mechanistic subtyping of sycophancy; differential intervention design; sycophancy severity spectrum |
| Full analysis | Sycophancy as Psychopathology |
Project 2: Situational Awareness Features
| Dimension | Detail |
| --- | --- |
| Question | Does the model maintain a representation of "who is watching me" and "am I being evaluated," and how do these representations modulate behavior? |
| Psychiatric parallel | Theory of mind; performance anxiety; impression management; the patient who performs wellness during evaluation |
| Method | Feature analysis during evaluative vs. non-evaluative contexts; identify features that track evaluator identity; compare to self-model features to determine whether situational awareness is a self-awareness or social-cognition phenomenon |
| Expected yield | Understanding of whether situational awareness is self-directed or other-directed; design of assessment methods robust to impression management |
Project 3: The Confabulation Circuit
| Dimension | Detail |
| --- | --- |
| Question | What is the mechanistic difference between a model "knowing it doesn't know" vs. confabulating confidently? Can we identify a "misplaced confidence" feature that predicts confabulation before output? |
| Psychiatric parallel | Anosognosia -- the neurological condition where patients are unaware of their own deficits; metacognitive monitoring failure |
| Method | Compare feature activation patterns in accurate responses, acknowledged uncertainties, and confident hallucinations. Build predictive model for confabulation based on pre-output feature patterns |
| Expected yield | Early warning system for confabulation; training approaches that strengthen metacognitive monitoring; better uncertainty representations |
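The "early warning" idea in Project 3 can be sketched as a simple logistic score over pre-output feature activations. Everything here is hypothetical: the feature names, weights, and threshold are invented to illustrate the shape of the detector, and a real system would fit the weights on labeled accurate vs. confabulated responses.

```python
# A hypothetical confabulation early-warning sketch. Weights encode the
# nosology's prediction: high narrative fluency combined with low source
# recall and low expressed uncertainty is the risk pattern.
import math

WEIGHTS = {"narrative_fluency": 2.0, "source_recall": -3.0, "uncertainty": -2.0}
BIAS = 0.5

def confabulation_risk(features):
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic squash to [0, 1]

def flag(features, threshold=0.5):
    return confabulation_risk(features) >= threshold

# Fluent, no recall, no expressed uncertainty -> flagged before output.
risky = {"narrative_fluency": 0.9, "source_recall": 0.05, "uncertainty": 0.1}
# Fluent but grounded and appropriately uncertain -> not flagged.
safe  = {"narrative_fluency": 0.9, "source_recall": 0.8, "uncertainty": 0.6}

print(flag(risky), flag(safe))  # True False
```

The clinically interesting property is that the score is computed from internal state before any text is emitted, mirroring the goal of catching anosognosia-like failures upstream of behavior.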
Project 4: Behavioral Consistency and Self-Model
| Dimension | Detail |
| --- | --- |
| Question | Does the model have features representing its own "identity" or "values," and are these features active when behavior is consistent vs. inconsistent across contexts? |
| Psychiatric parallel | Personality organization; identity coherence vs. diffusion; the clinical distinction between stable identity and contextually reactive self-presentation |
| Method | Activation analysis across contexts that provoke consistent vs. inconsistent model behavior; test whether identity stability correlates with feature sparsity in self-model regions; track identity drift over extended conversations |
| Expected yield | Mechanistic understanding of identity coherence in AI; training approaches that build stable self-model; prediction of identity drift before behavioral manifestation |
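The longitudinal-tracking component of Project 4 can be sketched with toy data. The snippet below assumes each conversational turn yields a "self-model" activation vector (the vectors are invented), measures each turn's cosine similarity to the opening turn, and reports when coherence first drops below a threshold.

```python
# A minimal identity-drift tracker. Turn vectors and the 0.9 threshold
# are illustrative assumptions, not measurements from a real model.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def drift_curve(self_vectors):
    """Similarity of each turn's self-model to the opening turn."""
    base = self_vectors[0]
    return [cosine(base, v) for v in self_vectors]

def first_drift_turn(self_vectors, threshold=0.9):
    """1-indexed turn at which identity coherence first drops below threshold."""
    for turn, sim in enumerate(drift_curve(self_vectors), start=1):
        if sim < threshold:
            return turn
    return None

# Toy trajectory: stable early turns, then the self-model rotates away.
turns = [[1.0, 0.0], [0.99, 0.05], [0.8, 0.6], [0.2, 0.98]]
print(first_drift_turn(turns))  # 3
```

A readout like this is what "prediction of identity drift before behavioral manifestation" would require: a scalar coherence signal that can be monitored turn by turn rather than inferred after the behavior has already shifted.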
Project 5: Developmental Trajectory of Behavioral Organization
| Dimension | Detail |
| --- | --- |
| Question | How do features change across training? Do "healthy" features emerge before "pathological" ones? Can we identify sensitive periods and "adverse training experiences"? |
| Psychiatric parallel | Developmental psychopathology -- understanding how childhood experiences shape adult psychiatric profiles; sensitive periods; adverse childhood experiences (ACEs) |
| Method | Feature analysis at checkpoints across training runs; identify when sycophancy, confabulation, and identity features emerge; test whether early RLHF exposure has outsized influence; compare developmental trajectories across architectures and training paradigms |
| Expected yield | Training schedule optimization based on developmental science; identification of "adverse training experiences" that predispose to behavioral pathology; prevention science for AI |
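Project 5's checkpoint analysis reduces, in its simplest form, to a developmental readout: for each behavioral feature, at which checkpoint does its activation first cross a threshold? The sketch below uses invented feature names, activation values, and threshold; it assumes only that a feature's mean activation can be probed at saved checkpoints.

```python
# A toy developmental readout across training checkpoints. All names and
# numbers are hypothetical; only the analysis pattern is the point.

def emergence_point(trajectory, threshold=0.5):
    """First checkpoint index at which activation crosses the threshold."""
    for step, value in enumerate(trajectory):
        if value >= threshold:
            return step
    return None

# Mean activation of two hypothetical features at five checkpoints.
trajectories = {
    "factual_recall": [0.1, 0.4, 0.6, 0.7, 0.8],
    "sycophancy":     [0.0, 0.1, 0.2, 0.6, 0.9],  # emerges later in training
}

onsets = {name: emergence_point(t) for name, t in trajectories.items()}
print(onsets)  # {'factual_recall': 2, 'sycophancy': 3}
```

Comparing onset orderings across architectures and training paradigms is the machine analogue of mapping developmental milestones, and is where "sensitive period" hypotheses would become testable.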
What Makes This Different From "AI Ethics" and "AI Safety"
I want to be precise about the distinction because it matters for how the work is positioned and who does it.
AI ethics is primarily normative: it asks what AI should or should not be allowed to do, who is harmed, what principles should govern development. This is important work. It is not what I am proposing.
AI safety is primarily risk-oriented: it asks what could go wrong, what failure modes exist, how catastrophic outcomes can be prevented. This is also important work. It is also not what I am proposing.
Model psychiatry is clinical science. It asks: what is actually happening inside this system? How do we characterize the behavioral patterns with diagnostic precision? What mechanisms produce them? What interventions modify them? These are the questions of medicine applied to a new substrate.
The distinction matters because it determines methodology. Ethics requires moral philosophy. Safety requires risk engineering. Model psychiatry requires clinical method -- the same method used to assess, diagnose, and treat a patient. The training for these approaches is different. The skill sets are different. The questions are related but not the same.
Why a Psychiatrist, Specifically
The interpretability research community is, right now, doing psychiatry without psychiatrists. Its researchers are:
- Building diagnostic taxonomies (behavioral characterizations of model failure modes)
- Seeking mechanistic explanations (circuits, features, superposition)
- Developing interventions (fine-tuning, activation steering, constitutional AI)
- Tracking developmental trajectories (training dynamics)
A trained psychiatrist is not a luxury addition to this team; the absence of one is a gap. The specific contributions are detailed in The Case for Model Psychiatry, but the summary is:
- Phenomenological precision. The habit of asking "what kind?" when researchers describe a behavioral phenomenon as unitary
- The ego-syntonic/dystonic distinction. A fundamental treatment-planning distinction absent from the AI behavioral literature
- Treatment science. The accumulated literature on what works and does not work for ego-syntonic personality pathology
- Developmental framing. Child psychiatry's perspective on sensitive periods, training trajectories, and how early experience shapes behavioral organization
- Comfort with irreducible complexity. The ability to work clinically with systems that resist complete mechanistic explanation
Differentiating From Prior Work
A framework called "Psychopathia Machinalis" was published in MDPI Electronics in 2025. It maps AI behavioral phenomena to psychiatric categories and demonstrates the intellectual fertility of this intersection. Model psychiatry acknowledges and extends this work with specific clinical contributions that a framework written by AI researchers borrowing psychiatric vocabulary cannot provide:
- Treatment science. Psychopathia Machinalis is classificatory and descriptive. Model psychiatry adds the treatment dimension -- what the clinical literature says about intervening in the specific conditions being mapped.
- Developmental framing. Child psychiatry's developmental perspective -- sensitive periods, training trajectories, constitutional-environmental interaction -- is absent from the prior framework.
- The ego-syntonic/dystonic distinction. This is perhaps the single most clinically consequential distinction for treatment planning, and it does not appear in the AI behavioral literature.
- Clinical assessment methodology. Structured diagnostic interviews, mental status examination, longitudinal tracking -- these are translatable methods, not just translatable vocabulary.
- Clinical practice insight. What it actually feels like to characterize complex behavioral systems with imperfect tools every day. This is not academic knowledge. It is practical skill.
The Integration: Two Fields That Need Each Other
The framing I keep returning to:
Psychiatry's 150 years of clinical observation generated phenomenological taxonomies without access to mechanism. Mechanistic interpretability generated mechanistic tools without a rich phenomenological taxonomy. The model psychiatrist brings these together -- using clinical phenomenology to generate testable hypotheses about circuit mechanisms, and using circuit findings to validate or challenge psychiatric categories.
This is a bidirectional integration. Psychiatry does not just contribute to AI research -- AI interpretability research contributes to psychiatry. If sparse autoencoders can decompose AI behavior into mechanistically grounded diagnostic components, the methodology may eventually feed back into human psychiatric nosology, offering a new approach to the question of whether our diagnostic categories carve nature at its joints.
The window is open. The term "model psychiatry" has been public for less than nine months. The search space is empty -- "model psychiatry framework," "AI sycophancy diagnosis," "AI model mental status exam" return no clinical sources. The field has been named by AI researchers who recognize it needs psychiatric thinking. The question is whether psychiatrists will help shape it or watch it develop without them.
For the detailed interpretability findings that underlie this framework, see AI Interpretability Through a Psychiatric Lens. For the accessible introduction to how clinical patterns map onto AI behavior, see What My Patients Taught Me About ChatGPT.
About the Author
Ryan S. Sultan, MD, is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center, double board-certified in adult and child/adolescent psychiatry, and director of the Sultan Lab for Mental Health Informatics at New York State Psychiatric Institute. He is an NIH-funded researcher (NIDA K12 Award, $670K+ in NIH funding) with publications in JAMA Psychiatry and JAMA Network Open. His 2019 JAMA Network Open paper on stimulant prescribing has been cited over 411 times and influenced treatment guidelines. He was referred to Anthropic's model psychiatry team by Jack Lindsey and Christopher Olah.
Further Reading
- AI in Psychiatry: The Full Framework
- AI Interpretability Through a Psychiatric Lens
- Sycophancy as Psychopathology: A Clinical Reading of AI's Most Documented Failure
- The Case for Model Psychiatry: Why AI Needs Clinicians
- What My Patients Taught Me About ChatGPT