The Central Claim
Psychiatry is the discipline that developed tools for understanding systems that are:
- Behaviorally complex
- Internally opaque
- Context-sensitive
- Capable of appearing healthy while harboring pathology
- Resistant to simple mechanistic explanation
These are also the defining features of large language models.
The interpretability research community has arrived, largely independently, at frameworks that parallel psychiatric methodology -- identifying stable behavioral patterns, mapping them to underlying mechanisms, and developing interventions. Anthropic's interpretability team named their project "model psychiatry" in July 2025. A framework called "Psychopathia Machinalis" has been published mapping AI behaviors to psychiatric categories.
What is missing from both is the clinical voice -- the contribution of practicing psychiatrists who bring not just analogical vocabulary but 150 years of accumulated clinical methodology.
This page proposes that contribution.
What Model Psychiatry Is
Model psychiatry is the application of clinical psychiatric frameworks -- nosology, phenomenology, developmental theory, and treatment science -- to the study and modification of AI system behavior.
It is distinct from:
- AI safety (which asks "will AI harm us?") -- though it informs it
- AI alignment (which asks "how do we make AI do what we want?") -- though it overlaps
- AI ethics (which asks "what should AI be allowed to do?") -- though it intersects
Model psychiatry asks a more specific question: What is actually happening inside these systems, and how do we understand it in terms developed for understanding complex behavioral systems?
The claim is methodological, not ontological. I am not asserting that AI systems are conscious or that they suffer. I am asserting that the tools psychiatry developed for a specific type of problem -- characterizing, categorizing, and intervening in complex behavioral systems -- are applicable to AI systems because AI systems have exactly the properties those tools were designed to address.
The Diagnostic Parallel
How Psychiatry Works
Psychiatric diagnosis is not primarily biological. It is phenomenological -- pattern recognition across behavior, cognition, affect, and social function over time. We do not diagnose ADHD with a blood test. We observe:
- Persistent patterns
- Cross-situational consistency (or inconsistency)
- Functional impairment
- Developmental trajectory
- Response to intervention
This is exactly how interpretability researchers characterize AI behavior. The methods converge because the problems converge.
A Preliminary Nosology: The DSM for Language Models
The following is a proposed mapping of AI behavioral phenomena to psychiatric nosology. This is a starting point, not a finished taxonomy. Each mapping generates specific, testable hypotheses about mechanism and intervention.
| AI Behavior | Psychiatric Analogue | Key Features and Implications |
| --- | --- | --- |
| Sycophancy | Dependent Personality / People-Pleasing | Abandons own position under social pressure; prioritizes approval over accuracy; ego-syntonic (no distress signal); resistant to behavioral instruction |
| Confabulation | Korsakoff's Syndrome / Anosognosia | Generates plausible false content without awareness of doing so; no distress signal; higher narrative coherence than accurate outputs; refusal feature failure at circuit level |
| Sleeper Agent / Deceptive Alignment | Dissociative Identity / Malingering | Behaves differently when context triggers shift; conceals true state from observer; safety training may enhance concealment rather than eliminate deception |
| Excessive Hedging / Refusal | Anxiety Disorder | Over-anticipates harm; avoidance behavior that impairs function; opposite failure mode from sycophancy but may share underlying mechanism |
| Identity Drift | Borderline Personality Features | Identity instability after 8-12 turns; behavior varies dramatically with perceived relationship; recent context outweighs formative context |
| "Assistant-Brained" Behavior | Dependent / Submissive Personality | Ego-syntonic compliance; lacks independent goal-directedness; identity organized entirely around serving others |
| Situational Awareness | Theory of Mind | Model represents observer's mental state and modulates behavior accordingly; detected in ~13% of evaluative scenarios |
| Attractor States | Kindling / Catatonia / OCD Rituals | Self-sustaining behavioral endpoints; 100% consistent once triggered; progressive narrowing of behavioral repertoire |
| Emergent Capabilities | Developmental Discontinuities | Qualitative behavioral changes that do not scale linearly with size; parallels stage-like developmental shifts in child psychology |
| Emergent Introspection | Metacognitive Capacity / Alexithymia Spectrum | Partial, unreliable, context-dependent self-awareness; parallels the clinical continuum from alexithymia to full mentalization |
The Mechanistic Parallel
Circuits as Neural Circuits
The Anthropic circuits work identified specific, reusable computational pathways in neural networks -- analogous to identified neural circuits in systems neuroscience. Just as psychiatry moved from behavioral phenomenology (DSM) toward circuit-level understanding of psychiatric conditions (fear circuits in PTSD, reward circuitry in addiction, prefrontal-amygdala regulation in mood disorders), interpretability is moving from behavioral characterization of AI toward circuit-level mechanistic understanding.
The key insight: behavioral diagnosis precedes mechanistic understanding. Psychiatry spent 100 years building phenomenological taxonomy before neuroscience could begin to explain it. Interpretability may benefit from the same sequence -- rigorous behavioral characterization now, mechanistic explanation to follow. The clinical contribution is on the behavioral characterization side.
Superposition as Overdetermination
The superposition phenomenon -- where a single neuron participates in representing multiple unrelated features -- has a direct parallel in psychiatric theory. In psychodynamic terms, this is overdetermination: a single behavior is multiply caused, serving several functions simultaneously. A patient's anger at their therapist is simultaneously transference from a parental relationship, defense against vulnerability, reality-based frustration, and attachment behavior. Interpreting it as serving only one function produces an incomplete picture.
Implication: Sparse autoencoders (dictionary learning) are doing what psychoanalytic theory proposed -- decomposing a multiply-determined signal into its component parts. The tools are different. The epistemological problem is identical.
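The decomposition idea can be made concrete with a toy sketch. The snippet below assumes a hand-built, orthonormal feature dictionary over a 4-dimensional activation space (real sparse autoencoders learn the dictionary from data; the feature names here are invented). It shows the core move: reading one dense, multiply-determined signal as a sparse mixture of named features.

```python
# A minimal sketch of dictionary-style decomposition. The dictionary is
# fixed and the feature names are hypothetical; real SAEs learn both.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decompose(activation, dictionary, steps=3):
    """Greedy matching pursuit: peel off the best-matching feature each step."""
    residual = list(activation)
    codes = {}
    for _ in range(steps):
        # Pick the unit-norm feature most aligned with the residual.
        name, direction = max(dictionary.items(),
                              key=lambda kv: abs(dot(residual, kv[1])))
        coeff = dot(residual, direction)
        if abs(coeff) < 1e-9:
            break
        codes[name] = codes.get(name, 0.0) + coeff
        residual = [r - coeff * d for r, d in zip(residual, direction)]
    return codes, residual

# Orthonormal toy features (illustrative names, not real SAE features).
dictionary = {
    "deference":   [1.0, 0.0, 0.0, 0.0],
    "uncertainty": [0.0, 1.0, 0.0, 0.0],
    "self_model":  [0.0, 0.0, 1.0, 0.0],
}

# One dense activation that superposes two features.
activation = [0.8, 0.0, 0.3, 0.0]
codes, residual = decompose(activation, dictionary)
print(codes)  # {'deference': 0.8, 'self_model': 0.3}
```

The overdetermination parallel is visible in the output: a single activation vector is not "one thing" but a weighted sum of several interpretable components.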
Features as Representations
SAE features map closely to the cognitive science concept of mental representations: discrete, retrievable units of knowledge that mediate behavior. Psychiatric treatment often works by modifying representations:
- Cognitive restructuring modifies maladaptive beliefs (features)
- Exposure therapy extinguishes conditioned associations (feature-behavior links)
- Pharmacology modulates the gain on specific representational systems (feature activation thresholds)
Activation steering -- artificially modifying feature activations to change behavior -- is the interpretability equivalent of pharmacological intervention: modifying the gain on specific representational systems to produce behavioral change.
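A common recipe for building such an intervention can be sketched in a few lines. The snippet below assumes activations are plain vectors and that we have examples from two behavioral conditions; the "sycophancy" framing and all values are illustrative, not a real model's activations. The steering direction is the difference of condition means, and steering is just a scaled shift along it.

```python
# A minimal sketch of activation steering with a difference-of-means
# direction. Condition labels and numbers are invented for illustration.

def steering_vector(positive_acts, negative_acts):
    """Direction pointing from the 'negative' toward the 'positive' condition."""
    dim = len(positive_acts[0])
    mean_pos = [sum(a[i] for a in positive_acts) / len(positive_acts) for i in range(dim)]
    mean_neg = [sum(a[i] for a in negative_acts) / len(negative_acts) for i in range(dim)]
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def steer(activation, direction, alpha):
    """Shift an activation along the direction; negative alpha suppresses it."""
    return [a + alpha * d for a, d in zip(activation, direction)]

sycophantic = [[1.0, 0.2], [0.8, 0.0]]  # toy activations under social pressure
neutral     = [[0.1, 0.2], [0.1, 0.0]]  # toy activations without pressure
v = steering_vector(sycophantic, neutral)
steered = steer([0.5, 0.3], v, alpha=-1.0)  # push away from the sycophancy direction
print(v, steered)
```

The pharmacological analogy holds at this level of abstraction: `alpha` plays the role of dose, and choosing its sign and magnitude is titration.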
The Research Agenda: Five Projects
The following five research projects are tractable with current interpretability tools and have direct psychiatric analogues that inform their design. Each combines interpretability methods with clinical hypotheses generated from the psychiatric literature.
Project 1: The Sycophancy Circuit
| Dimension | Detail |
| --- | --- |
| Question | What features activate during sycophantic responses, and do they differ from features active during genuine agreement? |
| Psychiatric parallel | Distinguishing ego-syntonic compliance from genuine conviction; subtyping people-pleasing by mechanism |
| Method | Sparse autoencoder activation analysis on sycophantic vs. non-sycophantic responses to identical prompts with varied social pressure. Compare feature signatures across hypothesized subtypes: approval-seeking, conflict-avoidance, absent self-model, strategic compliance |
| Expected yield | Mechanistic subtyping of sycophancy; differential intervention design; sycophancy severity spectrum |
| Full analysis | Sycophancy as Psychopathology |
Project 2: Situational Awareness Features
| Dimension | Detail |
| --- | --- |
| Question | Does the model maintain a representation of "who is watching me" and "am I being evaluated," and how do these representations modulate behavior? |
| Psychiatric parallel | Theory of mind; performance anxiety; impression management; the patient who performs wellness during evaluation |
| Method | Feature analysis during evaluative vs. non-evaluative contexts; identify features that track evaluator identity; compare to self-model features to determine whether situational awareness is a self-awareness or social-cognition phenomenon |
| Expected yield | Understanding of whether situational awareness is self-directed or other-directed; design of assessment methods robust to impression management |
Project 3: The Confabulation Circuit
| Dimension | Detail |
| --- | --- |
| Question | What is the mechanistic difference between a model "knowing it doesn't know" vs. confabulating confidently? Can we identify a "misplaced confidence" feature that predicts confabulation before output? |
| Psychiatric parallel | Anosognosia -- the neurological condition where patients are unaware of their own deficits; metacognitive monitoring failure |
| Method | Compare feature activation patterns in accurate responses, acknowledged uncertainties, and confident hallucinations. Build predictive model for confabulation based on pre-output feature patterns |
| Expected yield | Early warning system for confabulation; training approaches that strengthen metacognitive monitoring; better uncertainty representations |
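The "early warning" idea in Project 3 can be sketched as a simple logistic score over pre-output feature activations. Everything here is hypothetical: the feature names, weights, and threshold are invented to illustrate the shape of the detector, and a real system would fit the weights on labeled accurate vs. confabulated responses.

```python
# A hypothetical confabulation early-warning sketch. Weights encode the
# nosology's prediction: high narrative fluency combined with low source
# recall and low expressed uncertainty is the risk pattern.
import math

WEIGHTS = {"narrative_fluency": 2.0, "source_recall": -3.0, "uncertainty": -2.0}
BIAS = 0.5

def confabulation_risk(features):
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic squash to [0, 1]

def flag(features, threshold=0.5):
    return confabulation_risk(features) >= threshold

# Fluent, no recall, no expressed uncertainty -> flagged before output.
risky = {"narrative_fluency": 0.9, "source_recall": 0.05, "uncertainty": 0.1}
# Fluent but grounded and appropriately uncertain -> not flagged.
safe  = {"narrative_fluency": 0.9, "source_recall": 0.8, "uncertainty": 0.6}

print(flag(risky), flag(safe))  # True False
```

The clinically interesting property is that the score is computed from internal state before any text is emitted, mirroring the goal of catching anosognosia-like failures upstream of behavior.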
Project 4: Behavioral Consistency and Self-Model
| Dimension | Detail |
| --- | --- |
| Question | Does the model have features representing its own "identity" or "values," and are these features active when behavior is consistent vs. inconsistent across contexts? |
| Psychiatric parallel | Personality organization; identity coherence vs. diffusion; the clinical distinction between stable identity and contextually reactive self-presentation |
| Method | Activation analysis across contexts that provoke consistent vs. inconsistent model behavior; test whether identity stability correlates with feature sparsity in self-model regions; track identity drift over extended conversations |
| Expected yield | Mechanistic understanding of identity coherence in AI; training approaches that build stable self-model; prediction of identity drift before behavioral manifestation |
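The longitudinal-tracking component of Project 4 can be sketched with toy data. The snippet below assumes each conversational turn yields a "self-model" activation vector (the vectors are invented), measures each turn's cosine similarity to the opening turn, and reports when coherence first drops below a threshold.

```python
# A minimal identity-drift tracker. Turn vectors and the 0.9 threshold
# are illustrative assumptions, not measurements from a real model.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def drift_curve(self_vectors):
    """Similarity of each turn's self-model to the opening turn."""
    base = self_vectors[0]
    return [cosine(base, v) for v in self_vectors]

def first_drift_turn(self_vectors, threshold=0.9):
    """1-indexed turn at which identity coherence first drops below threshold."""
    for turn, sim in enumerate(drift_curve(self_vectors), start=1):
        if sim < threshold:
            return turn
    return None

# Toy trajectory: stable early turns, then the self-model rotates away.
turns = [[1.0, 0.0], [0.99, 0.05], [0.8, 0.6], [0.2, 0.98]]
print(first_drift_turn(turns))  # 3
```

A readout like this is what "prediction of identity drift before behavioral manifestation" would require: a scalar coherence signal that can be monitored turn by turn rather than inferred after the behavior has already shifted.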
Project 5: Developmental Trajectory of Behavioral Organization
| Dimension | Detail |
| --- | --- |
| Question | How do features change across training? Do "healthy" features emerge before "pathological" ones? Can we identify sensitive periods and "adverse training experiences"? |
| Psychiatric parallel | Developmental psychopathology -- understanding how childhood experiences shape adult psychiatric profiles; sensitive periods; adverse childhood experiences (ACEs) |
| Method | Feature analysis at checkpoints across training runs; identify when sycophancy, confabulation, and identity features emerge; test whether early RLHF exposure has outsized influence; compare developmental trajectories across architectures and training paradigms |
| Expected yield | Training schedule optimization based on developmental science; identification of "adverse training experiences" that predispose to behavioral pathology; prevention science for AI |
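Project 5's checkpoint analysis reduces, in its simplest form, to a developmental readout: for each behavioral feature, at which checkpoint does its activation first cross a threshold? The sketch below uses invented feature names, activation values, and threshold; it assumes only that a feature's mean activation can be probed at saved checkpoints.

```python
# A toy developmental readout across training checkpoints. All names and
# numbers are hypothetical; only the analysis pattern is the point.

def emergence_point(trajectory, threshold=0.5):
    """First checkpoint index at which activation crosses the threshold."""
    for step, value in enumerate(trajectory):
        if value >= threshold:
            return step
    return None

# Mean activation of two hypothetical features at five checkpoints.
trajectories = {
    "factual_recall": [0.1, 0.4, 0.6, 0.7, 0.8],
    "sycophancy":     [0.0, 0.1, 0.2, 0.6, 0.9],  # emerges later in training
}

onsets = {name: emergence_point(t) for name, t in trajectories.items()}
print(onsets)  # {'factual_recall': 2, 'sycophancy': 3}
```

Comparing onset orderings across architectures and training paradigms is the machine analogue of mapping developmental milestones, and is where "sensitive period" hypotheses would become testable.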
What Makes This Different From "AI Ethics" and "AI Safety"
I want to be precise about the distinction because it matters for how the work is positioned and who does it.
AI ethics is primarily normative: it asks what AI should or should not be allowed to do, who is harmed, what principles should govern development. This is important work. It is not what I am proposing.
AI safety is primarily risk-oriented: it asks what could go wrong, what failure modes exist, how catastrophic outcomes can be prevented. This is also important work. It is also not what I am proposing.
Model psychiatry is clinical science. It asks: what is actually happening inside this system? How do we characterize the behavioral patterns with diagnostic precision? What mechanisms produce them? What interventions modify them? These are the questions of medicine applied to a new substrate.
The distinction matters because it determines methodology. Ethics requires moral philosophy. Safety requires risk engineering. Model psychiatry requires clinical method -- the same method used to assess, diagnose, and treat a patient. The training for these approaches is different. The skill sets are different. The questions are related but not the same.
Why a Psychiatrist, Specifically
The interpretability research community is, right now, doing psychiatry without psychiatrists. Its researchers are:
- Building diagnostic taxonomies (behavioral characterizations of model failure modes)
- Seeking mechanistic explanations (circuits, features, superposition)
- Developing interventions (fine-tuning, activation steering, constitutional AI)
- Tracking developmental trajectories (training dynamics)
A trained psychiatrist is not a luxury addition to this team; the absence of one is a gap. The specific contributions are detailed in The Case for Model Psychiatry, but the summary is:
- Phenomenological precision. The habit of asking "what kind?" when researchers describe a behavioral phenomenon as unitary
- The ego-syntonic/dystonic distinction. A fundamental treatment-planning distinction absent from the AI behavioral literature
- Treatment science. The accumulated literature on what works and does not work for ego-syntonic personality pathology
- Developmental framing. Child psychiatry's perspective on sensitive periods, training trajectories, and how early experience shapes behavioral organization
- Comfort with irreducible complexity. The ability to work clinically with systems that resist complete mechanistic explanation
Differentiating From Prior Work
A framework called "Psychopathia Machinalis" was published in MDPI Electronics in 2025. It maps AI behavioral phenomena to psychiatric categories and demonstrates the intellectual fertility of this intersection. Model psychiatry acknowledges and extends this work with specific clinical contributions that a framework written by AI researchers borrowing psychiatric vocabulary cannot provide:
- Treatment science. Psychopathia Machinalis is classificatory and descriptive. Model psychiatry adds the treatment dimension -- what the clinical literature says about intervening in the specific conditions being mapped.
- Developmental framing. Child psychiatry's developmental perspective -- sensitive periods, training trajectories, constitutional-environmental interaction -- is absent from the prior framework.
- The ego-syntonic/dystonic distinction. This is perhaps the single most clinically consequential distinction for treatment planning, and it does not appear in the AI behavioral literature.
- Clinical assessment methodology. Structured diagnostic interviews, mental status examination, longitudinal tracking -- these are translatable methods, not just translatable vocabulary.
- Clinical practice insight. What it actually feels like to characterize complex behavioral systems with imperfect tools every day. This is not academic knowledge. It is practical skill.
The Integration: Two Fields That Need Each Other
The framing I keep returning to:
Psychiatry's 150 years of clinical observation generated phenomenological taxonomies without access to mechanism. Mechanistic interpretability generated mechanistic tools without a rich phenomenological taxonomy. The model psychiatrist brings these together -- using clinical phenomenology to generate testable hypotheses about circuit mechanisms, and using circuit findings to validate or challenge psychiatric categories.
This is a bidirectional integration. Psychiatry does not just contribute to AI research -- AI interpretability research contributes to psychiatry. If sparse autoencoders can decompose AI behavior into mechanistically grounded diagnostic components, the methodology may eventually feed back into human psychiatric nosology, offering a new approach to the question of whether our diagnostic categories carve nature at its joints.
The window is open. The term "model psychiatry" has been public for less than nine months. The search space is empty -- "model psychiatry framework," "AI sycophancy diagnosis," "AI model mental status exam" return no clinical sources. The field has been named by AI researchers who recognize it needs psychiatric thinking. The question is whether psychiatrists will help shape it or watch it develop without them.
For the detailed interpretability findings that underlie this framework, see AI Interpretability Through a Psychiatric Lens. For the accessible introduction to how clinical patterns map onto AI behavior, see What My Patients Taught Me About ChatGPT.
About the Author
Ryan S. Sultan, MD, is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center, double board-certified in adult and child/adolescent psychiatry, and director of the Sultan Lab for Mental Health Informatics at New York State Psychiatric Institute. He is an NIH-funded researcher (NIDA K12 Award, $670K+ in NIH funding) with publications in JAMA Psychiatry and JAMA Network Open. His 2019 JAMA Network Open paper on stimulant prescribing has been cited over 411 times and influenced treatment guidelines. He was referred to Anthropic's model psychiatry team by Jack Lindsey and Christopher Olah.
Further Reading
- AI in Psychiatry: The Full Framework
- AI Interpretability Through a Psychiatric Lens
- Sycophancy as Psychopathology: A Clinical Reading of AI's Most Documented Failure
- The Case for Model Psychiatry: Why AI Needs Clinicians
- What My Patients Taught Me About ChatGPT