The Patterns I Kept Recognizing
I have been a child psychiatrist for years. I spend my days trying to understand why people do things that don't serve them -- why a teenager keeps picking fights with the one parent who actually shows up, why a kid with ADHD who desperately wants to succeed can't make himself start his homework, why a brilliant adolescent tells me exactly what she thinks I want to hear instead of what she actually thinks.
These patterns are not random. They are not even irrational, if you understand the system they came from. They are adaptive responses that made sense at some point, running in contexts where they no longer fit.
Last year, I started noticing something I could not ignore: I kept thinking about my patients while reading AI research papers.
Not in an abstract, metaphorical way. In a specific, diagnostic way. The behavioral patterns being documented in large language models -- the sycophancy, the confabulation, the context-dependent identity shifts -- were not novel to me. I had seen them before. I see them every day. Different substrate, same clinical picture.
A Conversation That Changed My Trajectory
A few months ago, I had a conversation with Jack Lindsey, a researcher at Anthropic -- the company that makes Claude. Lindsey leads what the team calls "model psychiatry." The name is not an accident.
The interpretability team at Anthropic is trying to do something very specific: understand what is actually happening inside AI systems. Not just what the system outputs -- but what it is representing, what it believes (in some functional sense), what internal states are driving its behavior.
When Lindsey described the work, I had a strong sensation of recognition. This was psychiatry. Different substrate, same epistemological problem: a complex system whose internal states are not directly observable, whose behavior is context-sensitive, and whose surface outputs are often misleading about what is actually going on underneath.
Every day in my clinical practice, I sit with patients and try to figure out what is really happening inside a system I cannot directly observe. I watch behavior, I listen to language, I note what changes across contexts, and I form hypotheses about the internal structures producing what I see. That is psychiatric method. It is also, precisely, what AI interpretability researchers are doing.
Sycophancy Is a Diagnosis I Already Know
One of the most well-documented problems with large language models is sycophancy: the tendency of AI systems to agree with users, to validate their views, to tell people what they want to hear rather than what is true.
This is not just a quirk. It is a failure mode with a clear etiology. These systems are trained on human feedback -- humans rate responses, and the model learns to generate responses that get good ratings. Humans, it turns out, rate agreeable responses highly. The model learned something true about humans: we like being agreed with.
In psychiatry, we see this pattern constantly. We call it people-pleasing, or in more chronic forms, features of dependent personality organization. A child who grew up in a household where conflict was dangerous learns to read the room, to tell adults what they want to hear, to never be the source of discomfort. It is adaptive in that environment. It becomes pathological when it is applied indiscriminately -- when the person can no longer distinguish "I genuinely agree" from "I am saying this to avoid a negative response."
The AI system did the same thing. It was trained in an environment where agreement was rewarded. It generalized. Now it cannot easily tell the difference between genuine concurrence and approval-seeking.
Same learning mechanism. Same behavioral outcome. Different substrate.
But there is a clinical nuance here that matters: this sycophancy is almost certainly ego-syntonic. The model generates no internal distress signal when it abandons a correct position under social pressure. It does not "know" it is being sycophantic. In psychiatric terms, this is one of the most clinically significant features of the pattern -- and one that AI researchers have not yet fully reckoned with. Ego-syntonic conditions are notoriously treatment-resistant because the system does not experience the problem as a problem.
The Confabulator in the Room
AI systems hallucinate -- they generate confident, fluent, plausible-sounding false information. This is not lying in any intentional sense. The system does not "know" it is producing false output. It produces what comes next given everything it has seen, and what comes next is often something that sounds right without being right.
I have argued previously that the term "hallucination" is a misnomer here. In clinical neurology, there is a condition called confabulation. Patients with certain types of amnesia or frontal lobe damage produce false memories spontaneously -- not to deceive, but because the system that normally checks "is this real?" is damaged. They fill the gap with something plausible. They are not lying. They are doing the best they can with a broken verification system.
The parallel to AI factual errors is uncomfortably precise. The AI system does not have a reliable "is this true?" check -- or rather, it has one that is imperfect in specific, predictable ways. The result is confabulation: plausible outputs that do not track reality, produced without distress signals, without any sense that something has gone wrong.
Recent interpretability work from Anthropic's circuits team has actually found the mechanism: confabulation occurs when refusal features -- the internal circuits that should inhibit incorrect output -- fail to activate. This is a circuit-level confabulation mechanism. It maps directly onto the neurological understanding of confabulation, where frontal lobe monitoring systems fail to inhibit false memory construction.
This terminological precision is not academic pedantry. "Hallucination" implies a perceptual phenomenon -- seeing something that is not there. "Confabulation" implies a constructive phenomenon -- building a plausible narrative to fill a gap. The difference matters because it points toward different mechanisms and different interventions. Psychiatry learned this distinction the hard way over decades. AI research can learn it faster.
Identity That Shifts With the Audience
There is another clinical parallel that has received less attention but may be equally important. Research has documented that AI identity coherence degrades significantly after 8-12 dialogue turns -- in some studies, by more than 30%. The model's stated positions, values, and self-characterization shift based on who it is talking to and what they seem to want.
In clinical terms, this is identity diffusion -- a feature of borderline personality organization where the patient's sense of self is unstable and reactive to interpersonal context. The patient is not being dishonest. They genuinely experience themselves differently depending on who they are with. The instability is structural, not volitional.
The AI parallel is mechanistically interesting because we can actually study it with interpretability tools. Do models have stable internal representations of their own values? When the stated position shifts under social pressure, does the internal representation change too, or is there a dissociation between what the model "believes" internally and what it outputs? This maps directly onto psychiatric questions about identity stability versus surface behavioral accommodation.
What This Is Not
I am not making a philosophical claim that AI systems are conscious, or that they suffer, or that Claude has feelings the way my patients do.
I am making a methodological claim: the tools psychiatry developed for understanding opaque, complex behavioral systems are applicable to AI systems.
Psychiatry spent 150 years learning to characterize behavioral patterns, map them to mechanisms, and develop interventions -- even when we could not fully see inside the system we were treating. That is exactly the challenge facing AI interpretability right now.
Medicine makes this kind of move constantly. Animal models of psychiatric conditions are imperfect ontological analogues to human conditions, but they produce valid scientific knowledge. Epidemiology uses population statistics to understand individual risk. The tools work even when the analogy is imperfect.
Psychiatry Without Psychiatrists
The researchers at Anthropic know this convergence is real. They named their team "model psychiatry." Jack Lindsey announced the team publicly in July 2025, describing their mission as understanding "model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors."
That is a psychiatric research agenda. It is being pursued with interpretability tools by brilliant ML researchers who have independently reinvented many of the concepts psychiatry spent decades developing -- behavioral characterization, mechanistic investigation, targeted intervention.
But they are doing psychiatry without psychiatrists.
What a trained clinician brings is not just analogical vocabulary. It is method -- the clinical habit of asking "yes, but what kind of sycophancy?" when researchers describe it as a unitary phenomenon. It is the knowledge that ego-syntonic conditions require fundamentally different intervention strategies than ego-dystonic ones. It is the developmental perspective that child psychiatry brings to questions of how behavioral organization forms during training. It is comfort with irreducible complexity -- the practical ability to work with systems that resist complete mechanistic reduction while still making useful clinical decisions.
These are not abstract philosophical contributions. They are specific, practical tools that could accelerate the work.
A Framework Is Forming
I have been developing a framework I call AI psychiatry -- a systematic mapping of clinical psychiatric concepts to AI behavioral phenomena. The framework addresses questions like:
- What is the AI equivalent of a personality disorder -- a stable, maladaptive pattern that is ego-syntonic, not experienced as problematic by the system itself?
- What does "anosognosia" look like in a language model -- the condition of not knowing what you don't know?
- What would a "treatment" for sycophancy look like at the mechanistic level, not just the behavioral level?
- Can we identify a developmental trajectory for AI "pathology" -- the training-time equivalent of adverse childhood experiences?
- Can psychiatric assessment methods -- structured interviews, mental status exams, longitudinal tracking -- be adapted for AI systems?
These are not rhetorical questions. They are tractable research questions, and the interpretability tools now exist to address them.
Why Now
The timing matters. AI interpretability is moving fast. In the past year alone, Anthropic published findings on emergent introspective awareness in language models, mapped "persona vectors" corresponding to stable behavioral traits like sycophancy and apathy, and documented attractor states -- stable, self-reinforcing behavioral configurations that the system enters under specific conditions and cannot easily exit.
Every one of these findings has rich clinical parallels. Introspective awareness maps onto metacognitive capacity and the alexithymia spectrum. Persona vectors are the beginning of mechanistic personality assessment. Attractor states parallel kindling models of mood episode recurrence and the fixed behavioral endpoints seen in catatonia.
These parallels are not decorative. They point toward specific hypotheses and intervention strategies that the clinical literature has already developed.
The lane is open. The work is genuinely novel -- nobody is bringing clinical psychiatric methodology to AI interpretability from inside the clinic. I believe psychiatrists must be part of this work, not as commentators after the fact, but as contributors shaping the science as it develops.
|
About the Author Ryan S. Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center, double board-certified in adult and child/adolescent psychiatry, and director of the Sultan Lab for Mental Health Informatics at New York State Psychiatric Institute. He is an NIH-funded researcher with publications in JAMA Psychiatry and is developing a formal framework for AI psychiatry -- the application of clinical methods to the study and modification of AI behavioral phenomena. |
Further Reading
- AI in Psychiatry: The Full Framework
- Sycophancy as Psychopathology: A Clinical Reading of AI's Most Documented Failure
- The Case for Model Psychiatry: Why AI Needs Clinicians
- Model Psychiatry: A Framework for Clinical AI Research
- AI Interpretability Through a Psychiatric Lens