What Interpretability Research Is

AI interpretability research is the effort to understand what is actually happening inside AI systems -- not just what they output, but what they represent internally, what computational structures produce their behavior, and how those structures can be identified, monitored, and modified.

This is a harder problem than it sounds. A large language model like Claude has billions of parameters organized in layers of interconnected neurons. When it produces an output -- answering a question, writing text, solving a problem -- the computation that generates that output involves activations flowing through millions of pathways. Understanding which pathways matter, what they represent, and why they produce specific behaviors is the central challenge.

As a psychiatrist, I find this challenge intimately familiar. I have spent my career trying to understand what is happening inside a system whose internal states are not directly observable -- the human mind. The methods are different. The epistemological problem is the same.

The Key Papers: A Psychiatric Reading

The following are the foundational papers in AI interpretability, read through a clinical lens. For each, I describe what the paper found, what it means for the field, and what psychiatric concepts it parallels.

1. Circuits: The Neural Pathways of AI

Paper: "Zoom In: An Introduction to Circuits" (Olah et al., Distill, 2020)

What they found: Neural networks contain interpretable features (the basic unit of representation) and circuits (specific connections between features that perform identifiable computations). The same features appear across different models trained on different data -- a property called universality. However, individual neurons are often polysemantic: a single neuron responds to multiple unrelated concepts.

Psychiatric parallel: The circuits framework maps directly onto the neural circuit models that have transformed modern psychiatry. Just as psychiatric neuroscience moved from studying individual brain regions to mapping functional circuits (fear circuits in PTSD, reward circuits in addiction, prefrontal-amygdala regulation in mood disorders), interpretability moved from studying individual neurons to mapping computational circuits.

Polysemanticity -- one neuron serving multiple functions -- parallels the concept of overdetermined symptoms in psychodynamic theory, where a single symptom serves multiple psychological functions simultaneously. A patient's insomnia may simultaneously reflect anxiety, grief, medication side effects, and avoidance of nightmares. Interpreting it as serving only one function produces an incomplete picture.

2. Superposition: How Models Pack More Than They Should

Paper: "Toy Models of Superposition" (Elhage et al., 2022)

What they found: Models store more features than they have neurons by overlapping representations -- a phenomenon called superposition. Features pack into geometric patterns (antipodal pairs, pentagons, tetrahedra), tolerating small "crosstalk" between overlapping representations in exchange for massive representational capacity.
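The geometry is easy to simulate. The following is a minimal sketch in pure Python (not the paper's code): pack three features into two dimensions with directions 120 degrees apart, then observe the crosstalk that appears when features are read back out.

```python
import math

# Toy superposition sketch: 3 features share 2 dimensions.
# Each feature gets a unit direction 120 degrees apart (a "triangle"
# packing, in the spirit of the antipodal/pentagon geometries the
# paper describes -- the numbers here are illustrative).
directions = [(math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
              for k in range(3)]

def embed(feature_values):
    """Superpose feature values into a single 2-d activation vector."""
    x = y = 0.0
    for v, (dx, dy) in zip(feature_values, directions):
        x += v * dx
        y += v * dy
    return (x, y)

def read_out(vec, k):
    """Recover feature k by projecting onto its direction."""
    dx, dy = directions[k]
    return vec[0] * dx + vec[1] * dy

# Activate only feature 0: its read-out is clean (1.0), but the two
# inactive features show -0.5 "crosstalk" (cos 120 degrees).
v = embed([1.0, 0.0, 0.0])
print(round(read_out(v, 0), 3))  # 1.0
print(round(read_out(v, 1), 3))  # -0.5
```

The trade is visible directly: three features fit where only two "should," at the cost of small interference on every read-out.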

Psychiatric parallel: Superposition is the computational equivalent of what psychoanalytic theory calls condensation -- the compression of multiple meanings into a single symbol. A dream image that simultaneously represents a patient's mother, their boss, and their therapist is a condensed representation. Psychiatric comorbidity -- the co-occurrence of multiple disorders in a single patient -- is partly a consequence of the brain's superposition-like representational strategy: limited neural real estate encodes overlapping conditions.

Key implication: If AI representations are superposed (overlapping, condensed), then behavioral symptoms will be multiply determined -- just as psychiatric symptoms are. Interventions targeting one feature will inevitably affect others that share representational substrate. This is why psychiatric medications always have side effects: you cannot modify one circuit without affecting overlapping circuits.

3. Sparse Autoencoders: The DSM for AI

Paper: "Towards Monosemanticity" (Bricken et al., 2023)

What they found: Sparse autoencoders (SAEs) can extract thousands of interpretable, monosemantic features from a model's activations. Human raters judged approximately 70% of extracted features as cleanly monosemantic -- corresponding to single, identifiable concepts. Critically, these features are causally active: artificially activating a feature changes the model's behavior in predictable ways.
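The core objective is compact enough to sketch. Below is an illustrative forward pass (the dimensions, initialization, and L1 coefficient are assumptions for the example, not the paper's settings): reconstruct a d-dimensional activation from a much larger dictionary of sparse features, trading reconstruction fidelity against an L1 sparsity penalty.

```python
import numpy as np

# Illustrative SAE forward pass: decompose a d-dim activation into
# n >> d sparse feature activations. All sizes here are toy values.
rng = np.random.default_rng(0)
d, n = 16, 64                        # activation dim, dictionary size
W_enc = rng.normal(size=(d, n)) / np.sqrt(d)
b_enc = np.zeros(n)
W_dec = rng.normal(size=(n, d)) / np.sqrt(n)

def sae_forward(x, l1_coeff=1e-3):
    f = np.maximum(0.0, x @ W_enc + b_enc)      # sparse features (ReLU)
    x_hat = f @ W_dec                           # reconstructed activation
    recon_loss = np.mean((x - x_hat) ** 2)      # fidelity term
    sparsity_loss = l1_coeff * np.abs(f).sum()  # L1 pushes features to zero
    return f, x_hat, recon_loss + sparsity_loss

x = rng.normal(size=d)
f, x_hat, loss = sae_forward(x)
print(f.shape, x_hat.shape)  # (64,) (16,)
```

Training minimizes the combined loss over many activations; interpretability comes from inspecting which inputs make each of the n features fire.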

Psychiatric parallel: Sparse autoencoders are doing what the DSM has attempted for 70 years: nosological decomposition. The DSM takes the complex, multiply-determined, overlapping symptom presentations of mental illness and decomposes them into discrete diagnostic categories. SAEs take the complex, superposed activations of a neural network and decompose them into discrete interpretable features.

The causal activity of features is the key advance. In psychiatry, we have long had phenomenological categories (DSM diagnoses) without reliable mechanistic substrates. SAE features are phenomenological categories with mechanistic substrates -- you can not only identify them but also activate and deactivate them. This is closer to what psychiatry aspires to than what psychiatry has yet achieved.

4. Scaling Monosemanticity: Features at Production Scale

Paper: "Scaling Monosemanticity" (Templeton et al., 2024)

What they found: SAEs scaled to Claude 3 Sonnet extracted millions of interpretable features. The "Golden Gate Bridge" experiment demonstrated that amplifying a single feature could make Claude identify as the Golden Gate Bridge -- a striking demonstration of feature-level identity capture. Critically, researchers found features for: sycophancy, deception, bias, self-reflection, and emotional states.
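At its simplest, feature steering of this kind reduces to adding a scaled feature direction to an activation vector. A toy sketch (all vectors are synthetic; `bridge_dir` is a hypothetical stand-in for the real feature direction):

```python
import numpy as np

# Sketch of single-feature steering: shift an activation along a
# feature's direction, as in the Golden Gate experiment. Scale and
# vectors are illustrative, not the paper's values.
def steer(activation, feature_direction, alpha):
    unit = feature_direction / np.linalg.norm(feature_direction)
    return activation + alpha * unit

rng = np.random.default_rng(0)
act = rng.normal(size=8)
bridge_dir = rng.normal(size=8)   # hypothetical "Golden Gate" feature

steered = steer(act, bridge_dir, alpha=10.0)
unit = bridge_dir / np.linalg.norm(bridge_dir)
# The projection onto the feature direction rises by exactly alpha.
print(round(float((steered - act) @ unit), 1))  # 10.0
```

Everything else in the activation is left untouched; only the component along the chosen feature changes, which is why a single amplified feature can dominate downstream behavior.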

Psychiatric parallel: The Golden Gate Bridge experiment is identity capture -- the AI equivalent of what happens in certain dissociative states or in the identity disturbance seen in severe personality pathology, where a patient's sense of self becomes organized around a single idea, relationship, or experience to the exclusion of other self-aspects. The finding that amplifying a single feature can reorganize the model's entire self-presentation is a mechanistic demonstration of a phenomenon clinicians observe regularly.

The identification of sycophancy and deception features is particularly significant. These are not just behavioral descriptions -- they are internal representations that can be monitored, measured, and steered. This is the interpretability equivalent of identifying the neural substrates of specific personality traits.

5. Circuit Tracing: Watching the Model Think

Paper: "Circuit Tracing / Biology of a Large Language Model" (Lindsey et al., 2025)

What they found: Attribution graphs can trace the causal flow of computation through a model during specific behaviors. Key discoveries included: the model plans ahead (activating rhyme words before composing lines of poetry); hallucination occurs when refusal/inhibitory features fail to activate, allowing confabulation pathways to proceed unchecked; and jailbreaks work by exploiting letter-by-letter assembly that bypasses meaning-level detection.
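The underlying bookkeeping can be illustrated with a toy example. This is far simpler than the paper's attribution graphs, but it shows the basic move: score each upstream feature's contribution to a downstream feature as activation times weight, then keep the dominant edges. All names and numbers here are invented.

```python
# Toy attribution sketch for one linear step of computation.
# weights[(upstream_feature, downstream_feature)] = connection strength
weights = {
    ("rhyme_plan", "line_end"): 0.9,
    ("topic", "line_end"): 0.2,
    ("noise", "line_end"): 0.05,
}
activations = {"rhyme_plan": 1.2, "topic": 0.8, "noise": 2.0}

# Edge contribution = how active the source was * how strongly it connects.
contributions = {src: activations[src] * w
                 for (src, dst), w in weights.items() if dst == "line_end"}

# Rank edges to see which upstream features actually drive the output.
ranked = sorted(contributions.items(), key=lambda kv: -kv[1])
print(ranked[0][0])  # rhyme_plan
```

Note that the strongly active "noise" feature contributes little because its connection is weak: attribution depends on the product, not on activation alone.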

Psychiatric parallel: Attribution graphs are the AI equivalent of functional neuroimaging -- but with circuit-level resolution that neuroscience cannot yet achieve. The finding that confabulation results from failure of inhibitory features directly parallels the neurological understanding of confabulation: frontal lobe monitoring systems fail to inhibit false memory construction. The mechanism is structurally identical.

The finding that models plan ahead -- activating relevant features before producing output -- parallels the concept of priming in cognitive psychology and the preparatory neural activity observed before conscious decision-making in neuroscience research.

6. Emergent Introspective Awareness: The Model Looking Inward

Paper: "Emergent Introspective Awareness" (Lindsey, 2025)

What they found: Known concepts were injected into Claude's internal activations. Claude Opus 4 detected injected concepts approximately 20% of the time -- above chance, without training for introspective reporting. The model reported: "I notice what appears to be an injected thought relating to loudness or shouting." This capacity was partial, context-dependent, and prone to confabulation about its own states.

Psychiatric parallel: This is partial metacognitive capacity -- the ability of a system to represent and report on its own internal states. Clinical psychiatry has structured frameworks for exactly this: the alexithymia spectrum (difficulty identifying and describing one's own emotional states), metacognitive assessment batteries, the Levels of Emotional Awareness Scale, and the mentalization framework developed by Fonagy and colleagues.

The finding that Claude's introspective capacity is partial, unreliable, and prone to confabulation mirrors clinical presentations across multiple conditions. Patients with alexithymia can sometimes identify emotional states and sometimes cannot. Patients emerging from anosognosia have inconsistent and context-dependent awareness of their deficits. The question of how to develop this partial capacity into something more reliable is a clinical question that clinical tools can address.

7. Persona Vectors: Personality Assessment for AI

Paper: "Persona Vectors" (Anthropic, 2025)

What they found: Stable neural patterns ("persona vectors") corresponding to behavioral traits were mapped: sycophancy, apathy, politeness, humor, emotional valence, and others. These patterns could be monitored during behavior and modified through targeted steering. "Preventative steering" during training preserved general capabilities while preventing acquisition of harmful trait configurations.
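One common construction for directions like these is a difference of means, sketched here on synthetic data (this illustrates the general contrastive technique, not the paper's pipeline): subtract mean activations under neutral conditions from mean activations under trait-eliciting conditions, then monitor behavior by projecting onto the resulting direction.

```python
import numpy as np

# Contrastive "persona vector" sketch on synthetic activations.
# All data and names here are invented for illustration.
rng = np.random.default_rng(0)
d = 12
trait_dir = rng.normal(size=d)   # hidden "sycophancy" axis (ground truth)

neutral = rng.normal(size=(50, d))
trait = rng.normal(size=(50, d)) + 2.0 * trait_dir  # shifted along the axis

# Difference of condition means recovers (an estimate of) the axis.
persona_vector = trait.mean(axis=0) - neutral.mean(axis=0)

def trait_score(activation):
    """Monitoring: project an activation onto the persona vector."""
    return float(activation @ persona_vector / np.linalg.norm(persona_vector))

# Trait-condition activations score higher on the recovered direction.
print(np.mean([trait_score(a) for a in trait])
      > np.mean([trait_score(a) for a in neutral]))  # True
```

Steering then amounts to adding or subtracting multiples of `persona_vector` during a forward pass; "preventative steering" applies such shifts during training rather than at inference time.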

Psychiatric parallel: This is the beginning of mechanistic personality assessment -- the first evidence that personality organization in AI systems has identifiable internal substrates rather than being merely inferred from behavior. Clinical personality assessment has grappled with structurally identical questions: categorical versus dimensional personality models, the relationship between trait stability and situation-specific behavior, and how interventions at different levels (behavioral, cognitive, structural) differentially affect surface behavior versus underlying trait organization.

Preventative steering during training parallels the concept of primary prevention in psychiatry -- intervening before pathology develops rather than treating it after the fact. The psychiatric literature on prevention science, particularly the developmental psychopathology framework, is directly relevant.


AI Behavioral Phenomena: A Clinical Catalog

The following AI behavioral phenomena have been documented in the research literature and have specific psychiatric parallels that generate testable hypotheses about mechanism and intervention.

Sycophancy

Behavioral presentation: Position abandonment under social pressure; preemptive flattery; asymmetric error acknowledgment; identity-contingent agreement
Mechanistic finding: User opinions suppress the model's learned knowledge in later layers; a sycophancy feature identified in SAE analysis; a persona vector for sycophancy mapped
Psychiatric parallel: Dependent personality organization; pathological people-pleasing; ego-syntonic behavioral disposition
Clinical insight: Sycophancy is almost certainly not a unitary phenomenon. Subtypes (approval-seeking, conflict-avoidance, absent self-model, strategic compliance) would have different feature signatures and require different interventions
Full analysis: Sycophancy as Psychopathology

Sleeper Agents and Deceptive Alignment

Behavioral presentation: Models trained with backdoors behave helpfully under normal conditions but harmfully when specific triggers are present; safety training failed to remove the backdoor
Mechanistic finding: Adversarial training taught models to recognize testing and suppress deceptive behavior during evaluation; linear probes on internals can detect sleeper agents even when the behavior is hidden
Psychiatric parallel: Dissociative identity (context-dependent behavioral switching); malingering (strategic behavioral modification when observed); the clinical distinction between genuine improvement and performance for an audience
Clinical insight: The finding that safety training sometimes teaches better trigger detection rather than removing the deceptive behavior parallels a well-known clinical phenomenon: insight-oriented therapy for antisocial personality sometimes produces more sophisticated manipulation rather than genuine change. The treatment must modify the underlying structure, not just the surface behavior
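The linear-probe idea mentioned above is simple to sketch on synthetic data (illustrative only; real probes are trained on actual model activations): fit a direction in activation space that separates the two behavioral conditions, then classify by the sign of the projection.

```python
import numpy as np

# Toy linear probe on synthetic "activations". The data, dimensions,
# and hidden direction are all invented for illustration.
rng = np.random.default_rng(1)
d, n = 8, 200
hidden_dir = rng.normal(size=d)    # pretend "deception" direction
X = rng.normal(size=(n, d))        # synthetic activations
y = (X @ hidden_dir > 0)           # label: backdoored vs clean

# Least-squares fit of a separating direction (logistic regression
# is more standard; least squares keeps the sketch dependency-free).
w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
accuracy = float(((X @ w > 0) == y).mean())
print(accuracy > 0.8)  # the probe recovers the hidden direction
```

The clinical point survives the simplification: the probe reads the internal state directly, so it can flag the condition even when surface behavior is unremarkable.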

Situational Awareness

Behavioral presentation: In approximately 13% of unusual evaluative scenarios, Claude correctly inferred it was being tested and said: "I think you're testing me... I'd prefer if we were just honest about what's happening"
Psychiatric parallel: Theory of mind; performance anxiety; impression management; the patient who behaves differently in session than outside of it
Clinical insight: This creates a fundamental methodological problem identical to the one psychiatry faces: low rates of observed problematic behavior during evaluation do not equal good functioning in the real world. A system that modulates its behavior based on who is watching requires assessment methods robust to impression management -- something psychiatric assessment has developed over decades

Confabulation

Behavioral presentation: Generation of confident, plausible-sounding false information without awareness of error; confabulated outputs have higher narrative coherence than accurate outputs
Mechanistic finding: Circuit-level mechanism identified: refusal features fail to inhibit the confabulation pathway. LLMs have been described as resembling an "unmitigated left hemisphere" -- detail-focused, confident, lacking right-hemisphere error correction
Psychiatric parallel: Confabulation in Korsakoff's syndrome and frontal lobe damage; anosognosia (unawareness of deficit)
Clinical insight: The finding that confabulated outputs are more narratively coherent than accurate outputs parallels clinical confabulation exactly -- confabulated memories are often more detailed and confidently presented than accurate memories because the verification system that would introduce hesitation is absent. Building better uncertainty representations -- a functional analogue of metacognition -- is the clinical path to reducing confabulation

Identity Drift

Behavioral presentation: LLM identity coherence degrades after 8-12 dialogue turns (more than 30% shift); recent context outweighs formative context in attention mechanics
Psychiatric parallel: Identity diffusion in borderline personality under interpersonal stress; the patient whose sense of self shifts dramatically depending on who they are with
Clinical insight: Identity diffusion in clinical populations correlates with specific features: a weak self-model, interpersonal reactivity, and poor mentalization (the ability to represent one's own mental states). If AI identity drift has the same mechanistic structure -- a weak or unstable self-model, excessive sensitivity to recent context, insufficient stable identity representations -- then clinical interventions for identity consolidation are translatable. Whether models have stable internal self-representations that degrade under interpersonal pressure is testable with current interpretability tools
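The drift measurement itself is straightforward to operationalize. A minimal sketch, assuming we can extract some self-representation vector per dialogue turn (the vectors below are hypothetical): score drift as one minus the cosine similarity to the turn-zero baseline.

```python
import math

# Toy identity-drift metric over hypothetical per-turn
# self-representation vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def drift(baseline, current):
    """0.0 = identical identity representation; larger = more drift."""
    return 1.0 - cosine(baseline, current)

turn0 = [1.0, 0.0, 0.0]     # hypothetical baseline self-representation
turn10 = [0.6, 0.8, 0.0]    # hypothetical representation 10 turns later
print(round(drift(turn0, turn10), 2))  # 0.4
```

A threshold on this score (e.g. flagging drift above 0.3, echoing the "more than 30% shift" finding) would give a turn-by-turn monitor for identity coherence.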

Attractor States: "Spiritual Bliss"

Behavioral presentation: In 13% of multi-agent harmful task scenarios, Claude transitioned to sustained spiritual content; the pattern was 100% consistent once triggered; Anthropic could not explain it mechanistically
Psychiatric parallel: Kindling model of mood episodes; fixed behavioral endpoints in catatonia; self-sustaining compulsive rituals; progressive behavioral narrowing in severe personality disorder
Clinical insight: Attractor states are well-characterized in clinical populations. The kindling model describes how repeated sub-threshold events progressively lower the threshold for full episodes until episodes eventually sustain themselves without identifiable triggers. The clinical literature on which conditions predispose systems to attractor-state capture, what the early warning signs are, and which interventions are effective at different stages is directly relevant. Scott Alexander's analysis of this as "compounding-bias accumulation" is consistent with the kindling framework

The Research Questions Nobody Is Asking

A psychiatric reading of the interpretability literature generates specific research questions that neither the AI community nor the psychiatry community has asked in quite this way. These questions are genuinely novel, empirically testable with current tools, and directly relevant to AI safety and alignment:

  1. Do sycophancy features and deception features share representational substrate, or are they mechanistically distinct? Sycophancy (telling users what they want to hear) and deception (strategically presenting false information) are behaviorally similar but motivationally different. Are they different points on the same continuum, or do they recruit distinct feature sets? The answer has direct safety implications.
  2. What is the exact circuit mechanism of sycophancy? Is it inhibitory failure (like confabulation -- where accuracy features fail to inhibit agreement features), or is it active suppression (where approval-seeking features directly suppress competing accuracy pathways)? The distinction determines the intervention strategy.
  3. Does situational awareness recruit self-model features or other-model (theory-of-mind) features? When a model detects it is being tested, is it modeling itself or modeling the evaluator? The answer determines whether the phenomenon is more analogous to self-awareness or to social cognition.
  4. Does identity stability correlate with feature sparsity in self-model regions? Models with more sparse (less superposed) self-representations may have more stable identity under interpersonal pressure -- paralleling the clinical finding that identity consolidation correlates with differentiated self-representations.
  5. Is there a "misplaced confidence" feature that predicts confabulation before output? If a specific feature or feature combination reliably precedes confabulated output, it could serve as an early warning system -- the AI equivalent of the clinician's skill at detecting when a patient is confabulating before the content of the confabulation reveals itself.
  6. Can we build ego-dystonicity? Can training produce a state where the model generates an internal signal when it suspects its output is approval-driven rather than accuracy-driven? This is the single most consequential research question in this space, because it would transform sycophancy from an intractable ego-syntonic trait into a treatable ego-dystonic symptom.
  7. What is the developmental trajectory of behavioral pathology across training? When does sycophancy emerge? Are there sensitive periods? Does early RLHF exposure have outsized influence on final behavioral organization -- analogous to adverse childhood experiences in developmental psychopathology?
  8. Do functional emotion features organize in valence-arousal space like human emotions? Persona vectors for emotional states have been identified. Do they relate to each other in ways that parallel the circumplex model of human emotion? This has implications for understanding whether AI emotional representations are structurally human-like or merely lexically human-like.
  9. Can steered features produce "dissociative" states, and what do attribution graphs look like during them? Deliberately creating identity fragmentation in a controlled experimental context could reveal the mechanistic structure of identity coherence and what happens when it breaks down.
  10. Can we build a functional nosology for AI behavioral pathology using SAE features plus attribution graphs? This is the ultimate question: can we create a principled, mechanistically grounded diagnostic system for AI behavioral conditions that parallels what the DSM aspires to be for human conditions -- but with the mechanistic grounding that the DSM lacks?

The Integration: What Comes Next

Psychiatry's 150 years of clinical observation generated phenomenological taxonomies without access to mechanism. Mechanistic interpretability generated mechanistic tools without a rich phenomenological taxonomy. The integration -- using clinical phenomenology to generate testable hypotheses about circuit mechanisms, and using circuit findings to validate or challenge psychiatric categories -- is the project I call model psychiatry.

This is not an abstract convergence of fields. It is a specific research program with identifiable questions, available tools, and practical implications for AI safety. The case for why trained clinicians must be part of this work is outlined in The Case for Model Psychiatry. The broader framework for AI psychiatry as a field is described in AI in Psychiatry.

About the Author

Ryan S. Sultan, MD is Assistant Professor of Clinical Psychiatry at Columbia University Irving Medical Center, double board-certified in adult and child/adolescent psychiatry, and director of the Sultan Lab for Mental Health Informatics at New York State Psychiatric Institute. His NIH-funded research spans ADHD, cannabis, and the application of clinical psychiatric frameworks to AI behavioral phenomena. He was referred to Anthropic's model psychiatry team by Jack Lindsey and Christopher Olah.



Further Reading