An AI system examines a blood smear and declares with 94% confidence: this cell is an eosinophil. The hematologist asks: why? The system has no answer. It simply mapped pixels to a label through millions of inscrutable parameters. This opacity is not merely an academic inconvenience — it is the central barrier preventing deep learning from achieving clinical adoption in pathology, radiology, and genomics. A model that cannot explain itself cannot be trusted with a diagnosis.
The Bottleneck That Explains
Concept Bottleneck Models, first proposed by Koh et al. (2020) at ICML, offer a structurally different approach. Instead of mapping inputs directly to outputs through a black box, CBMs insert an intermediate layer of human-interpretable concepts. A pathology CBM, for instance, would first predict observable properties — nucleus shape, cytoplasm color, granule density — and then use those concept predictions to make its final classification. The architecture enforces a strict two-stage pipeline: the image is mapped to concepts, and only the concepts inform the final diagnosis. Every prediction comes with an explanation in terms clinicians already understand.
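The two-stage structure can be sketched in a few lines of plain Python. Everything here is illustrative: the concept names, the hand-set weights, and the linear scoring are hypothetical stand-ins for a trained encoder and classification head.

```python
import math

# Hypothetical concept set and weights, chosen for illustration only.
CONCEPTS = ["bilobed_nucleus", "orange_granules", "pale_cytoplasm"]

CONCEPT_WEIGHTS = {
    "bilobed_nucleus": [0.9, 0.1, 0.0],
    "orange_granules": [0.1, 0.8, 0.1],
    "pale_cytoplasm":  [0.0, 0.2, 0.7],
}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_concepts(image_features):
    """Stage 1: image features -> human-interpretable concept probabilities.
    A real model would use a CNN/ViT encoder; a linear score stands in here."""
    return {
        c: sigmoid(sum(w * f for w, f in zip(CONCEPT_WEIGHTS[c], image_features)) - 0.5)
        for c in CONCEPTS
    }

def predict_label(concepts):
    """Stage 2: diagnosis from concepts ONLY; raw pixels never reach this head."""
    score = 0.6 * concepts["bilobed_nucleus"] + 0.4 * concepts["orange_granules"]
    return ("eosinophil" if score > 0.5 else "other"), score

features = [1.0, 1.0, 0.2]             # stand-in for encoder output
concepts = predict_concepts(features)  # every intermediate value is inspectable
label, confidence = predict_label(concepts)
```

The point of the sketch is the restriction in `predict_label`: the classifier can only see the concept vector, so its reasoning is expressible entirely in clinical terms.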
The design carries a practical advantage that goes beyond transparency. Because the bottleneck layer speaks in clinical language, a physician can intervene. If the model incorrectly predicts "bilobed nucleus" when the pathologist can see a round one, the concept can be corrected at test time, and the downstream prediction updates accordingly. This test-time intervention capability — unique to CBMs among current interpretable architectures — creates a human-AI collaboration loop that aligns naturally with clinical workflows.
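Intervention itself is architecturally trivial, which is part of the appeal: because the diagnosis head sees only the concept vector, correcting a concept amounts to overwriting one entry and re-running the head. A minimal sketch, with a hypothetical toy head and hand-set scores:

```python
def predict_label(concepts):
    """Toy concept-to-diagnosis head (hypothetical weights)."""
    score = 0.6 * concepts["bilobed_nucleus"] + 0.4 * concepts["orange_granules"]
    return "eosinophil" if score > 0.5 else "other"

def intervene(concepts, corrections):
    """Test-time intervention: the clinician overrides specific concept
    predictions, and the downstream head re-runs on the corrected vector."""
    fixed = dict(concepts)
    fixed.update(corrections)
    return fixed

# The model wrongly asserts a bilobed nucleus; the pathologist sees a round one.
model_concepts = {"bilobed_nucleus": 0.92, "orange_granules": 0.30}

before = predict_label(model_concepts)
after = predict_label(intervene(model_concepts, {"bilobed_nucleus": 0.0}))
```

With the spurious concept zeroed out, the downstream prediction changes, exactly the correction loop described above: no retraining, no gradient steps, just an expert fixing one interpretable value.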
When Data-Driven Models Learn the Wrong Priorities
A fundamental challenge emerges when CBMs are trained purely on data. Pang, Ke, Tsutsui, and Wen (2024), working at Nanyang Technological University, demonstrate that data-driven CBMs can learn concept priorities that diverge from clinical reasoning. Their experiments on white blood cell classification reveal the problem: a model might rely heavily on background staining patterns rather than the morphological features pathologists actually use. Under domain shift — when images come from different laboratories with different preparation protocols — these spurious associations collapse, and performance degrades.
Their solution integrates clinical knowledge directly into the training process. Rather than treating all concepts as equally important, they enforce alignment between the model's concept reliance and expert-defined priorities. When clinicians indicate that granule color is the decisive feature for identifying eosinophils while cytoplasm texture matters more for monocytes, the training process ensures the model reflects these class-specific priorities. The mechanism works by measuring the drop in prediction probability when each concept is removed: clinically important concepts must produce large drops, forcing the model to genuinely rely on them.
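That removal-based importance measure can be sketched as follows, assuming a toy concept-to-diagnosis head and a simple hinge penalty on rank violations; the actual loss formulation in Pang et al. may differ in detail.

```python
import math

def class_prob(concepts):
    """Toy head: probability of 'eosinophil' from two hypothetical concepts."""
    z = 2.0 * concepts.get("granule_color", 0.0) \
      + 0.3 * concepts.get("cytoplasm_texture", 0.0) - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def concept_importance(concepts):
    """Drop in class probability when each concept is zeroed out (removed)."""
    base = class_prob(concepts)
    return {name: base - class_prob(dict(concepts, **{name: 0.0}))
            for name in concepts}

def alignment_penalty(drops, expert_ranking):
    """Hinge penalty whenever a concept experts rank higher produces a
    SMALLER probability drop than a lower-ranked one."""
    return sum(max(0.0, drops[lo] - drops[hi])
               for hi, lo in zip(expert_ranking, expert_ranking[1:]))

c = {"granule_color": 0.9, "cytoplasm_texture": 0.8}
drops = concept_importance(c)
# Experts: granule color decides eosinophils; texture is secondary.
pen = alignment_penalty(drops, ["granule_color", "cytoplasm_texture"])
```

During training, a penalty like `pen` would be added to the task loss, pushing the model to make its measured concept reliance match the expert-defined ordering.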
The results are telling. On in-distribution data, the knowledge-guided model performs comparably to its purely data-driven counterpart. The difference appears under domain shift — precisely the condition that matters for real-world deployment across hospitals with different equipment and preparation methods. Here, the clinically guided CBM maintains substantially higher accuracy, because it has learned to rely on the same features that generalize across clinical settings: the features pathologists have identified through decades of practice.
The Layer Preference Discovery
A second limitation of existing medical CBMs concerns where in the visual encoder the concept information is extracted. Wang, Zhang, Liu et al. (2025) at the University of Science and Technology of China identify what they call concept preference variation — the empirical observation that different medical concepts are best represented at different layers of a visual encoder, not uniformly at the final layer.
Consider what this means in practice. Low-level visual features like texture and edge patterns emerge in early layers. Higher-level semantic features like spatial relationships between structures form in deeper layers. A concept like "irregular border" might be best captured at an intermediate layer, while "asymmetric cell distribution" requires the deeper representational capacity of later layers. Standard CBMs, by extracting all concepts from the final layer, force every concept through the same representational bottleneck — missing the layer where some concepts achieve their sharpest encoding.
Their proposed architecture, MVP-CBM, introduces two mechanisms to address this. First, an intra-layer concept preference module learns which concepts prefer which layers, assigning each concept-layer pair a preference weight. Second, a sparse activation fusion module aggregates concept signals across layers while maintaining sparsity — ensuring each concept draws primarily from its preferred layer rather than blending signals indiscriminately. The architecture processes the same input through a standard visual encoder but reads concept information from multiple depths, matching each concept to its natural level of abstraction.
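One way to sketch the idea, assuming per-concept preference logits over encoder layers and a low-temperature softmax as the sparsity mechanism; MVP-CBM's actual modules may differ in detail, and all scores below are hand-set for illustration.

```python
import math

def softmax(xs, temperature=0.25):
    """Low temperature -> near-one-hot weights, i.e. sparse layer selection."""
    m = max(x / temperature for x in xs)
    exps = [math.exp(x / temperature - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical per-layer concept scores from 4 encoder depths
# (columns run shallow -> deep).
layer_scores = {
    "irregular_border":        [0.2, 0.8, 0.3, 0.1],  # mid-level texture cue
    "asymmetric_distribution": [0.1, 0.2, 0.3, 0.9],  # needs deep semantics
}
# Learned concept-layer preference logits (hand-set here for illustration).
pref_logits = {
    "irregular_border":        [0.1, 2.0, 0.3, 0.1],
    "asymmetric_distribution": [0.1, 0.2, 0.4, 2.5],
}

def fuse(concept):
    """Aggregate a concept's per-layer scores with its preference weights."""
    w = softmax(pref_logits[concept])
    return sum(wi * si for wi, si in zip(w, layer_scores[concept]))

fused = {c: fuse(c) for c in layer_scores}
```

Because the softmax temperature keeps the weights near one-hot, each fused concept score is dominated by its preferred depth: the mid-level layer for the border concept, the deepest layer for the distribution concept, rather than an indiscriminate blend of all four.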
On medical image classification benchmarks — including skin lesion and retinal disease datasets — MVP-CBM achieves higher accuracy while providing more faithful concept-level explanations. The concept predictions better correspond to ground-truth annotations, suggesting that layer-aware extraction does not merely improve performance but improves the quality of the interpretable bottleneck itself.
Extending to Three Dimensions
The CBM framework is also expanding beyond two-dimensional imaging. Khaled and Al-Kabbany (2026) extend concept bottleneck models to three-dimensional medical data for intracranial aneurysm classification using 3D imaging. Aneurysms present a particularly compelling use case: the decision to treat is high-stakes, the relevant concepts (shape, size, location relative to vessels) are well-defined by neurosurgeons, and current deep learning approaches offer no explanation for their risk stratification. A 3D CBM that predicts morphological concepts before classifying rupture risk would give neurosurgeons exactly the kind of structured reasoning they need to evaluate the model's judgment against their own clinical experience.
The Regulatory Convergence
These technical advances arrive at a moment of regulatory convergence. The EU AI Act classifies medical diagnostic systems as high-risk, requiring transparency and human oversight. The FDA's framework for AI-enabled medical devices increasingly emphasizes the need for clinician-comprehensible explanations. CBMs are not merely one approach among many to interpretability — they are architecturally aligned with what regulators are asking for: predictions that route through concepts a human expert can evaluate, correct, and override.
The remaining challenges are substantial. Concept annotation requires expert time. The set of relevant concepts must be defined before training, which means the model cannot discover genuinely novel diagnostic features. And the bottleneck introduces an information constraint — by definition, any information not captured by the concept set is discarded. Whether the clinical benefits of interpretability outweigh this capacity limitation is an empirical question that each application domain must answer.
What is no longer in question is the direction. When the stakes include human health, the demand is not merely for accurate predictions but for accountable ones. Concept bottleneck models offer a path where accuracy and accountability are not competing objectives but architectural complements — where the model's explanation is not a post-hoc narrative but the mechanism through which the prediction is made.