When a language model classifies a restaurant review as negative, it gives you an answer but not its reasoning. Was it the mention of cold food? Rude service? Overpricing? Traditional LLMs operate as black boxes, making it impossible to audit, correct, or trust their decisions in high-stakes applications. Concept Bottleneck Models offer a different path: force the model to reason through human-interpretable concepts before making a prediction, making every decision transparent and correctable.
From Vision to Language: The CBM Migration
Concept Bottleneck Models were originally developed for image classification, where an intermediate layer of human-understandable concepts (wing color, beak shape) mediates between input and prediction. But extending this architecture to large language models posed substantial challenges. Prior attempts were limited to small text classification tasks, required expensive LLM API calls for every sample, and could not handle text generation — the capability that makes modern LLMs useful.
Sun, Oikarinen, Ustun, and Weng (2025), in a paper accepted at ICLR 2025, introduce CB-LLMs — the framework that brings the concept bottleneck architecture to large language models at scale. Their approach is a five-step pipeline: generate a concept set using ChatGPT; automatically score concept presence using sentence embedding models rather than per-sample LLM API calls; correct scoring errors through an automated concept correction step; train a Concept Bottleneck Layer on top of the pretrained LLM; and finally train a linear prediction layer.
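The last two pipeline steps can be sketched as a minimal forward pass. All dimensions and weights below are random placeholders for illustration, not the authors' trained model; the point is only that the prediction depends exclusively on the concept activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions standing in for a real LLM's hidden size,
# the concept set size, and the number of output classes.
d_hidden, k_concepts, n_classes = 16, 3, 2

W_cbl = rng.normal(size=(k_concepts, d_hidden))    # Concept Bottleneck Layer
W_pred = rng.normal(size=(n_classes, k_concepts))  # linear prediction head

def forward(h: np.ndarray):
    concepts = np.tanh(W_cbl @ h)  # interpretable concept activations
    logits = W_pred @ concepts     # prediction depends ONLY on the concepts
    return concepts, logits

h = rng.normal(size=d_hidden)      # stand-in for an LLM hidden state
concepts, logits = forward(h)
```

Because the hidden state reaches the prediction head only through the k concept neurons, every logit can be decomposed into per-concept contributions.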
The key innovation is the Automatic Concept Scoring strategy, which uses off-the-shelf sentence embedding models to measure concept-text similarity. This eliminates the need for GPT-4 API calls on every sample — a limitation that made previous approaches prohibitively expensive for large datasets. CB-LLMs scale to datasets with over 500,000 samples while maintaining competitive accuracy.
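The idea behind Automatic Concept Scoring can be illustrated with a toy embedding. A real implementation would substitute an off-the-shelf sentence-embedding model for the bag-of-words stand-in used here; the vocabulary, review, and concepts are all invented for the example.

```python
import numpy as np

# Toy stand-in for a sentence-embedding model: normalized bag-of-words
# vectors over a tiny vocabulary.
VOCAB = ["cold", "food", "rude", "service", "great", "friendly", "overpriced"]

def embed(text: str) -> np.ndarray:
    tokens = text.lower().split()
    vec = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def concept_scores(sample: str, concepts: list[str]) -> dict[str, float]:
    """Score each concept by the cosine similarity between its embedding
    and the sample's embedding -- no LLM API call per sample."""
    s = embed(sample)
    return {c: float(embed(c) @ s) for c in concepts}

review = "the food was cold and the service was rude"
concepts = ["cold food", "rude service", "friendly service"]
scores = concept_scores(review, concepts)
```

Since scoring is a batch of embedding lookups and dot products, it scales to hundreds of thousands of samples at negligible cost compared with per-sample API calls.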
For text classification, CB-LLMs match the accuracy of black-box models while achieving 1.5 times higher faithfulness scores than existing interpretable approaches. For text generation — a domain no prior CBM work had addressed — the interpretable neurons enable precise concept detection, controlled generation, and safer outputs. The authors demonstrate an inherently interpretable chatbot that can detect toxic queries, trace the reasoning behind harmful outputs, and steer generation toward safer responses.
The Bayesian Extension: Infinite Concept Search
A complementary approach by Feng, Kothari, Zier, Singh, and Tan (2024), presented at NeurIPS 2025, addresses a different limitation: how do you know you have the right concepts? Standard CBMs require a predefined concept set, but in complex domains like healthcare, the number of relevant concepts may be effectively infinite. A concept like "smoking status" branches into refinements — whether a patient has quit, how recently, what type — that a predefined list might miss entirely.
BC-LLM wraps the LLM within a Bayesian posterior inference procedure, iteratively searching over a potentially infinite concept space. The LLM serves dual roles: generating candidate concepts and providing prior probabilities for their relevance. A Metropolis accept/reject step corrects for LLM hallucinations and inconsistencies, ensuring that the system converges to genuinely relevant concepts even when the LLM's priors are imperfect.
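A hedged sketch of the accept/reject step follows, with a made-up scoring function standing in for the log posterior that BC-LLM approximates; in the real method the LLM proposes the candidate concepts and supplies the prior.

```python
import math
import random

random.seed(0)

def log_posterior(concept_set):
    # Hypothetical scores for how well each concept explains the data;
    # a real system would fit a model on the scored concepts instead.
    scores = {"smoker": 2.0, "quit recently": 2.5, "eye color": -1.0}
    return sum(scores.get(c, 0.0) for c in concept_set)

def metropolis_step(current, proposal):
    """Accept the proposed concept set with probability
    min(1, posterior(proposal) / posterior(current))."""
    log_ratio = log_posterior(proposal) - log_posterior(current)
    if math.log(random.random()) < log_ratio:
        return proposal  # accepted: the proposal explains the data better
    return current       # rejected: e.g. a hallucinated, irrelevant concept

state = ["smoker", "eye color"]
state = metropolis_step(state, ["smoker", "quit recently"])
```

The accept/reject ratio is what disciplines the LLM: a hallucinated concept that fails to improve the posterior is rejected regardless of how confidently the LLM proposed it.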
The results demonstrate that BC-LLM substantially improves the interpretability-accuracy tradeoff across image, text, and tabular data, sometimes outperforming black-box models entirely. In a clinical collaboration, hospital data scientists found the BC-LLM outputs to be more interpretable and actionable than their existing models. The framework also provides calibrated uncertainty quantification — knowing not just what concepts are relevant but how confident the system is in each assessment.
Why This Matters for Trustworthy AI
The convergence of these approaches points toward a future where interpretability is not an afterthought but an architectural feature. Several properties make concept bottleneck LLMs particularly relevant for high-stakes applications:
Auditability. Because every prediction passes through human-interpretable concepts, domain experts can examine exactly which concepts drove a particular decision. In healthcare, this means a clinician can verify that a diagnostic recommendation is based on medically relevant features rather than spurious correlations.
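With a linear prediction head, an audit can reduce to inspecting per-concept contributions, weight times activation. The numbers below are invented purely to illustrate the bookkeeping.

```python
# Hypothetical concept activations and prediction-layer weights for one
# class; the contribution of each concept is weight x activation, so an
# auditor can see exactly which concepts drove the decision.
activations = {"cold food": 0.9, "rude service": 0.7, "overpriced": 0.1}
weights = {"cold food": 1.2, "rude service": 1.0, "overpriced": 0.4}

contributions = {c: weights[c] * activations[c] for c in activations}
top = max(contributions, key=contributions.get)
```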
Intervenability. If a concept is scored incorrectly, users can manually correct it and observe how the prediction changes. This creates a structured interface for human-AI collaboration where experts can inject domain knowledge directly into the model's reasoning process.
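A toy intervention, with hypothetical weights, concepts, and threshold: the expert overwrites one mis-scored concept and the linear prediction is recomputed.

```python
# Made-up prediction-layer weights for the "negative review" class.
weights = {"cold food": 1.2, "rude service": 1.0, "overpriced": 0.4}

def predict(concepts):
    score = sum(weights[c] * v for c, v in concepts.items())
    return "negative" if score > 0.5 else "not negative"

concepts = {"cold food": 0.9, "rude service": 0.0, "overpriced": 0.0}
before = predict(concepts)   # the prediction rests on "cold food"

concepts["cold food"] = 0.0  # expert correction: the food was not cold
after = predict(concepts)    # the prediction updates accordingly
```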
Concept unlearning. CB-LLMs enable the selective removal of specific concepts from a model's reasoning — a capability with direct applications in safety (removing harmful concepts) and privacy (removing personally identifiable concepts from model decisions).
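One simple mechanism that realizes this idea, sketched here with placeholder weights rather than the authors' procedure, is to zero the prediction-layer column for the target concept so it can no longer influence any class.

```python
import numpy as np

rng = np.random.default_rng(1)
W_pred = rng.normal(size=(2, 3))  # classes x concepts, random placeholder
unlearn_idx = 2                   # index of the concept to remove

W_pred[:, unlearn_idx] = 0.0      # the concept is cut out of every decision

concepts = np.array([0.5, -0.3, 0.8])
logits = W_pred @ concepts        # no longer depends on the removed concept
```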
Open Questions
The accuracy-interpretability tradeoff, while improved, has not been eliminated. CB-LLMs match but do not consistently exceed black-box performance, raising the question of whether inherent interpretability always comes at some cost. The reliance on LLMs for concept generation introduces a dependency — if the concept-generating model has blind spots, the bottleneck will inherit them.
Perhaps the most fundamental question is whether the concepts that make sense to humans are the same concepts that are most useful for prediction. The success of CB-LLMs suggests significant overlap, but the gap between human-interpretable and machine-optimal concepts remains an active area of investigation.
Looking Forward
The acceptance of CB-LLMs at ICLR 2025 and BC-LLM at NeurIPS 2025 signals that concept bottleneck approaches have reached a level of maturity that the top venues recognize. As regulatory frameworks like the EU AI Act increasingly demand explainability in high-risk AI systems, architectures that build interpretability into the model rather than adding it post-hoc may shift from academic interest to practical necessity.
References
Sun, C.-E., Oikarinen, T. P., Ustun, B., & Weng, T.-W. (2025). Concept Bottleneck Large Language Models. International Conference on Learning Representations (ICLR) 2025. arXiv:2412.07992.
Feng, J., Kothari, A., Zier, L., Singh, C., & Tan, Y. S. (2024). Bayesian Concept Bottleneck Models with LLM Priors. Conference on Neural Information Processing Systems (NeurIPS) 2025. arXiv:2410.15555.