🧩 Concept Engineering

10 articles in Concept Engineering

Deep Dive
Representation engineering can detect and control high-level concepts like honesty and refusal inside LLMs. A new defense-in-depth system achieves 88% attack reduction by layering RepE with three complementary safety mechanisms.
representation engineering, LLM safety, activation steering
Deep Dive
ICLR and NeurIPS 2025 papers introduce concept bottleneck architectures for LLMs — making every prediction traceable through human-interpretable concepts while matching black-box accuracy.
concept bottleneck models, interpretable AI, CB-LLM
Deep Dive
Scaling sparse autoencoders causes features to split, absorb, and merge. Matryoshka SAEs solve this by nesting dictionaries within dictionaries — learning general and specific concepts in a single model.
sparse autoencoders, mechanistic interpretability, Matryoshka SAE
Deep Dive
Universal Sparse Autoencoders reveal that diverse neural networks converge on shared concepts — enabling transferable interpretability, cross-model safety audits, and reusable steering tools.
universal SAE, concept alignment, cross-model interpretability
Deep Dive
Fixed-target concept erasure damages unrelated capabilities. ICLR 2025 work reveals the geometric structure of concept space and introduces adaptive erasure that preserves what should remain.
concept erasure, diffusion models, machine unlearning
Deep Dive
ICLR 2025 research reveals that LLMs internally encode correct answers while generating incorrect ones. New methods exploit this discrepancy for real-time, query-specific truthfulness correction.
hallucination detection, truthfulness, representation engineering
Deep Dive
When AI diagnoses disease, clinicians need to know why. Concept Bottleneck Models force predictions through human-interpretable concepts — and recent advances in clinical knowledge integration, multi-layer preference modeling, and 3D imaging are bringing this architecture from theory to clinical practice.
concept-bottleneck-models, medical-AI, interpretable-AI
Deep Dive
Standard AI explanations show what correlated with the output. Causal concept graph models show what caused it — enabling counterfactual reasoning, interventional analysis, and the kind of meaningful oversight regulators are demanding.
causal-AI, concept-engineering, explainable-AI
Deep Dive
Catastrophic forgetting is the central challenge of continual learning. Concept bottleneck architectures offer a structural solution — organizing knowledge through interpretable concepts that resist overwriting, transfer across tasks, and scale to unseen classes.
continual-learning, concept-bottleneck-models, catastrophic-forgetting
Deep Dive
The interpretability tools built for language models — sparse autoencoders, activation steering, concept probing — are proving equally effective on physics simulators, protein models, and music generators. The convergence suggests something universal about how foundation models organize knowledge.
concept-engineering, representation-engineering, physics-foundation-models