When a text-to-image model can generate photorealistic depictions of any public figure, reproduce copyrighted artistic styles on demand, or create harmful content from innocuous-sounding prompts, the question of how to selectively remove these capabilities becomes urgent. Concept erasure — the technique of surgically removing specific concepts from a generative model without retraining from scratch — has emerged as one of the most active frontiers in AI safety for diffusion models.
The Fixed-Target Problem
The standard approach to concept erasure maps an undesirable concept to a fixed, neutral target — typically an empty text prompt or a generic concept like "a photo." When you erase "nudity," the model learns to generate something generic whenever that concept would have been invoked. This works for removing the targeted concept, but it introduces a subtle problem: what happens to related concepts that were not supposed to be erased?
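The fixed-target objective can be caricatured with a toy linear "model": fine-tune the trainable copy so its output on the erased concept matches what a frozen copy produces for the neutral target. This is a minimal sketch of the idea, not any specific method's loss; the embeddings and the linear stand-in are hypothetical.

```python
def fixed_target_erasure_loss(model, frozen_model, c_erase, c_target):
    """Toy fixed-target erasure objective: push the trainable model's
    output on the erased concept toward the frozen model's output on
    the neutral target (e.g. the empty prompt)."""
    pred = model(c_erase)            # what the model currently produces
    anchor = frozen_model(c_target)  # what it should produce instead
    # squared error between the two outputs, summed over dimensions
    return sum((p - a) ** 2 for p, a in zip(pred, anchor))

# Tiny stand-in "model": a linear map from a 3-d concept embedding.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
linear = lambda e: [sum(w * x for w, x in zip(row, e)) for row in W]

nudity = [1.0, 0.2, 0.0]  # hypothetical concept embedding
empty = [0.0, 0.0, 0.0]   # the "empty prompt" target
loss = fixed_target_erasure_loss(linear, linear, nudity, empty)
```

Driving this loss to zero for one concept necessarily moves the shared weights, which is exactly where the collateral damage discussed next comes from.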
Bui et al. (2025), in a paper accepted at ICLR 2025, demonstrate that this fixed-target strategy is fundamentally suboptimal. By modeling the concept space as a graph — where nodes represent concepts and edges represent the impact of erasing one concept on another — they uncover an important geometric property: the influence of erasing a concept is localized in the concept space. Erasing "English Springer" primarily affects nearby concepts like "Clumber Spaniel" and "Border Collie" while leaving distant concepts like "Trombone" and "Church" largely untouched.
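The graph view can be sketched by weighting edges with the similarity of concept embeddings, since erasure influence concentrates on high-similarity neighbours. The 3-d embeddings below are made up for illustration; in practice they would come from the model's text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical concept embeddings (toy 3-d vectors).
concepts = {
    "English Springer": [0.9, 0.1, 0.0],
    "Clumber Spaniel":  [0.8, 0.2, 0.1],
    "Trombone":         [0.0, 0.1, 0.9],
}

# Edges of the concept graph: similarity as a proxy for erasure influence.
edges = {(a, b): cosine(ea, eb)
         for a, ea in concepts.items()
         for b, eb in concepts.items() if a < b}
```

With real encoder embeddings, the same construction would show the two spaniels tightly connected and the instrument far away, matching the locality the paper reports.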
This locality property has a direct practical implication: the choice of target concept matters enormously. Mapping "English Springer" to an empty prompt forces a large parameter change that ripples across many concepts. Mapping it to "Dog" — a closely related but distinct concept — requires a much smaller parameter adjustment, preserving the model's capabilities on unrelated concepts while still effectively erasing the target.
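Why a nearby target needs a smaller edit has a clean closed-form illustration for a single linear layer (a sketch of the intuition, not the paper's method). The minimal Frobenius-norm update that remaps a concept embedding e_c to the target's output W e_t is the rank-1 matrix ΔW = (W e_t − W e_c) e_cᵀ / ‖e_c‖². Its size scales with ‖W(e_t − e_c)‖, so a close target means a small edit, and its effect on any other concept e_o scales with e_cᵀe_o, which is the locality property.

```python
def rank_one_erase(W, e_c, e_t):
    """Minimal-norm rank-1 edit so the new map sends e_c to W @ e_t."""
    Wc = [sum(w * x for w, x in zip(row, e_c)) for row in W]
    Wt = [sum(w * x for w, x in zip(row, e_t)) for row in W]
    nc = sum(x * x for x in e_c)  # ||e_c||^2
    # W_new = W + (W e_t - W e_c) e_c^T / ||e_c||^2
    return [[w + (wt - wc) * x / nc for w, x in zip(row, e_c)]
            for row, wt, wc in zip(W, Wt, Wc)]

def delta_norm(W_new, W_old):
    """Frobenius norm of the parameter change."""
    return sum((a - b) ** 2
               for ra, rb in zip(W_new, W_old)
               for a, b in zip(ra, rb)) ** 0.5

# Hypothetical 2-d embeddings: a distant empty target vs. a nearby "Dog".
W = [[1.0, 0.0], [0.0, 1.0]]
springer = [1.0, 0.2]
dog = [0.9, 0.3]
empty = [0.0, 0.0]
d_empty = delta_norm(rank_one_erase(W, springer, empty), W)
d_dog = delta_norm(rank_one_erase(W, springer, dog), W)
```

Here d_dog comes out far smaller than d_empty: mapping the concept to a semantically close target perturbs the weights, and therefore every other concept, much less.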
Adaptive Guided Erasure
Building on this geometric insight, the authors propose Adaptive Guided Erasure (AGE), a method that dynamically selects the optimal target concept for each erasure query. Rather than using a single fixed target for all concepts, AGE solves a minimax optimization problem to find a target that is closely related to the concept being erased (minimizing collateral damage) but is not a synonym (ensuring the concept is genuinely removed rather than simply relabeled).
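The "close but not a synonym" principle can be caricatured as a constrained search over candidate target embeddings: exclude anything within a synonym margin, then take the nearest survivor. The margin, distance function, and candidate set here are illustrative stand-ins, not the paper's actual minimax formulation.

```python
import math

def pick_target(e_erase, candidates, synonym_margin=0.1):
    """Pick the candidate closest to the erased concept that is not
    so close it would merely relabel it (distance must exceed margin)."""
    eligible = [(name, e) for name, e in candidates.items()
                if math.dist(e, e_erase) > synonym_margin]
    return min(eligible, key=lambda ne: math.dist(ne[1], e_erase))[0]

# Hypothetical 2-d embeddings for the candidates.
springer = [0.9, 0.1]
candidates = {
    "English Springer Spaniel": [0.9, 0.11],  # near-synonym: excluded
    "Dog":   [0.8, 0.3],                      # close but distinct
    "Empty": [0.0, 0.0],                      # distant neutral target
}
target = pick_target(springer, candidates)
```

Under these toy embeddings the near-synonym is filtered out by the margin and "Dog" wins over the empty prompt, mirroring the tradeoff AGE's minimax objective optimizes.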
To move beyond discrete concept selection, AGE models the target as a learned mixture of multiple concepts in continuous space, allowing fine-grained optimization of the erasure-preservation tradeoff. The method is evaluated on three erasure tasks: object removal (specific ImageNet classes), NSFW attribute erasure, and artistic style removal.
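A continuous target can be sketched as a softmax-weighted mixture over a basis of concept embeddings, with the logits as the learnable parameters; this is an illustrative parameterization under assumed toy embeddings, not the paper's exact construction.

```python
import math

def softmax(ws):
    """Numerically stable softmax over a list of logits."""
    m = max(ws)
    exps = [math.exp(w - m) for w in ws]
    s = sum(exps)
    return [e / s for e in exps]

def mixture_target(logits, basis):
    """Target embedding as a convex combination of basis concept embeddings."""
    probs = softmax(logits)
    dim = len(next(iter(basis.values())))
    target = [0.0] * dim
    for p, e in zip(probs, basis.values()):
        for i in range(dim):
            target[i] += p * e[i]
    return target

# Hypothetical basis concepts and learnable logits; in training, gradient
# descent on the erasure-preservation loss would tune the logits.
basis = {"Dog": [0.8, 0.3], "Animal": [0.5, 0.5], "Empty": [0.0, 0.0]}
logits = [2.0, 0.0, -2.0]
t = mixture_target(logits, basis)
```

Because the mixture weights live in continuous space, the target can interpolate between discrete concepts, which is what enables the fine-grained tradeoff described above.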
The results demonstrate that AGE achieves near-perfect preservation of unrelated concepts while maintaining effective erasure of targeted ones, substantially outperforming existing methods on the authors' newly introduced NetFive benchmark — a carefully curated evaluation dataset of 25 ImageNet concepts organized into five thematic groups of varying semantic proximity.
The Adversarial Robustness Challenge
A parallel concern is whether erased concepts can be recovered through adversarial attacks. Zhang et al. (2024) address this with defensive unlearning, combining concept erasure with adversarial training to produce models that resist attempts to regenerate erased content through prompt engineering or latent space manipulation. Their approach acknowledges that erasure is only as useful as it is robust — a model that "forgets" a concept only to recall it when prompted creatively provides a false sense of safety.
Complementary work on cross-attention steering (Gaintseva, Oncescu, and Ma, 2025) explores an alternative to weight modification: rather than changing the model's parameters to erase a concept, modify the cross-attention mechanism at inference time to redirect concept activations. This approach has the advantage of being reversible and composable — multiple concepts can be independently controlled without cumulative parameter degradation.
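Inference-time steering can be sketched as a hook on the cross-attention value vectors: wherever the prompt token for the erased concept appears, its value is blended toward a target concept's value, leaving the weights untouched. A toy single-head, single-query attention makes the point; all names and vectors are hypothetical, and the blend is one simple steering choice among several.

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention (toy, no batching)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return [sum(p * v[i] for p, v in zip(probs, values))
            for i in range(len(values[0]))]

def steer_values(values, token_ids, erased_id, target_value, strength=1.0):
    """Blend the erased token's value toward the target value at inference.
    strength=0 recovers the original model, so the edit is reversible."""
    return [[(1 - strength) * v + strength * t
             for v, t in zip(val, target_value)]
            if tid == erased_id else val
            for val, tid in zip(values, token_ids)]

# Hypothetical prompt with two tokens; token id 7 is the erased concept.
values = [[1.0, 0.0], [0.0, 1.0]]
token_ids = [7, 3]
steered = steer_values(values, token_ids, erased_id=7, target_value=[0.0, 0.0])
```

Because steering happens per call, setting strength to zero (or simply removing the hook) restores the original behavior, and independent hooks for different concepts compose without touching the weights — the reversibility and composability noted above.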
Why Concept Erasure Matters Now
The regulatory environment is pushing concept erasure from research curiosity to compliance requirement. The EU AI Act's provisions on prohibited AI practices and high-risk systems create legal obligations to prevent certain types of content generation. Copyright holders are pursuing legal remedies against models trained on their work. And platform safety teams need scalable methods to prevent the generation of child sexual abuse material, deepfakes, and other harmful content.
Current erasure methods operate at the concept level — removing broad categories like "nudity" or "violence." But real-world compliance often requires finer granularity: removing the ability to generate a specific person's likeness while preserving the ability to generate faces generally, or removing a specific artist's style while preserving the broader artistic genre. The gap between what regulators require and what current methods can reliably deliver remains significant.
Open Questions
The fundamental tension in concept erasure is between completeness and preservation. How do you ensure that a concept is truly erased — not merely suppressed in obvious cases while remaining accessible through clever prompting? And how do you verify erasure without an exhaustive adversarial evaluation that might itself be computationally prohibitive?
The locality property discovered by Bui et al. suggests that concept space has a meaningful geometric structure. Understanding this structure more deeply could enable more principled erasure methods — and might reveal whether certain concepts are inherently more difficult to erase because of their centrality in the concept graph.
Looking Forward
The progression from naive fixed-target erasure to geometry-aware adaptive methods represents genuine scientific progress. As generative models become more capable and more widely deployed, the ability to precisely control what they can and cannot produce will become as important as the ability to generate high-quality outputs in the first place.
References
Bui, A.-V., Vu, T., Vuong, L., Le, T., Montague, P., Abraham, T., Kim, J., & Phung, D. (2025). Fantastic targets for concept erasure in diffusion models and where to find them. International Conference on Learning Representations (ICLR) 2025. arXiv:2501.18950.
Zhang, Y., Chen, X., Jia, J., et al. (2024). Defensive unlearning with adversarial training for robust concept erasure in diffusion models. arXiv preprint arXiv:2405.15234.