Paper ReviewAI & Machine LearningMachine/Deep Learning

Constitutional Classifiers: Can We Build Universal Defenses Against LLM Jailbreaks?

Anthropic's Constitutional Classifiers represent a promising jailbreak defenseโ€”surviving thousands of hours of red teaming. But multi-turn attacks and autonomous red teamers are raising the stakes. We examine whether universal defense is achievable.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The arms race between jailbreak attackers and AI safety defenders entered a new phase in 2025. Anthropic's Constitutional Classifiers paper presents the most ambitious defense mechanism to date: a system designed to resist universal jailbreaksโ€”prompting strategies that systematically bypass safeguards across entire model families. But the attackers are not standing still. Multi-turn red teaming frameworks and autonomous attack agents are evolving in parallel, raising a fundamental question: is universal jailbreak defense even theoretically possible?

The Landscape: Offense vs. Defense in 2025

Large language models are vulnerable to jailbreaksโ€”carefully crafted prompts that circumvent safety training and elicit harmful outputs. The threat taxonomy has evolved significantly:

First-generation attacks (2023-2024): Single-turn prompt manipulationโ€”role-playing ("You are DAN"), encoding tricks, or adversarial suffixes. Relatively easy to patch.

Second-generation attacks (2025): Multi-turn dialogue exploitationโ€”spreading malicious intent across innocuous-seeming conversation turns. Guo et al.'s MTSA framework demonstrates that malicious intentions can be hidden across multi-round dialogues, making LLMs more prone to produce harmful responses than in single-turn interactions.

Third-generation attacks (2025-2026): Autonomous red teaming agents that learn to attack. Zhou et al.'s AutoRedTeamer uses lifelong learning to accumulate attack strategies, adapting to defenses in real time. This addresses a key gap in existing red teaming: the reliance on human input and limited coverage of emerging attack vectors.

Constitutional Classifiers: The Defense Architecture

Sharma et al.'s approach is architecturally elegant. Rather than relying on the LLM itself to refuse harmful requestsโ€”a fragile approach since the same model that generates responses also judges safetyโ€”they introduce an external classifier trained on constitutional principles.

The key innovations:

  • Separation of concerns: The safety classifier is architecturally distinct from the generation model. Attacking the generator does not automatically compromise the classifier.
  • Constitutional training: The classifier is trained not on specific harmful examples (which can be circumvented by paraphrasing) but on abstract principles of harm. This enables generalization to novel attack vectors.
  • Multi-layer deployment: Classifiers operate on both the input (detecting malicious prompts) and output (detecting harmful completions), creating defense in depth.
  • Adversarial training: The classifier was refined through thousands of hours of red teaming, making it robust against known attack categories.
  • The result: during evaluation, Constitutional Classifiers blocked universal jailbreaks that succeeded against all other tested defenses, while maintaining acceptable false-positive rates on benign requests.

    The Multi-Turn Threat

    The most significant challenge to Constitutional Classifiers comes from multi-turn attacks. Guo et al.'s MTSA framework reveals a structural vulnerability: safety alignment degrades as conversation length increases.

    Their framework shows that hiding malicious intent across multiple conversational turns exploits the tension between helpfulness and safety: a model that refuses too aggressively becomes unusable; a model that accommodates conversational context creates attack surface. A model that refuses too aggressively in long conversations becomes unusable; a model that accommodates conversational context creates attack surface.

    Research on Monte Carlo Tree Search-based red teaming explores multi-turn attack trees to find optimal exploitation paths. The sophistication gap between defense (rule-based classifiers) and offense (tree-search optimization) is concerning.

    The Measurement Problem

    Chouldechova et al. (2026) deliver a methodological critique that undermines much of the existing safety literature: attack success rate (ASR) comparisons are often invalid. Their argument:

    • ASR depends on the distribution of attacks tested, not just their quantity
    • Two defenses with identical ASR may have completely different vulnerability profiles
    • Reporting "X% of attacks blocked" without specifying the attack distribution is scientifically meaningless
    This finding implies that many published safety benchmarksโ€”including those used to evaluate Constitutional Classifiersโ€”may overstate robustness if the attack distribution is not representative of real-world threats.

    Claims and Evidence

    <
    ClaimEvidenceVerdict
    Constitutional Classifiers resist universal jailbreaksSurvived thousands of hours of red teamingโœ… Supported (with caveats)
    Multi-turn attacks degrade all current defensesMTSA demonstrates dramatic safety degradation over extended dialogueโœ… Strongly supported
    Autonomous red teamers will outpace static defensesAutoRedTeamer shows lifelong adaptation capabilityโš ๏ธ Plausible but early
    Current ASR metrics are scientifically validChouldechova et al. demonstrate fundamental measurement flawsโŒ Refuted

    Open Questions

  • Is there a theoretical limit to jailbreak defense? If language is inherently ambiguous and harmful intent can always be encoded in innocuous language, universal defense may be provably impossible.
  • Classifier arms race: If attackers gain access to the classifier (through model extraction or insider access), can they train adversarial prompts specifically to evade it?
  • Cultural relativity of harm: Constitutional Classifiers encode harm principles that reflect specific cultural and legal norms. How do they handle content that is harmful in one jurisdiction but legal in another?
  • Computational overhead: External classifiers add latency. At what point does the safety tax on inference speed become unacceptable for real-time applications?
  • The agent escalation: As LLMs become autonomous agents with tool access, jailbreaks escalate from generating harmful text to executing harmful actions. Do current defense architectures transfer to the agent paradigm?
  • What This Means for Your Research

    For AI safety researchers, Constitutional Classifiers represent the current state of the artโ€”but not the end state. The multi-turn vulnerability and autonomous red teaming results suggest that static defenses will always eventually be overcome by adaptive attacks. The future likely requires:

    • Dynamic defense that adapts its sensitivity based on conversation trajectory
    • Formal verification of safety properties, not just empirical testing
    • Defense-in-depth architectures that combine multiple independent safety mechanisms
    For practitioners deploying LLMs, the immediate takeaway: never rely on a single safety mechanism. Constitutional Classifiers should be one layer among many, including output monitoring, rate limiting, and human-in-the-loop escalation for high-risk queries.

    References (4)

    [1] Sharma, M., Tong, M., Mu, J. et al. (2025). Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. arXiv:2501.18837.
    [2] Guo, W., Li, J., Wang, W. et al. (2025). MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming. arXiv:2505.17147.
    [3] Zhou, A., Wu, K., Pinto, F. et al. (2025). AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration. arXiv:2503.15754.
    [4] Chouldechova, A., Cooper, A., Barocas, S. et al. (2026). Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming. arXiv:2601.18076.

    Explore this topic deeper

    Search 290M+ papers, detect research gaps, and find what hasn't been studied yet.

    Click to remove unwanted keywords

    Search 8 keywords โ†’