Paper ReviewAI & Machine LearningMachine/Deep Learning

Constitutional Classifiers: Can We Build Universal Defenses Against LLM Jailbreaks?

Anthropic's Constitutional Classifiers represent a promising jailbreak defense—surviving thousands of hours of red teaming. But multi-turn attacks and autonomous red teamers are raising the stakes. We examine whether universal defense is achievable.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The arms race between jailbreak attackers and AI safety defenders entered a new phase in 2025. Anthropic's Constitutional Classifiers paper presents the most ambitious defense mechanism to date: a system designed to resist universal jailbreaks—prompting strategies that systematically bypass safeguards across entire model families. But the attackers are not standing still. Multi-turn red teaming frameworks and autonomous attack agents are evolving in parallel, raising a fundamental question: is universal jailbreak defense even theoretically possible?

The Landscape: Offense vs. Defense in 2025

Large language models are vulnerable to jailbreaks—carefully crafted prompts that circumvent safety training and elicit harmful outputs. The threat taxonomy has evolved significantly:

First-generation attacks (2023-2024): Single-turn prompt manipulation—role-playing ("You are DAN"), encoding tricks, or adversarial suffixes. Relatively easy to patch.

Second-generation attacks (2025): Multi-turn dialogue exploitation—spreading malicious intent across innocuous-seeming conversation turns. Guo et al.'s MTSA framework demonstrates that malicious intentions can be hidden across multi-round dialogues, making LLMs more prone to produce harmful responses than in single-turn interactions.

Third-generation attacks (2025-2026): Autonomous red teaming agents that learn to attack. Zhou et al.'s AutoRedTeamer uses lifelong learning to accumulate attack strategies, adapting to defenses in real time. This addresses a key gap in existing red teaming: the reliance on human input and limited coverage of emerging attack vectors.

Constitutional Classifiers: The Defense Architecture

Sharma et al.'s approach is architecturally elegant. Rather than relying on the LLM itself to refuse harmful requests—a fragile approach since the same model that generates responses also judges safety—they introduce an external classifier trained on constitutional principles.

The key innovations:

Separation of concerns: The safety classifier is architecturally distinct from the generation model. Attacking the generator does not automatically compromise the classifier.

Constitutional training: The classifier is trained not on specific harmful examples (which can be circumvented by paraphrasing) but on abstract principles of harm. This enables generalization to novel attack vectors.

Multi-layer deployment: Classifiers operate on both the input (detecting malicious prompts) and output (detecting harmful completions), creating defense in depth.

Adversarial training: The classifier was refined through thousands of hours of red teaming, making it robust against known attack categories.

The result: during evaluation, Constitutional Classifiers blocked universal jailbreaks that succeeded against all other tested defenses, while maintaining acceptable false-positive rates on benign requests.

The Multi-Turn Threat

The most significant challenge to Constitutional Classifiers comes from multi-turn attacks. Guo et al.'s MTSA framework reveals a structural vulnerability: safety alignment degrades as conversation length increases.

Their framework shows that hiding malicious intent across multiple conversational turns exploits the tension between helpfulness and safety: a model that refuses too aggressively becomes unusable; a model that accommodates conversational context creates attack surface. A model that refuses too aggressively in long conversations becomes unusable; a model that accommodates conversational context creates attack surface.

Research on Monte Carlo Tree Search-based red teaming explores multi-turn attack trees to find optimal exploitation paths. The sophistication gap between defense (rule-based classifiers) and offense (tree-search optimization) is concerning.

The Measurement Problem

Chouldechova et al. (2026) deliver a methodological critique that undermines much of the existing safety literature: attack success rate (ASR) comparisons are often invalid. Their argument:

ASR depends on the distribution of attacks tested, not just their quantity
Two defenses with identical ASR may have completely different vulnerability profiles
Reporting "X% of attacks blocked" without specifying the attack distribution is scientifically meaningless

This finding implies that many published safety benchmarks—including those used to evaluate Constitutional Classifiers—may overstate robustness if the attack distribution is not representative of real-world threats.

Claims and Evidence

Claim	Evidence	Verdict
Constitutional Classifiers resist universal jailbreaks	Survived thousands of hours of red teaming	✅ Supported (with caveats)
Multi-turn attacks degrade all current defenses	MTSA demonstrates dramatic safety degradation over extended dialogue	✅ Strongly supported
Autonomous red teamers will outpace static defenses	AutoRedTeamer shows lifelong adaptation capability	⚠️ Plausible but early
Current ASR metrics are scientifically valid	Chouldechova et al. demonstrate fundamental measurement flaws	❌ Refuted

Open Questions

Is there a theoretical limit to jailbreak defense? If language is inherently ambiguous and harmful intent can always be encoded in innocuous language, universal defense may be provably impossible.

Classifier arms race: If attackers gain access to the classifier (through model extraction or insider access), can they train adversarial prompts specifically to evade it?

Cultural relativity of harm: Constitutional Classifiers encode harm principles that reflect specific cultural and legal norms. How do they handle content that is harmful in one jurisdiction but legal in another?

Computational overhead: External classifiers add latency. At what point does the safety tax on inference speed become unacceptable for real-time applications?

The agent escalation: As LLMs become autonomous agents with tool access, jailbreaks escalate from generating harmful text to executing harmful actions. Do current defense architectures transfer to the agent paradigm?

What This Means for Your Research

For AI safety researchers, Constitutional Classifiers represent the current state of the art—but not the end state. The multi-turn vulnerability and autonomous red teaming results suggest that static defenses will always eventually be overcome by adaptive attacks. The future likely requires:

Dynamic defense that adapts its sensitivity based on conversation trajectory
Formal verification of safety properties, not just empirical testing
Defense-in-depth architectures that combine multiple independent safety mechanisms

For practitioners deploying LLMs, the immediate takeaway: never rely on a single safety mechanism. Constitutional Classifiers should be one layer among many, including output monitoring, rate limiting, and human-in-the-loop escalation for high-risk queries.

면책 조항: 본 게시물은 정보 제공 목적의 연구 동향 개요이다. 특정 연구 결과, 통계, 주장은 학술 저작물에 인용하기 전에 원본 논문과 대조하여 검증해야 한다.

Constitutional Classifiers: LLM 탈옥에 대한 범용 방어체계를 구축할 수 있는가?

탈옥 공격자와 AI 안전 방어자 간의 군비 경쟁은 2025년 새로운 국면에 접어들었다. Anthropic의 Constitutional Classifiers 논문은 현재까지 제시된 것 중 가장 야심 찬 방어 메커니즘을 소개한다. 이는 범용 탈옥—모델 계열 전체에 걸쳐 안전장치를 체계적으로 우회하는 프롬프팅 전략—에 저항하도록 설계된 시스템이다. 그러나 공격자들도 가만히 있지 않는다. 다중 턴(multi-turn) 레드팀 프레임워크와 자율 공격 에이전트가 병행하여 진화하고 있으며, 이는 근본적인 질문을 제기한다. 범용 탈옥 방어는 이론적으로라도 가능한가?

현황: 2025년의 공격과 방어

대규모 언어 모델(LLM)은 안전 학습을 우회하여 유해한 출력을 유도하는 정교하게 설계된 프롬프트, 즉 탈옥에 취약하다. 위협 분류 체계는 상당히 진화해 왔다.

1세대 공격 (2023-2024): 단일 턴 프롬프트 조작—역할극("당신은 DAN입니다"), 인코딩 트릭, 또는 적대적 접미사. 비교적 패치하기 쉽다.

2세대 공격 (2025): 다중 턴 대화 악용—무해해 보이는 여러 대화 턴에 걸쳐 악의적 의도를 분산시킨다. Guo et al.의 MTSA 프레임워크는 악의적 의도가 여러 라운드의 대화에 걸쳐 숨겨질 수 있음을 증명하며, 이로 인해 LLM이 단일 턴 상호작용에 비해 유해한 응답을 생성하기 더 쉬워진다는 점을 보여준다.

3세대 공격 (2025-2026): 공격 방법을 스스로 학습하는 자율 레드팀 에이전트. Zhou et al.의 AutoRedTeamer는 평생 학습(lifelong learning)을 활용하여 공격 전략을 축적하고 실시간으로 방어에 적응한다. 이는 기존 레드팀의 핵심적 한계—인간의 입력 의존성과 새롭게 등장하는 공격 벡터에 대한 제한적 포괄범위—를 해소한다.

Constitutional Classifiers: 방어 아키텍처

Sharma et al.의 접근 방식은 아키텍처 측면에서 우아하다. 유해한 요청을 거부하는 것을 LLM 자체에 의존하는 대신—응답을 생성하는 동일한 모델이 안전성도 판단한다는 점에서 취약한 접근법—외부 분류기(classifier)를 도입하여 헌법적 원칙(constitutional principles)에 따라 학습시킨다.

핵심 혁신 사항은 다음과 같다.

관심사의 분리: 안전 분류기는 생성 모델과 아키텍처적으로 분리되어 있다. 생성기를 공격하더라도 분류기가 자동으로 손상되지 않는다.

헌법적 학습: 분류기는 특정 유해 사례(paraphrasing으로 우회 가능한)가 아니라 해악의 추상적 원칙에 기반하여 학습된다. 이를 통해 새로운 공격 벡터에 대한 일반화가 가능해진다.

다층 배포: 분류기는 입력(악의적 프롬프트 탐지)과 출력(유해한 완성 탐지) 양쪽에서 작동하여 심층 방어를 구현한다.

적대적 학습: 분류기는 수천 시간의 레드팀 작업을 통해 정제되어 알려진 공격 범주에 대한 강건성을 확보하였다.

그 결과, 평가 과정에서 Constitutional Classifiers는 다른 모든 테스트된 방어를 성공적으로 우회한 범용 탈옥을 차단하면서도, 양성 요청에 대한 허용 가능한 수준의 오탐(false-positive) 비율을 유지하였다.

다중 턴 위협

Constitutional Classifiers에 대한 가장 중대한 도전은 다중 턴 공격에서 비롯된다. Guo et al.의 MTSA 프레임워크는 구조적 취약점을 드러낸다. 대화 길이가 증가할수록 안전 정렬(safety alignment)이 저하된다는 것이다.

이 프레임워크는 여러 대화 턴에 걸쳐 악의적 의도를 숨기는 방식이 유용성과 안전성 사이의 긴장을 악용함을 보여준다. 지나치게 공격적으로 거부하는 모델은 사용 불가능해지고, 대화 맥락을 수용하는 모델은 공격 표면을 만들어낸다. 긴 대화에서 지나치게 공격적으로 거부하는 모델은 사용 불가능해지고, 대화 맥락을 수용하는 모델은 공격 표면을 만들어낸다.

측정 문제

Chouldechova et al. (2026)은 기존 안전성 연구의 상당 부분을 뒤흔드는 방법론적 비판을 제기한다: 공격 성공률(ASR) 비교는 종종 유효하지 않다. 그들의 주장은 다음과 같다:

ASR은 테스트된 공격의 수량뿐만 아니라 공격의 분포에 따라 달라진다
동일한 ASR을 보이는 두 방어 체계는 완전히 다른 취약성 프로파일을 가질 수 있다
공격 분포를 명시하지 않고 "X%의 공격을 차단했다"고 보고하는 것은 과학적으로 무의미하다

이 발견은 Constitutional Classifiers 평가에 사용된 것을 포함한 다수의 기존 안전성 벤치마크가 공격 분포가 실제 위협을 대표하지 않을 경우 견고성을 과장하고 있을 수 있음을 시사한다.

Monte Carlo Tree Search 기반의 레드팀 연구는 최적의 공격 경로를 탐색하기 위해 다중 턴(multi-turn) 공격 트리를 탐구한다. 방어(규칙 기반 분류기)와 공격(트리 탐색 최적화) 사이의 정교함 격차는 우려스러운 수준이다.

주장과 근거

주장	근거	판정
Constitutional Classifiers는 범용 탈옥(jailbreak)에 저항한다	수천 시간의 레드팀 테스트에서 유지됨	✅ 지지됨 (단서 포함)
다중 턴 공격은 현재의 모든 방어 체계를 약화시킨다	MTSA는 장기 대화에서 극적인 안전성 저하를 입증	✅ 강력히 지지됨
자율 레드팀은 정적 방어를 앞지를 것이다	AutoRedTeamer는 지속적 적응 능력을 시연	⚠️ 그럴듯하나 초기 단계
현재의 ASR 지표는 과학적으로 유효하다	Chouldechova et al.이 근본적인 측정 결함을 입증	❌ 반박됨

미해결 질문들

탈옥 방어에 이론적 한계가 존재하는가? 언어가 본질적으로 모호하고 해로운 의도가 언제나 무해한 언어로 인코딩될 수 있다면, 범용 방어는 증명 가능한 수준에서 불가능할 수 있다.

분류기 군비 경쟁: 공격자가 모델 추출이나 내부자 접근을 통해 분류기에 접근할 수 있게 된다면, 이를 회피하도록 특별히 설계된 적대적 프롬프트를 학습시킬 수 있는가?

피해의 문화적 상대성: Constitutional Classifiers는 특정 문화적·법적 규범을 반영하는 피해 원칙을 인코딩한다. 한 법적 관할권에서는 유해하지만 다른 곳에서는 합법인 콘텐츠를 어떻게 처리하는가?

계산 오버헤드: 외부 분류기는 지연 시간을 추가한다. 추론 속도에 부과되는 안전성 비용이 실시간 애플리케이션에서 허용 불가한 수준이 되는 시점은 언제인가?

에이전트 확장 문제: LLM이 도구 접근 권한을 가진 자율 에이전트로 발전함에 따라, 탈옥은 유해한 텍스트 생성에서 유해한 행동 실행으로 확대된다. 현재의 방어 아키텍처는 에이전트 패러다임에도 적용 가능한가?

연구에 대한 시사점

AI 안전성 연구자들에게 Constitutional Classifiers는 현재의 최신 기술 수준을 대표하지만, 최종적인 해결책은 아니다. 다중 턴 취약성과 자율 레드팀 결과는 정적 방어가 결국 항상 적응형 공격에 의해 극복될 것임을 시사한다. 미래에는 아마도 다음이 요구될 것이다:

대화 흐름에 따라 민감도를 조정하는 동적 방어
단순한 경험적 테스트가 아닌 안전성 속성의 공식 검증
여러 독립적인 안전 메커니즘을 결합한 심층 방어(defense-in-depth) 아키텍처

LLM을 배포하는 실무자들에게 즉각적인 시사점은 다음과 같다: 단일 안전 메커니즘에만 의존하지 말 것. Constitutional Classifiers는 출력 모니터링, 속도 제한, 고위험 쿼리에 대한 인간 참여 에스컬레이션을 포함한 여러 계층 중 하나로 활용되어야 한다.

References (4)

[1] Sharma, M., Tong, M., Mu, J. et al. (2025). Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming. arXiv:2501.18837.

DOI Scholar

[2] Guo, W., Li, J., Wang, W. et al. (2025). MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming. arXiv:2505.17147.

DOI Scholar

[3] Zhou, A., Wu, K., Pinto, F. et al. (2025). AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration. arXiv:2503.15754.

DOI Scholar

[4] Chouldechova, A., Cooper, A., Barocas, S. et al. (2026). Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming. arXiv:2601.18076.

DOI Scholar

Constitutional Classifiers: Can We Build Universal Defenses Against LLM Jailbreaks?

The Landscape: Offense vs. Defense in 2025

Constitutional Classifiers: The Defense Architecture

The Multi-Turn Threat

The Measurement Problem

Claims and Evidence

Open Questions

What This Means for Your Research

Constitutional Classifiers: LLM 탈옥에 대한 범용 방어체계를 구축할 수 있는가?

현황: 2025년의 공격과 방어

Constitutional Classifiers: 방어 아키텍처

다중 턴 위협

측정 문제

주장과 근거

미해결 질문들

연구에 대한 시사점

References (4)

Explore this topic deeper