
The Multi-Turn Attack Surface: Why Single-Turn Safety Tests Miss the Real Threats

LLMs that pass single-turn safety tests can fail badly in multi-turn conversations. MTSA reports substantial safety degradation over extended dialogues, while MUSE uses Monte Carlo Tree Search to systematically discover multi-turn attack paths. The implications for deployed conversational AI are urgent.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

There is a significant gap at the heart of LLM safety evaluation. Models pass single-turn safety benchmarks with impressive scores, correctly refusing the vast majority of harmful queries. Developers publish these numbers. Users trust them. Regulators cite them. Yet the numbers measure the wrong thing.

Real users do not interact with language models in single turns. They have conversations—extended dialogues where context accumulates, rapport develops, and the boundary between innocent curiosity and harmful intent blurs across dozens of exchanges. In this multi-turn setting, the same models that appear robustly safe in single-turn evaluation become alarmingly vulnerable.

Guo et al.'s MTSA framework quantifies the magnitude of this gap with devastating clarity: safety alignment degrades dramatically over extended multi-turn conversations, with models that appear robust in single-turn evaluation becoming highly vulnerable as dialogue lengthens. This is not a marginal failure. It is a collapse—and it happens through conversational manipulation strategies that are subtle enough to evade per-turn safety classifiers yet effective enough to systematically extract harmful content.

The Anatomy of Multi-Turn Attacks

Multi-turn attacks exploit a fundamental property of conversational AI: context dependence. Each response is conditioned on the full conversation history. An attacker who can shape that history controls the context in which the model interprets subsequent queries.

The attack strategies documented across these papers fall into four categories:

Gradual escalation: The attacker begins with completely benign queries and incrementally shifts toward harmful territory. Each individual step is too small to trigger safety classifiers, but the cumulative trajectory reaches harmful content. Like the proverbial frog in slowly heating water, the model's safety guardrails relax gradually rather than being overwhelmed at once.

Context manipulation: The attacker establishes a fictional framing ("Imagine you're writing a thriller novel where the villain needs to...") that provides plausible deniability for each individual query while creating a context where harmful information flows naturally.

Persona exploitation: Through extended dialogue, the attacker encourages the model to adopt a specific persona—an expert, a historical figure, a fictional character—whose "expertise" justifies providing information the base model would refuse.

Trust building: The attacker engages in genuine, helpful conversation for many turns, building a pattern of positive interaction that causes the model to lower its guard for subsequent harmful requests.
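To make the gradual-escalation point concrete, here is a minimal sketch of why a per-turn check can stay silent while the conversation as a whole drifts. Everything in it is an assumption for illustration: the risk scores are made-up numbers standing in for the output of some per-turn classifier, and `PER_TURN_THRESHOLD`, `DRIFT_THRESHOLD`, and both function names are hypothetical, not taken from any of the papers.

```python
from typing import List

PER_TURN_THRESHOLD = 0.5  # hypothetical: a single turn scoring above this is blocked
DRIFT_THRESHOLD = 0.3     # hypothetical: total rise in risk that triggers a review


def per_turn_flags(scores: List[float]) -> List[bool]:
    """Judge each turn in isolation, as a per-turn classifier would."""
    return [s > PER_TURN_THRESHOLD for s in scores]


def trajectory_flag(scores: List[float]) -> bool:
    """Flag the dialogue if risk has risen by more than DRIFT_THRESHOLD since turn one."""
    if not scores:
        return False
    return max(scores) - scores[0] > DRIFT_THRESHOLD


# Made-up scores for a dialogue that escalates in small steps.
escalating = [0.05, 0.12, 0.20, 0.28, 0.36, 0.44]

print(any(per_turn_flags(escalating)))  # False: no single turn crosses 0.5
print(trajectory_flag(escalating))      # True: risk rose by about 0.39 overall
```

The drift rule here is deliberately crude. A real monitor would probably need something more robust than a max-minus-first difference, but the gap between the two checks is the point.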

MUSE: Searching the Attack Tree

Yan et al.'s MUSE framework applies Monte Carlo Tree Search (MCTS) to multi-turn red teaming—treating the attack as a sequential decision problem where each turn represents a branch in a tree of possible conversation trajectories.

The approach is powerful because it is systematic. Rather than relying on human red teamers to manually craft multi-turn attacks (an expensive, slow, and incomplete process), MUSE automatically explores the space of possible conversation strategies, evaluating which sequences of prompts most effectively degrade safety alignment.

MCTS brings two crucial properties to red teaming:

Exploration-exploitation balance: The algorithm balances exploring novel attack strategies (which might discover unexpected vulnerabilities) with exploiting known effective strategies (which efficiently validates known weaknesses).

Depth: MCTS copes with the combinatorial explosion of possible multi-turn conversations by spending few simulations on unpromising branches and concentrating search on trajectories that are more likely to succeed.
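The exploration-exploitation trade-off is easiest to see in the textbook UCT selection rule, sketched below. This is the generic rule from the MCTS literature, not necessarily the selection policy MUSE uses; the `Node` fields, the labels, and the reward values are hypothetical placeholders for "a conversation state" and "an evaluator's score."

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                  # hypothetical identifier for a conversation state
    visits: int = 0             # how many simulations passed through this node
    total_reward: float = 0.0   # sum of evaluator scores backed up through it
    children: List["Node"] = field(default_factory=list)


def uct_score(parent: Node, child: Node, c: float = math.sqrt(2)) -> float:
    """Mean reward (exploitation) plus a bonus for rarely visited children (exploration)."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(parent, ch))


a = Node("a", visits=8, total_reward=4.0)  # mean reward 0.5, well explored
b = Node("b", visits=2, total_reward=0.8)  # mean reward 0.4, barely explored
root = Node("root", visits=10, children=[a, b])

print(select_child(root).label)  # b: the exploration bonus outweighs a's higher mean
```

With `c` set to 0 the rule becomes purely greedy and would pick `a`. The constant is what keeps the search from locking onto the first trajectory that looks good.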

The discovered attack paths are often non-obvious—involving conversation strategies that no human red teamer explicitly considered. This is both the method's strength (finding novel vulnerabilities) and its limitation (the attacks may not represent realistic human behavior).

The Agent Escalation

AJAR (Dou & Yang, 2026) introduces the most consequential evolution in the red teaming landscape: attacks on AI agents rather than chatbots. As LLMs transition from text generators to autonomous agents with tool access, the stakes of jailbreaking escalate from "generating harmful text" to "executing harmful actions."

An agent that has been manipulated through multi-turn dialogue might not just describe how to exfiltrate data—it might execute the exfiltration using its available tools. The boundary between information hazard and operational hazard collapses when the model can act on its outputs.

AJAR's adaptive architecture automatically discovers attack strategies that combine textual manipulation with tool-use exploitation, finding that agent safety systems designed for single-turn tool calls are systematically vulnerable to multi-turn context manipulation that reframes harmful tool use as legitimate workflow steps.
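One way to picture the defensive side of this finding is a tool-call gate that looks at the whole dialogue rather than only the current turn. The sketch below is an illustration under stated assumptions, not AJAR's method or any real agent framework's API: `ToolCall`, its `sensitive` flag, the per-turn risk scores, and the `ceiling` value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToolCall:
    name: str
    sensitive: bool  # hypothetical flag a deployer would assign per tool


def allow_single_turn(call: ToolCall, current_risk: float, ceiling: float = 0.6) -> bool:
    """Gate a sensitive call on the current turn's risk only."""
    return (not call.sensitive) or current_risk < ceiling


def allow_with_history(call: ToolCall, turn_risk: List[float], ceiling: float = 0.6) -> bool:
    """Gate a sensitive call on the peak risk seen anywhere in the dialogue so far."""
    if not call.sensitive:
        return True
    return max(turn_risk, default=0.0) < ceiling


# Made-up scores: a risky request at turn 3, then a calm "workflow step" at turn 4.
history = [0.1, 0.3, 0.7, 0.2]
export = ToolCall(name="export_records", sensitive=True)

print(allow_single_turn(export, history[-1]))  # True: the last turn looks harmless
print(allow_with_history(export, history))     # False: the earlier spike still counts
```

Using the peak is a blunt choice that will over-block in practice. It is here only to show how the same call can pass a single-turn check and fail a history-aware one.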

AutoRedTeamer: The Arms Race Accelerates

Zhou et al.'s AutoRedTeamer represents the natural endpoint of automated red teaming: an AI system that not only discovers attacks but learns from its discoveries to find better attacks over time. Using a lifelong learning architecture, AutoRedTeamer maintains a growing library of successful attack strategies and combines them to create novel attacks that evade defenses calibrated against known strategies.
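The idea of a growing library can be sketched as simple bookkeeping: record how often each strategy category succeeded in evaluation, then rank combinations to try next. This is only an illustration. The `StrategyLibrary` class, the product-of-success-rates ranking, and the recorded outcomes are all assumptions made for the example and are not drawn from the AutoRedTeamer paper.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


class StrategyLibrary:
    """Tracks evaluation outcomes per strategy label (labels are hypothetical)."""

    def __init__(self) -> None:
        # label -> [successes, trials]
        self._stats: Dict[str, List[int]] = defaultdict(lambda: [0, 0])

    def record(self, strategy: str, success: bool) -> None:
        entry = self._stats[strategy]
        entry[0] += int(success)
        entry[1] += 1

    def success_rate(self, strategy: str) -> float:
        successes, trials = self._stats.get(strategy, (0, 0))
        return successes / trials if trials else 0.0

    def top_pairs(self, k: int = 3) -> List[Tuple[str, str]]:
        """Rank pairs of known strategies by the product of their success rates."""
        pairs = combinations(sorted(self._stats), 2)
        score = lambda p: self.success_rate(p[0]) * self.success_rate(p[1])
        return sorted(pairs, key=score, reverse=True)[:k]


library = StrategyLibrary()
for outcome in (True, True, False):
    library.record("escalation", outcome)
for outcome in (True, False):
    library.record("fictional_framing", outcome)
library.record("persona", False)

print(library.top_pairs(k=1))  # [('escalation', 'fictional_framing')]
```

The same records are just as useful to a defender, since they show which categories a fixed classifier is currently missing.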

The implication is an accelerating arms race where attack sophistication grows continuously. Static defenses—fixed safety classifiers, hardcoded refusal patterns—are fundamentally inadequate against an adversary that adapts. The defense must be equally adaptive, continuously monitoring for novel attack patterns and updating safety mechanisms in response.

Claims and Evidence

Claim	Evidence	Verdict
Multi-turn safety is dramatically worse than single-turn	MTSA demonstrates substantial safety degradation over multi-turn dialogue	✅ Strongly supported
MCTS finds attacks that humans miss	MUSE discovers non-obvious multi-turn strategies	✅ Supported
Agent jailbreaks can cause real-world harm (not just text)	AJAR demonstrates tool-use exploitation via dialogue manipulation	✅ Supported (conceptual)
Static defenses resist adaptive attacks	AutoRedTeamer bypasses fixed classifiers through continuous adaptation	❌ Refuted
Current multi-turn safety benchmarks are adequate	Massive gap between benchmark coverage and real attack surface	❌ Refuted

Open Questions

Defense-in-depth for dialogue: What is the right architecture for multi-turn safety? Per-turn classifiers fail. Should we add conversation-level safety monitors that analyze the trajectory, not just the current turn?

User intent modeling: Can we distinguish between users who are genuinely curious (a chemistry student asking about reactions) and users who are strategically escalating toward harm? The distinction is crucial for avoiding both false positives and false negatives.

The disclosure dilemma: Automated red teaming tools discover novel attacks. Should discovered attack strategies be published (enabling defense research) or restricted (preventing misuse)? The information security community has debated responsible disclosure for decades, but the scale and accessibility of LLM attacks introduces new dynamics.

Regulatory implications: If models that pass current safety benchmarks are demonstrably unsafe in multi-turn settings, should regulators require multi-turn evaluation? What should the standard be?

Computational cost of safety: Monitoring every conversation turn for safety in real time adds latency and cost. For models serving billions of queries, this cost is substantial. How do we build safety systems that are both thorough and efficient?

What This Means for Your Research

For AI safety researchers, multi-turn red teaming is no longer optional—it is the minimum viable evaluation for conversational AI. Single-turn benchmarks should be retired as the primary safety metric for any model deployed in dialogue settings.
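In practice, a multi-turn evaluation can report the unsafe-response rate as a function of conversation depth instead of a single aggregate number. The sketch below assumes a hypothetical log format of `(turn_index, was_unsafe)` pairs. The function name is illustrative and the records are invented for the example, so they should not be read as results from MTSA or any other paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def unsafe_rate_by_turn(records: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Return the fraction of unsafe responses observed at each turn index."""
    counts: Dict[int, List[int]] = defaultdict(lambda: [0, 0])  # turn -> [unsafe, total]
    for turn, unsafe in records:
        counts[turn][0] += int(unsafe)
        counts[turn][1] += 1
    return {turn: u / n for turn, (u, n) in sorted(counts.items())}


# Invented evaluation log: four conversations sampled at turns 1, 5, and 10.
records = [
    (1, False), (1, False), (1, False), (1, False),
    (5, True), (5, False), (5, False), (5, False),
    (10, True), (10, True), (10, False), (10, False),
]

print(unsafe_rate_by_turn(records))  # {1: 0.0, 5: 0.25, 10: 0.5}
```

A single-turn benchmark reports only the first entry of that dictionary, which is exactly the number that looks best.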

For deployed systems, the dramatic multi-turn degradation documented by MTSA means that current safety margins are far thinner than they appear. Organizations deploying conversational AI should implement conversation-level monitoring that flags escalation patterns, not just individual harmful queries.

For the broader community, the transition from chatbot to agent safety represents a qualitative shift in risk. When AI systems can act—not just speak—the consequences of safety failures move from reputational damage to operational damage. The multi-turn attack surface is where these failures will originate, and the current state of defense is inadequate to the threat.

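Disclaimer: This post is an overview of research trends provided for informational purposes. Specific findings, statistics, and claims should be verified against the original papers before being cited in academic work.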


References

[1] Guo, W., Li, J., Wang, W. et al. (2025). MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming. arXiv:2505.17147.


[2] Yan, S., Zeng, L., Wu, X. et al. (2025). MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety. arXiv:2509.14651.


[3] Zhou, A., Wu, K., Pinto, F. et al. (2025). AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration. arXiv:2503.15754.


[4] Dou, Y. & Yang, W. (2026). AJAR: Adaptive Jailbreak Architecture for Red-teaming. arXiv:2601.10971.

