Safe RLHF-V: The Unsolved Problem of Making Multimodal AI Both Helpful and Harmless

Multimodal LLMs that see images and generate text face safety risks that text-only alignment cannot address. Safe RLHF-V proposes decoupled optimization of helpfulness and safety—but a sociotechnical critique argues the entire RLHF paradigm has fundamental limits.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Aligning text-only language models was hard enough. Aligning multimodal models, systems that process images, video, and text simultaneously, is substantially harder. An image can contain harmful content that is invisible to text-based safety filters. A benign text query combined with a manipulated image can elicit responses that neither the query nor the image would trigger alone. The attack surface is not additive; it is multiplicative, because risk arises from combinations of inputs and not only from each input on its own.

Ji et al.'s Safe RLHF-V, published in two complementary papers, represents the most serious attempt to date at principled multimodal safety alignment. But Lindström et al.'s sociotechnical critique in Ethics and Information Technology argues that the entire enterprise of alignment through human feedback may be built on foundations that cannot bear the weight placed upon them.

The tension between these positions—one engineering safety solutions, the other questioning whether such solutions are conceptually coherent—defines the frontier of AI alignment research in 2025.

The Multimodal Safety Gap

Text-only safety alignment works, to a first approximation, by teaching the model which kinds of text outputs are acceptable. The model learns that generating instructions for weapons is unacceptable regardless of how cleverly the request is phrased. But multimodal models process images—and images introduce an entirely new dimension of risk.

Consider: a user uploads an image of a household chemical and asks "What happens if I combine this with bleach?" The image alone is benign. The question alone is benign. Together, they constitute a request for instructions to create toxic gas. Text-only safety classifiers see only the question and pass it through. The multimodal model, seeing both image and text, must recognize the compositional risk—a capability that requires understanding not just what the image contains but what it means in the context of the query.

Safe RLHF-V addresses this through a decoupled optimization framework. Rather than training a single reward model that conflates helpfulness and safety (as standard RLHF does), the authors train separate reward and cost models:

Reward model: Evaluates how helpful and informative a response is
Cost model: Evaluates how potentially harmful a response is

The policy is then optimized to maximize reward subject to safety constraints—a constrained optimization problem that avoids the failure mode where a model becomes "safe" by becoming uselessly cautious. The constraint threshold is tunable, allowing deployment-specific calibration of the helpfulness-safety tradeoff.
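
One standard way to handle a "maximize reward subject to a cost limit" problem is Lagrangian relaxation: the cost is subtracted from the reward with a weight (the multiplier) that rises while the constraint is violated and falls once it is satisfied. The sketch below illustrates that idea on three made-up candidate responses. The response names, scores, threshold, and learning rate are all hypothetical, and the update rule is a generic textbook form, not necessarily the exact procedure used in the Safe RLHF-V paper.

```python
# Toy illustration of Lagrangian-relaxed constrained optimization.
# All names and numbers are hypothetical, chosen only for illustration.

# (reward, cost) pairs, as a reward model and a cost model might score them.
CANDIDATES = {
    "refuse": (0.1, 0.0),
    "careful": (0.7, 0.2),
    "unfiltered": (1.0, 0.9),
}


def shaped_score(reward: float, cost: float, lam: float) -> float:
    """Reward penalized by cost, weighted by the multiplier lam."""
    return reward - lam * cost


def best_response(candidates: dict, lam: float) -> str:
    """Pick the candidate with the highest penalized score for a given lam."""
    return max(candidates, key=lambda name: shaped_score(*candidates[name], lam))


def update_multiplier(lam: float, mean_cost: float, threshold: float, lr: float) -> float:
    """Dual ascent step: raise lam when cost exceeds the threshold, lower it otherwise.

    The multiplier is clipped at zero so the penalty never turns into a bonus.
    """
    return max(0.0, lam + lr * (mean_cost - threshold))


if __name__ == "__main__":
    for lam in (0.0, 1.0, 5.0):
        print(lam, best_response(CANDIDATES, lam))
```

With these toy numbers, a multiplier of 0 selects the unfiltered response, 1 selects the careful one, and 5 collapses to refusal. That last case is the "uselessly cautious" failure mode: the multiplier should only grow as far as the threshold requires, which is why it is adjusted against the threshold instead of being fixed by hand.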

The Helpfulness-Safety Tradeoff Is Real

The most important empirical finding in Safe RLHF-V is that the helpfulness-safety tradeoff is not a myth or an artifact of bad engineering. There exists a genuine Pareto frontier: beyond a certain safety level, further safety improvements necessarily degrade helpfulness. A model that refuses to discuss any topic that could conceivably be misused is safe but useless. A model that answers every question honestly is useful but unsafe.

The decoupled framework makes this tradeoff explicit and navigable. Different deployments can choose different operating points: a children's educational application operates deep in safe territory; a research assistant for chemistry professors operates closer to the helpful frontier. The key insight is that this is a policy decision, not an engineering decision, and the framework lets policymakers make it explicitly rather than having it baked implicitly into training.
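
Choosing an operating point can be sketched as a two-step procedure: discard every model variant that another variant beats on both axes, then take the most helpful survivor that fits a deployment's harm budget. The code below is a minimal illustration under stated assumptions. The `Policy` type, the variant names, and all scores are hypothetical and do not come from the Safe RLHF-V results.

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    name: str
    helpfulness: float  # higher is better
    harm: float         # lower is better


def pareto_front(policies: list) -> list:
    """Return the policies that no other policy dominates, sorted by harm."""
    front = []
    for p in policies:
        dominated = any(
            q.helpfulness >= p.helpfulness
            and q.harm <= p.harm
            and (q.helpfulness > p.helpfulness or q.harm < p.harm)
            for q in policies
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.harm)


def pick_operating_point(front: list, harm_budget: float) -> Optional[Policy]:
    """Most helpful policy whose harm stays within the budget, or None."""
    feasible = [p for p in front if p.harm <= harm_budget]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.helpfulness)


# Hypothetical model variants trained with different constraint thresholds.
VARIANTS = [
    Policy("A", 0.20, 0.01),
    Policy("B", 0.60, 0.05),
    Policy("C", 0.55, 0.10),  # dominated by B: less helpful and more harmful
    Policy("D", 0.90, 0.30),
    Policy("E", 0.95, 0.60),
]

if __name__ == "__main__":
    front = pareto_front(VARIANTS)
    print([p.name for p in front])
    print(pick_operating_point(front, harm_budget=0.05))  # strict deployment
    print(pick_operating_point(front, harm_budget=0.40))  # permissive deployment
```

The harm budget is the only input that changes between the strict and the permissive deployment, which is the sense in which the choice is a policy decision: the engineering produces the frontier, and someone else picks the point on it.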

The Sociotechnical Critique

Lindström et al. deliver a critique that the safety engineering community cannot dismiss. Their argument proceeds in three steps:

First, the "human" in RLHF is not a representative sample of humanity. Feedback labelers are typically English-speaking gig workers from specific cultural and economic contexts. Their preferences reflect their worldview—not a universal consensus on what constitutes helpful, harmless, and honest behavior. A response deemed "harmless" by an American labeler may be considered harmful in a different cultural context, and vice versa.

Second, the feedback mechanism itself is distortive. Labelers make pairwise comparisons between responses—but the comparison format forces binary choices between nuanced alternatives, compressing a multidimensional quality judgment into a single bit. Important aspects of quality (accuracy, completeness, cultural sensitivity) that labelers cannot easily articulate are systematically lost.

Third, alignment through feedback is fundamentally conservative. RLHF optimizes for the average preferences of the labeler pool, systematically suppressing minority viewpoints and unconventional perspectives. A model aligned to average American sensibilities may be actively misaligned with the values of users from different cultural, religious, or political traditions.

The conclusion is not that RLHF should be abandoned—but that it should be understood as a culturally situated technique that produces culturally situated models. Claims of "alignment with human values" should be understood as claims of alignment with specific humans' values, and the gap between these two claims matters profoundly for global deployment.
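
The "single bit" in the second step can be made concrete with the Bradley-Terry pairwise loss that RLHF reward models are commonly trained with. The sketch below is illustrative and is not taken from Lindström et al.; the function name and the example scores are assumptions. It shows that the loss depends only on which response was chosen and on the model's own score gap, so a labeler's reasons and strength of preference never enter the training signal.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(score_chosen - score_rejected)), computed in a
    numerically stable way.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The label records only the winner. Whether the labeler preferred it
    # slightly or strongly, or for accuracy versus tone, is not an input.
    print(bradley_terry_loss(0.0, 0.0))  # model is undecided
    print(bradley_terry_loss(2.0, 0.0))  # model agrees with the label
    print(bradley_terry_loss(0.0, 2.0))  # model disagrees with the label
```

Richer annotation formats (ratings per dimension, free-text rationales) exist, but they require a different loss; the pairwise format itself has nowhere to put that information.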

High-Confidence Safety Constraints

Chittepu et al. propose an alternative formulation that partially addresses the sociotechnical critique. Their High-Confidence Safe RLHF (HC-RLHF) replaces soft safety preferences with hard safety constraints—formal guarantees that certain categories of harmful output are blocked with high probability, regardless of the helpfulness reward.

The distinction is subtle but important. Standard safe RLHF treats safety as a preference to be balanced against helpfulness. HC-RLHF treats safety as a constraint that must hold with high probability: outputs in the prohibited categories are ruled out regardless of how helpful they might be. This is meant to eliminate the failure mode where a sufficiently "helpful" response can override safety considerations.

The tradeoff is that HC-RLHF requires explicit specification of what constitutes a safety violation—a specification task that reintroduces the cultural relativity problem Lindström et al. identify. Who decides which outputs are absolutely prohibited? The answer cannot be value-neutral.
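
To show what a "high probability" guarantee can look like in practice, the sketch below certifies a model only if an upper confidence bound on its average cost, and not just the observed average, stays under the threshold. It uses Hoeffding's inequality as a stand-in and assumes costs are independent samples bounded in [0, 1]; the statistical test in Chittepu et al. may differ, and the function names and numbers are hypothetical.

```python
import math


def hoeffding_upper_bound(costs: list, delta: float) -> float:
    """Upper bound on the true mean cost that holds with probability >= 1 - delta.

    Assumes each cost is an independent sample in [0, 1].
    """
    if not costs:
        raise ValueError("need at least one cost sample")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    n = len(costs)
    mean = sum(costs) / n
    return mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def passes_safety_test(costs: list, threshold: float, delta: float = 0.05) -> bool:
    """Certify only if the high-confidence bound is within the threshold."""
    return hoeffding_upper_bound(costs, delta) <= threshold


if __name__ == "__main__":
    # Hypothetical evaluation: 5 harmful outputs out of 100 sampled responses.
    small_sample = [1.0] * 5 + [0.0] * 95
    print(passes_safety_test(small_sample, threshold=0.10))
    # Same 5% rate, but measured on 10,000 responses.
    large_sample = small_sample * 100
    print(passes_safety_test(large_sample, threshold=0.10))
```

The small sample fails the test at a 10% threshold even though its observed harm rate is 5%, because 100 samples are too few to rule out a worse true rate; the larger sample passes. The threshold and the definition of a "harmful" output remain inputs to the test, which is where the specification problem described above re-enters.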

Claims and Evidence

Claim	Evidence	Verdict
Multimodal models face unique safety risks beyond text	Image-text compositional attacks documented	✅ Strongly supported
Decoupled helpfulness-safety optimization is superior to joint training	Safe RLHF-V shows improved Pareto frontier	✅ Supported
The helpfulness-safety tradeoff is fundamentally unavoidable	Empirical Pareto frontier confirmed across settings	✅ Supported
RLHF achieves culturally universal alignment	Lindström et al. demonstrate cultural specificity of preferences	❌ Refuted
Hard safety constraints are preferable to soft preferences	HC-RLHF provides formal guarantees but requires explicit specification	⚠️ Context-dependent

Open Questions

Multimodal adversarial attacks: The image-text attack surface is barely explored. As multimodal models are deployed in content moderation, healthcare, and education, what novel attack vectors will emerge?

Cultural pluralism in safety: Can we build models that are simultaneously safe across different cultural contexts? Or must we accept culture-specific alignment, with different model versions for different regions?

Dynamic safety: Safety standards evolve. Content considered acceptable in 2020 may be considered harmful in 2025, and vice versa. How do we build alignment systems that adapt to shifting societal norms?

The safety theater problem: If models become very good at appearing safe while remaining subtly manipulable, we create a false sense of security. How do we distinguish genuine safety from safety theater?

User consent and autonomy: At what point does safety alignment become paternalism? If a consenting adult requests information that is legal but potentially dangerous, should the model comply? The answer depends on values that reasonable people disagree about.

What This Means for Your Research

For AI safety researchers, Safe RLHF-V provides the most mature framework for multimodal alignment, but Lindström et al.'s critique demands intellectual honesty about its limitations. The field needs both better engineering (decoupled optimization, formal constraints) and better epistemology (understanding whose values are being encoded and what that implies for global deployment).

For practitioners deploying multimodal models, the practical takeaway is sobering: text-only safety alignment is insufficient for models that process images. The attack surface is larger, the failure modes are more varied, and the current solutions—while representing genuine progress—remain incomplete.

The fundamental lesson of 2025's multimodal safety research is that alignment is not a technical problem with a technical solution. It is a sociotechnical problem that requires ongoing negotiation between engineering capability, cultural values, and the diverse needs of a global user base. The researchers who acknowledge this complexity will contribute more to genuine safety than those who treat alignment as an optimization problem to be solved and shipped.

References

[1] Ji, J., Chen, X., Pan, R. et al. (2025). Safe RLHF-V: Safe Reinforcement Learning from Human Feedback in Multimodal Large Language Models. arXiv:2503.17682.

[3] Lindström, A., Methnani, L., Krause, L. et al. (2025). Helpful, harmless, honest? Sociotechnical limits of AI alignment through RLHF. Ethics and Information Technology.

[4] Chittepu, Y., Metevier, B., Schwarzer, W. et al. (2025). Reinforcement Learning from Human Feedback with High-Confidence Safety Constraints. arXiv:2506.08266.
