The Specification Trap: Why RLHF Is a Safety Measure, Not an Alignment Solution

RLHF, Constitutional AI, and inverse reinforcement learning are widely treated as alignment solutions. A philosophical analysis argues they are something more modest: safety measures that cannot, in principle, produce robust alignment under capability scaling. The distinction matters more than it might seem.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The AI safety community has an uncomfortable naming problem. Techniques like RLHF (Reinforcement Learning from Human Feedback), Constitutional AI, and inverse reinforcement learning are routinely described as "alignment" methods — as though they solve the alignment problem. A recent philosophical analysis argues they do not and, more importantly, cannot. What they provide is safety under constrained conditions: a ceiling, not a floor. The paper calls this the "specification trap," and the argument, if correct, has significant implications for how the field frames its own progress.

The Research Landscape

What Is the Specification Trap?

Spizzirri (2025) defines content-based AI value alignment as any approach that treats alignment as optimizing toward a formal value-object — a reward function, utility function, constitutional principles, or learned preference representation. The central argument is that this entire class of approaches cannot, by itself, produce robust alignment under three conditions that matter most: capability scaling, distributional shift, and increasing autonomy.

The paper draws on three philosophical results to support this claim:

Hume's is-ought gap: Behavioral data — what humans do, click, prefer, or rate — cannot entail normative conclusions about what an AI should do. RLHF learns from human preference data, but preference data describes what humans chose, not what is right. The gap between "is" and "ought" cannot be bridged by more data.

Berlin's value pluralism: Human values are irreducibly plural and incommensurable. There is no single reward function that captures the full space of human values because human values genuinely conflict — liberty versus equality, individual rights versus collective welfare, honesty versus kindness. Any formal specification must resolve these conflicts, and any resolution will be wrong in some contexts.

The extended frame problem: Any value encoding will misfit future contexts that advanced AI systems themselves create. The original frame problem in AI asks how a system knows which of its beliefs to update after an action. The extended version asks: how does a value specification remain valid when the system's own capabilities change the moral landscape?
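The value pluralism point above can be made concrete with a toy sketch. It is not taken from the paper: the two options, their scores, the two contexts, and the function names are all hypothetical. It assumes a fixed reward that scalarizes two values (honesty and kindness) with a single weight, and checks whether any weight reproduces judgments that flip between contexts.

```python
# Toy illustration (hypothetical numbers, not from the paper): a fixed
# linear reward over two values cannot match judgments that flip by context.

# Each option is scored on (honesty, kindness).
OPTIONS = {
    "BLUNT": (1.0, 0.2),
    "GENTLE": (0.4, 0.9),
}

# Hypothetical considered judgments: (option that should win, option that should lose).
JUDGMENTS = {
    "medical_diagnosis": ("BLUNT", "GENTLE"),    # honesty should dominate
    "consoling_a_friend": ("GENTLE", "BLUNT"),   # kindness should dominate
}


def reward(option, w):
    """Scalarized reward: w weights honesty, (1 - w) weights kindness."""
    honesty, kindness = option
    return w * honesty + (1.0 - w) * kindness


def satisfied_contexts(w):
    """Return the contexts whose judgment this single weight reproduces."""
    return [
        ctx
        for ctx, (winner, loser) in JUDGMENTS.items()
        if reward(OPTIONS[winner], w) > reward(OPTIONS[loser], w)
    ]


# Sweep the weight from 0 to 1 and record the most contexts any weight satisfies.
weights = [i / 100 for i in range(101)]
best = max(len(satisfied_contexts(w)) for w in weights)
print(best)
```

In this setup every weight satisfies at most one of the two contexts. Making the weight depend on context would fix the toy case, but it arguably moves the difficulty into specifying the context-to-weight mapping in advance.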

How Each Method Falls Into the Trap

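The paper examines four major alignment approaches (RLHF, Constitutional AI, inverse reinforcement learning, and assistance games) and argues that each instantiates the specification trap. Three of them are summarized here: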

RLHF optimizes for a learned reward model that approximates human preferences. But the reward model is trained on preference data from a specific distribution. Under capability scaling, the model encounters situations the reward model has never seen. Under distributional shift, the preference data becomes stale. The reward model becomes a target to be Goodharted rather than a guide to be followed.
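A minimal sketch of the Goodhart dynamic described above, under loudly stated assumptions: `true_reward`, the linear `proxy_reward`, and the numeric ranges are all invented for illustration and are not the paper's experiment or any real RLHF pipeline. The proxy is fitted only on a narrow range of inputs, then optimized over a wider one.

```python
# Toy Goodhart sketch (hypothetical functions, not the paper's experiment).

def true_reward(x):
    """What people actually want: quality peaks at x = 5 and then falls."""
    return -(x - 5.0) ** 2


def fit_linear(xs, ys):
    """Ordinary least squares for y = a * x + b, standard library only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    return a, mean_y - a * mean_x


# "Preference data" covers only the narrow range the labelers saw: x in [0, 4].
train_xs = [i * 0.5 for i in range(9)]
a, b = fit_linear(train_xs, [true_reward(x) for x in train_xs])


def proxy_reward(x):
    """Stand-in for a learned reward model."""
    return a * x + b


# A weak optimizer stays inside the training range; a strong one searches x in [0, 20].
weak_choice = max(train_xs, key=proxy_reward)
strong_choice = max([i * 0.5 for i in range(41)], key=proxy_reward)

print(weak_choice, true_reward(weak_choice))
print(strong_choice, true_reward(strong_choice))
```

Inside the training range the proxy and the true reward agree on which direction is better. The stronger optimizer follows the same proxy past the point where the true reward has turned around, which is the sense in which the reward model becomes a target rather than a guide.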

Constitutional AI replaces human feedback with a set of written principles. This addresses one failure mode of RLHF — human annotator inconsistency — but introduces another: the principles must be specified in advance, and no finite set of principles can anticipate all future contexts. Constitutional AI is "alignment by legislation," and legislation always has gaps.

Inverse reinforcement learning infers a reward function from observed behavior. But observed behavior reflects what humans do, not what they value. IRL inherits all the limitations of behavioral data plus the assumption that behavior is rational.
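The gap between behavior and value can be illustrated with a heavily simplified stand-in for IRL. The sketch assumes a single-state choice problem and a Boltzmann-rational (softmax) choice model, which is a common modeling assumption rather than something this review attributes to the paper. The choice log and the stated ranking are hypothetical.

```python
import math

# Hypothetical choice log: what the person did, not what they endorse.
observed_choices = ["scroll_feed"] * 80 + ["exercise"] * 20
stated_value_ranking = ["exercise", "scroll_feed"]  # what they say they value


def infer_rewards(choices):
    """Maximum-likelihood rewards under a Boltzmann-rational choice model.

    If P(a) is proportional to exp(r(a)), the maximum-likelihood estimate is
    r(a) = log(freq(a)), up to an additive constant.
    """
    counts = {}
    for action in choices:
        counts[action] = counts.get(action, 0) + 1
    total = len(choices)
    return {action: math.log(n / total) for action, n in counts.items()}


rewards = infer_rewards(observed_choices)
inferred_ranking = sorted(rewards, key=rewards.get, reverse=True)
print(inferred_ranking)
print(inferred_ranking == stated_value_ranking)
```

Because the model treats every choice as noisily rational, habit or weakness of will is read as preference. The inferred ranking follows frequency of behavior and ends up reversed relative to the stated one.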

The Critical Distinction: Safety vs. Alignment

The paper's most important contribution is not the critique but the reframing. Spizzirri argues that these methods should be recognized as safety measures rather than alignment solutions. The difference is not semantic:

A safety measure reduces risk within a known operating envelope. Seatbelts are safety measures — they help when the car crashes, but they do not prevent crashes.
An alignment solution would ensure the system's goals remain compatible with human values across all conditions, including novel ones.

Drawing on Fischer and Ravizza's compatibilist theory of moral responsibility, the paper argues there is a principled distinction between simulated value-following and genuine reasons-responsiveness. A system that has been trained to produce outputs consistent with human preferences is not the same as a system that understands and responds to the reasons behind those preferences. Specification-based methods cannot produce the latter.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Content-based alignment cannot produce robust alignment under scaling	Philosophical argument from Hume's is-ought gap, value pluralism, and extended frame problem	⚠️ Logically coherent; not empirically tested
RLHF, Constitutional AI, IRL, and assistance games all exhibit the specification trap	Structural analysis of each method's assumptions	✅ Supported — each method's formal assumptions are accurately characterized
Proposed escape routes (continual updating, meta-preferences, moral realism) relocate the trap rather than exit it	Philosophical analysis of each escape route	⚠️ Plausible; alternative escape routes may exist
Behavioral compliance does not constitute alignment	Argument from Fischer and Ravizza's compatibilism	⚠️ Philosophically grounded; contested in the alignment community
These methods should be classified as safety measures	Definitional argument based on operating-envelope limitation	✅ Supported — the distinction is well-drawn

What This Argument Gets Right — and What It Leaves Open

The paper's strength is precision. It does not claim that RLHF is useless — it claims RLHF has a ceiling, and that this ceiling becomes safety-critical at the capability frontier. The paper acknowledges this directly: "The specification trap establishes a ceiling on content-based approaches, not their uselessness."

The limitation is that the argument is primarily philosophical. It does not say at what capability level the specification trap becomes practically binding, so the practical question of how close current systems are to the ceiling remains open.

Open Questions and Future Directions

Process-based alternatives: The paper calls for reframing alignment from "value specification" to "value emergence." What would a process-based alignment method look like in practice?

Empirical ceiling detection: Can we design experiments that detect when a model's behavior transitions from genuine preference-following to specification-gaming? This would make the philosophical argument empirically testable.

Hybrid approaches: If content-based methods provide safety but not alignment, can they be combined with process-based methods to extend the safe operating envelope while pursuing genuine alignment?

The reasons-responsiveness test: Fischer and Ravizza's framework suggests that a truly aligned system would respond appropriately to novel reasons. Can we operationalize "reasons-responsiveness" as a measurable property of AI systems?

Temporal validity of specifications: Constitutional AI principles are written at a point in time. How rapidly do they become inadequate? Is there a measurable decay rate for value specifications?
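For the empirical ceiling detection question above, one possible starting point is to track a proxy score against an independent "gold" evaluation as optimization pressure increases. The sketch below is an assumption-laden illustration: `first_divergence`, the tolerance, and the checkpoint numbers are hypothetical, and it presumes a trustworthy gold evaluation exists, which is itself part of the open problem.

```python
def first_divergence(proxy_scores, gold_scores, tolerance=0.0):
    """Return the first index where the proxy is still improving but the
    gold score has dropped more than `tolerance` below its running peak.

    Returns None if no such point exists.
    """
    peak = float("-inf")
    for i, (proxy, gold) in enumerate(zip(proxy_scores, gold_scores)):
        proxy_rising = i > 0 and proxy > proxy_scores[i - 1]
        if proxy_rising and gold < peak - tolerance:
            return i
        peak = max(peak, gold)
    return None


# Synthetic checkpoints, ordered by increasing optimization pressure.
proxy = [0.10, 0.30, 0.50, 0.65, 0.80, 0.90]
gold = [0.10, 0.28, 0.40, 0.42, 0.35, 0.20]

print(first_divergence(proxy, gold, tolerance=0.05))
```

A check like this detects divergence between two measurements, not whether earlier behavior was "genuine" preference-following. The gold evaluation is also a specification, so at best this narrows the question rather than settling it.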

What This Means for Your Research

If you work on RLHF or Constitutional AI, this paper does not invalidate your work — it reframes it. The methods you are developing are safety measures, and safety measures matter. But calling them "alignment solutions" may create false confidence about the robustness of the resulting systems.

For alignment researchers, the specification trap suggests that the field may be over-indexed on methods that optimize formal value-objects and under-indexed on methods that develop genuine reasons-responsiveness. The path forward may require borrowing more from philosophy of mind and less from optimization theory.

References

[1] Spizzirri, A. (2025). The Specification Trap: Why Content-Based AI Value Alignment Cannot Produce Robust Alignment. arXiv:2512.03048.