Your Preferences Are Data: The Privacy Crisis in Reinforcement Learning from Human Feedback

When you tell an AI which response you prefer, you reveal your values, beliefs, and vulnerabilities. RLHF systems aggregate millions of such preference signals—creating a privacy risk that the alignment community has barely acknowledged. User-level differential privacy offers a path forward, but at a cost.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Every time a user indicates which AI response they prefer—by clicking a thumbs-up, choosing between alternatives, or simply continuing a conversation—they reveal something about themselves. Not just what they find helpful, but what they value, what they fear, what they believe, and what they're trying to accomplish. Aggregated across millions of interactions, these preference signals constitute a detailed map of human psychology—and they are the raw material from which RLHF alignment is built.

The alignment community has treated this data as a technical input: preference pairs that train reward models. With rare exceptions, it has not treated the data as what it also is: sensitive personal information about the people who generated it. Zhang et al. [1] are among the first to confront this oversight directly. They propose user-level differential privacy for RLHF, and their findings reveal a tension between alignment quality and privacy protection that the field must resolve.

The Privacy Threat Model

The privacy risks in RLHF are subtler than typical data privacy concerns. The threat is not that an attacker will steal a database of preference labels. It is that the trained model itself may memorize and reveal information about individual users' preferences.

Consider: a model trained via RLHF on feedback from a user who consistently prefers responses sympathetic to a particular political viewpoint may, through its behavior, reveal that user's political orientation. A model trained on feedback from a user seeking mental health support may encode patterns that reveal that user's psychological state. The model becomes an implicit database of its training feedback—and anyone with access to the model can potentially extract information about the individuals who shaped it.

This concern is not purely theoretical. Membership inference attacks on language models, which determine whether specific data was used in training, have already been demonstrated. Applying these techniques to RLHF preference data could reveal which users provided feedback and what their preferences were.
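To make the attack idea concrete, here is a minimal sketch of the simplest membership inference baseline, a loss-threshold test: examples the model fits unusually well are guessed to be training members. The function names, the threshold choice, and the loss values are all made up for illustration; this is not an attack described in any of the cited papers.

```python
from statistics import mean

def loss_threshold_attack(losses, threshold):
    """Guess 'member' (True) for every example whose loss is below the threshold."""
    return [loss < threshold for loss in losses]

def attack_accuracy(member_losses, nonmember_losses, threshold):
    """Fraction of correct guesses over the member and non-member sets combined."""
    member_guesses = loss_threshold_attack(member_losses, threshold)
    nonmember_guesses = loss_threshold_attack(nonmember_losses, threshold)
    correct = sum(member_guesses) + sum(not g for g in nonmember_guesses)
    return correct / (len(member_losses) + len(nonmember_losses))

# Hypothetical per-pair reward-model losses (illustrative numbers, not measurements).
member_losses = [0.05, 0.10, 0.20, 0.60]      # pairs that were in the training set
nonmember_losses = [0.40, 0.70, 0.90, 1.10]   # pairs the model never saw

# Assumption: the attacker calibrates the threshold as the mean of all observed losses.
threshold = mean(member_losses + nonmember_losses)
accuracy = attack_accuracy(member_losses, nonmember_losses, threshold)
```

Any accuracy meaningfully above 0.5 on a balanced set means the model's behavior leaks membership. Real attacks use stronger signals than a single threshold, but the principle is the same.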

User-Level Differential Privacy

Zhang et al.'s solution applies differential privacy at the user level—not just the example level. The distinction matters enormously. Example-level differential privacy protects individual preference pairs; user-level differential privacy protects the entire contribution of each user.
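The distinction can be stated precisely with the standard textbook definition of differential privacy, where the only thing that changes between the two levels is the notion of "neighboring" datasets. The symbols below (the mechanism M, the datasets D and D', and the slack term δ) are notation introduced here for illustration; the post itself only mentions ε, and the exact variant used by Zhang et al. should be checked against the paper.

```latex
% (\varepsilon, \delta)-differential privacy for a randomized training mechanism \mathcal{M}:
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all measurable sets } S .
\]
% Example-level neighbors: D and D' differ in ONE preference pair.
% User-level neighbors:    D and D' differ in ALL preference pairs contributed by one user.
```

Because a single user may contribute many pairs, the user-level condition is the stronger of the two for the same ε.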

The mechanism works by adding calibrated noise to the gradient updates during reward model training. For the noise to be calibrated, the influence any one user can have on an update must be bounded. The result is that the trained model's behavior would be essentially unchanged whether or not any single user's entire preference history were included. The privacy guarantee is formal: an adversary with access to the model cannot determine, beyond a confidence level bounded by the privacy parameter ε, whether any specific user contributed to training.
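A minimal sketch of one such noisy update follows, assuming the generic recipe used in user-level DP training: collapse each user's pairs into one gradient, clip it, sum, add Gaussian noise, and average. This is not Zhang et al.'s algorithm. The names `clip`, `private_mean_gradient`, `max_norm`, and `noise_multiplier` are hypothetical, and translating `noise_multiplier` into a concrete (ε, δ) requires a privacy accountant that is omitted here.

```python
import math
import random

def clip(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def private_mean_gradient(grads_by_user, max_norm, noise_multiplier, rng):
    """Average one clipped gradient per user, with Gaussian noise added to the sum.

    grads_by_user maps a user id to a list of per-pair gradient vectors.
    """
    users = list(grads_by_user)
    dim = len(grads_by_user[users[0]][0])
    total = [0.0] * dim
    for user in users:
        pair_grads = grads_by_user[user]
        # Collapse ALL of this user's pairs into a single vector before clipping,
        # so the bound applies to the user, not to each individual pair.
        user_grad = [sum(g[i] for g in pair_grads) / len(pair_grads) for i in range(dim)]
        clipped = clip(user_grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Adding or removing one user changes the sum by at most max_norm,
    # so the noise standard deviation is expressed as a multiple of that bound.
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(users) for x in noisy]

# Usage with made-up gradients: one user with a large gradient, one with two small ones.
rng = random.Random(0)
grads = {"user_a": [[3.0, 4.0]], "user_b": [[0.3, 0.4], [0.3, 0.4]]}
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Clipping per user rather than per pair is the design choice that makes the guarantee user-level: a prolific user cannot exceed the bound by contributing more pairs.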

The cost is performance. Differential privacy works by injecting noise, and that noise degrades model quality. Zhang et al. quantify this tradeoff, demonstrating that stronger privacy protection (a smaller privacy budget ε) leads to proportionally greater accuracy degradation in the reward model. The finding is consistent with the well-established privacy-utility tradeoff in the differential privacy literature.
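One way to see why a smaller ε costs more is the classical Gaussian mechanism, where the required noise scale is inversely proportional to ε. The sketch below uses the textbook calibration for a single release, which is valid only for 0 < ε < 1. The function name and the parameter values are illustrative, and a full training run composes many noisy steps, so this is not how an end-to-end ε would be computed in practice.

```python
import math

def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise standard deviation for the classical Gaussian mechanism.

    Uses sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon,
    which holds for a single release with 0 < epsilon < 1.
    """
    if not 0 < epsilon < 1:
        raise ValueError("this calibration is only valid for 0 < epsilon < 1")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

# Assumed values for illustration: delta = 1e-5 and unit sensitivity.
sigmas = {eps: gaussian_sigma(eps, delta=1e-5, sensitivity=1.0) for eps in (0.8, 0.4, 0.2)}
```

Halving ε doubles the noise scale. How much reward-model accuracy that extra noise costs is an empirical question, which is what the paper measures.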

This degradation is not uniformly distributed. The model loses fine-grained sensitivity to subtle preference distinctions while maintaining coarse-grained alignment. For most applications, this means the model remains helpful and safe but becomes less capable of capturing nuanced user preferences—arguably an acceptable tradeoff, but one that alignment researchers have not yet reckoned with.

The Crowd-Sourcing Dimension

Wong & Tan [3] examine RLHF from the crowd-sourcing perspective, focusing on how diverse, large-scale human feedback can be efficiently aggregated for code generation alignment. Their approach integrates feedback from thousands of developers with varying expertise and preferences, raising questions about how individual contributions should be weighted and protected.

Their key finding: not all feedback is equally informative. Expert developers provide preference signals that are more consistent and more predictive of code quality than novice developers. But weighting expert feedback more heavily concentrates influence—and potentially privacy exposure—in a smaller group, making those individuals more identifiable.

The tension between feedback quality (weight expert opinions more) and privacy (protect all contributors equally) has no clean resolution. It requires explicit policy decisions about whose preferences matter more and what privacy guarantees each contributor deserves.
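The concentration effect can be illustrated with a toy weighted vote. This is a sketch under assumed weights, not Wong & Tan's aggregation scheme; the function names, the user ids, and the 5-to-1 expert weighting are all hypothetical.

```python
def weighted_preference(votes, weights):
    """Aggregate votes (+1 prefers response A, -1 prefers B) into a score in [-1, 1]."""
    total = sum(weights[user] for user in votes)
    return sum(weights[user] * votes[user] for user in votes) / total

def max_influence(votes, weights):
    """Largest change in the aggregate score any one user can cause by flipping a vote."""
    total = sum(weights[user] for user in votes)
    return max(2 * weights[user] / total for user in votes)

# Made-up example: one expert and five novices all vote for response A.
votes = {"expert": 1, "n1": 1, "n2": 1, "n3": 1, "n4": 1, "n5": 1}
uniform = {user: 1.0 for user in votes}
expert_heavy = dict(uniform, expert=5.0)  # assumed 5x weight for the expert

uniform_influence = max_influence(votes, uniform)        # 2/6 of the score range
expert_influence = max_influence(votes, expert_heavy)    # 10/10 of the score range
```

Under a differential privacy mechanism, noise has to cover the largest single-user influence. Upweighting experts therefore either demands more noise for everyone or leaves the experts with a weaker effective guarantee.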

Claims and Evidence

Claim	Evidence	Verdict
RLHF preference data reveals personal information	Membership inference attacks demonstrated on LMs; applies to RLHF	✅ Supported
User-level DP can protect RLHF contributors	Zhang et al. demonstrate formal guarantees	✅ Supported
Privacy protection degrades alignment quality	Measurable accuracy loss that increases with stronger privacy guarantees	✅ Supported
Current RLHF systems provide meaningful privacy protection	No major RLHF deployment is publicly known to implement differential privacy	❌ Not supported
Expert feedback is more valuable than novice feedback	Wong & Tan show expertise predicts feedback quality	✅ Supported

Open Questions

Regulatory compliance: Does RLHF feedback constitute "personal data" under GDPR, CCPA, or similar regulations? If so, current RLHF practices may already be non-compliant. The legal question remains largely open.

Consent and disclosure: Do users who provide preference feedback understand that they are contributing to a training dataset? Is clicking "thumbs up" informed consent for inclusion in alignment training?

The right to be forgotten: GDPR gives individuals a right to erasure of their personal data (Article 17), subject to certain exceptions. Can a user's preference contribution be removed from a trained reward model? Model unlearning for RLHF remains an open technical problem.

Federated RLHF: Can we train reward models without centralizing preference data? Federated learning approaches would keep each user's preferences on their device while still contributing to alignment—but the communication and coordination costs are substantial.

Privacy-alignment Pareto frontier: What is the optimal tradeoff between privacy protection and alignment quality? The answer likely depends on the deployment context—medical AI may require stronger privacy than entertainment applications.
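On the federated RLHF question above, one training round might look like the following sketch: each user computes a reward-model gradient on their own preference pairs, and the server only ever sees and averages those gradients. The linear Bradley-Terry reward model, the function names, and the feature vectors are assumptions chosen to keep the example small; none of them come from the cited papers.

```python
import math

def local_gradient(weights, pairs):
    """Mean gradient of -log sigmoid(w . (chosen - rejected)) over one user's pairs.

    Runs on the user's device; the raw (chosen, rejected) pairs never leave it.
    """
    grad = [0.0] * len(weights)
    for chosen, rejected in pairs:
        diff = [c - r for c, r in zip(chosen, rejected)]
        margin = sum(w * d for w, d in zip(weights, diff))
        coeff = -1.0 / (1.0 + math.exp(margin))  # derivative of -log sigmoid(margin)
        grad = [g + coeff * d for g, d in zip(grad, diff)]
    return [g / len(pairs) for g in grad]

def federated_round(weights, pairs_by_user, learning_rate):
    """Server step: average the per-user gradients and take one descent step."""
    grads = [local_gradient(weights, pairs) for pairs in pairs_by_user.values()]
    avg = [sum(g[i] for g in grads) / len(grads) for i in range(len(weights))]
    return [w - learning_rate * a for w, a in zip(weights, avg)]

# Made-up feature vectors for two users, one preference pair each.
pairs_by_user = {
    "user_a": [([1.0, 0.0], [0.0, 0.0])],
    "user_b": [([0.0, 1.0], [0.0, 0.0])],
}
new_weights = federated_round([0.0, 0.0], pairs_by_user, learning_rate=1.0)
```

Keeping the pairs on-device is not a privacy guarantee by itself, because gradients can still leak information about the data behind them. Federated training is usually combined with per-user clipping and noise of the kind described in the differential privacy section.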

What This Means for Your Research

For alignment researchers, privacy is no longer a concern you can defer to the deployment team. The choice of privacy mechanism affects the quality of alignment achievable—stronger privacy means coarser alignment. This tradeoff should be acknowledged and studied, not ignored.

For privacy researchers, RLHF represents a novel and consequential application domain. The preference data is high-dimensional, deeply personal, and generated in a context (interaction with an AI system) where users' expectations of privacy may differ from traditional data collection contexts.

For organizations deploying RLHF-trained models, the question is immediate: are you protecting the privacy of the humans whose preferences shaped your model? If not, you may be one regulatory inquiry or data breach away from a crisis that no amount of alignment research can remedy.

The uncomfortable truth: we have built an alignment paradigm that requires intimate knowledge of human preferences but provides no mechanism for protecting the humans who reveal those preferences. The technical solutions exist. The question is whether the industry has the will to implement them before external pressure forces the issue.

References

[1] Zhang, J., Lei, M., Ding, M. et al. (2025). Towards User-level Private Reinforcement Learning with Human Feedback. arXiv:2502.17515.

[2] Kleine Buening, T., Gan, J., Mandal, D. et al. (2025). Strategyproof Reinforcement Learning from Human Feedback. arXiv:2503.09561.

[3] Wong, M. & Tan, C. (2025). Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by LLMs. IEEE TBDATA.
