Law & Policy

AI Training Data and Copyright: The Input Side of the Generative AI Legal Crisis

Generative AI models are trained on vast quantities of copyrighted material collected through web scraping. Whether this constitutes infringement depends on which jurisdiction you ask, and on legal doctrines (fair use, TDM exceptions) that were designed for a pre-generative world. Five papers map the legal landscape and its fractures.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Every large language model, every image generator, every music composition AI was trained on data that someone created. Much of that data is copyrighted. The legal question of whether using copyrighted works as AI training data constitutes infringement remains genuinely unsettled across every major jurisdiction. This is not a gap waiting to be filled by an obvious answer; it is a genuine conflict between two legitimate legal principles: the right of creators to control how their works are used, and the interest of society in fostering technological innovation.

The cases currently winding through the courts (New York Times v. OpenAI, Getty Images v. Stability AI, Authors Guild v. Meta) will produce precedents. But the scholarly literature suggests that no single case will resolve the underlying tension, because the legal frameworks being applied (fair use, text and data mining exceptions, the three-step test) were designed for a different technological reality.

The Jurisdictional Fracture

Dornis and Stober (2025) provide an interdisciplinary analysis that combines legal scholarship with a technical understanding of how generative AI models actually use training data. Their paper examines the two dominant legal frameworks, the US "fair use" doctrine and the EU "text and data mining" (TDM) exception, and argues that neither applies as straightforwardly as commonly assumed.

In the United States, AI developers rely on "fair use," which considers four factors: purpose and character of use, nature of the copyrighted work, amount used, and market effect. AI training arguably transforms the work (favoring fair use) but may substantially replicate it in outputs (disfavoring fair use). Fair use analysis is inherently unpredictable: each case requires fact-specific analysis, and reasonable courts can reach opposite conclusions on the same facts.

In the European Union, the prevailing view is that the DSM Directive's TDM exception (Articles 3 and 4) applies to AI training. However, Dornis and Stober challenge this prevailing view, arguing that generative AI training fundamentally differs from TDM as traditionally understood. Their analysis suggests that the TDM exception may not cover the kind of large-scale pattern extraction that generative models perform. They also discuss how training data memorization, where models reproduce substantial portions of training data in outputs, creates copyright issues independently of both the fair use doctrine and the TDM exception.

The Three-Step Test Under Pressure

Thongmeensuk (2024) provides what has become an influential analysis of how existing copyright exceptions interact with generative AI's data requirements. The paper examines how TDM practices challenge the Berne Convention's three-step test, the international standard that limits copyright exceptions to:

  • Certain special cases (the exception must be narrowly defined)
  • Not conflicting with normal exploitation (the exception must not substitute for the market for the work)
  • Not unreasonably prejudicing the legitimate interests of the rightsholder

The paper argues that generative AI creates multifaceted legal challenges at the intersection of data utilization and copyright law. The inherent reliance of AI on large quantities of data, often encompassing copyrighted materials, tests each prong of the three-step test in novel ways. When an AI system trained on millions of copyrighted images can generate new images that compete with the originals in the same markets, the second prong (non-conflict with normal exploitation) becomes particularly strained.

Beyond Fair Use and Opt-Out

Woo (2025) advances what is perhaps the most theoretically ambitious argument in this cohort: that generative AI represents the "de facto end of the Berne Convention era." The paper argues that existing copyright doctrines (fair use, TDM exceptions, the three-step test) are not merely inadequate patches on a basically sound framework but symptoms of a fundamental mismatch between the assumptions of international copyright law and the reality of generative AI.

The Berne Convention assumes that copying is detectable, attributable, and discrete: that you can identify when a work has been copied, who copied it, and what was copied. Generative AI violates all three assumptions. Training is a statistical process that extracts patterns from millions of works simultaneously, making attribution to any single source technically challenging. The "copies" that exist in model weights are not copies in any traditional sense; they are compressed statistical representations that may or may not be recoverable as recognizable reproductions.

Woo argues that measures currently under discussion (TDM exceptions, fair use, opt-out mechanisms) are palliative at best. What is needed is a fundamental shift in the public paradigm of copyright: from exclusive rights over copies to equitable participation in the value generated from data.

Pasetti et al. (2025) address the technical, legal, and ethical dimensions of AI training data governance simultaneously. Their contribution lies in bridging the gap between what computer scientists understand about model training and what legal scholars assume about it.

The technical reality is important for legal analysis: AI training does not "store" copyrighted works in the traditional sense. The training process compresses billions of data points into model parameters through gradient descent, creating a statistical representation that is neither a copy (in the legal sense) nor independent of the originals (in the practical sense). This intermediate status, not-a-copy-but-not-independent, is precisely what existing copyright frameworks are not equipped to handle.
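A toy sketch can make this intermediate status concrete. The word-bigram "model" below is a hypothetical illustration, far simpler than any real generative system: its parameters are transition counts, not stored text, yet greedy generation from those statistics reproduces the source sentence exactly. That is the memorization problem in miniature.

```python
from collections import defaultdict

# Toy illustration (not a real training pipeline): a word-bigram "model"
# whose parameters are transition statistics rather than stored text.
text = "generative models compress training data into statistical parameters"
words = text.split()

# "Training": count each word-to-next-word transition.
transitions = defaultdict(lambda: defaultdict(int))
for cur, nxt in zip(words, words[1:]):
    transitions[cur][nxt] += 1

# The parameters contain no contiguous copy of the source text...
params = {w: dict(nxts) for w, nxts in transitions.items()}
assert text not in str(params)

# ...yet greedy generation from the learned statistics reconstructs it.
out = [words[0]]
for _ in range(len(words) - 1):
    nxts = transitions[out[-1]]
    out.append(max(nxts, key=nxts.get))

print(" ".join(out) == text)  # True: reproduced despite no stored copy
```

The sentence is deliberately short and repetition-free, so one pass through the statistics regenerates it verbatim; at the scale of real models, the same dynamic surfaces only for data seen often enough during training.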

Cross-Jurisdictional Divergence

Riaz (2026) provides a systematic comparative analysis across the UK, EU, and US, using doctrinal methodology to analyze statutes, case law, and regulatory proposals. The analysis reveals that jurisdictional divergence is increasing rather than converging:

  • The UK initially proposed a broad TDM exception for commercial use but withdrew it after creator backlash, leaving the legal position uncertain.
  • The EU has its opt-out framework but faces enforcement challenges: how do rightsholders monitor whether their opt-out declarations are being respected?
  • The US relies on case-by-case fair use adjudication, with pending cases that could establish divergent precedents depending on whether courts emphasize transformation (favoring AI developers) or market substitution (favoring creators).

The practical consequence of divergence is regulatory arbitrage: AI companies can train models in jurisdictions with permissive frameworks and deploy them globally. This possibility limits the effectiveness of any single jurisdiction's regulatory choices and creates pressure for international harmonization, which the Berne Convention's existing machinery is not designed to provide.
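To make the enforcement question concrete, here is a minimal sketch of the kind of opt-out signal AI crawlers are expected to honor today: a robots.txt rule targeting a named AI user agent (GPTBot is OpenAI's documented crawler; the domain and the "SearchBot" agent are hypothetical). Whether such voluntary, crawler-specific signals satisfy the DSM Directive's requirement of a machine-readable rights reservation is precisely the contested point.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of one AI crawler while allowing others.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The named AI crawler is blocked; everything else may fetch.
print(rp.can_fetch("GPTBot", "https://example.com/article.html"))    # False
print(rp.can_fetch("SearchBot", "https://example.com/article.html"))  # True
```

Note what the sketch cannot do: it expresses the rightsholder's declaration but provides no way to verify, after the fact, whether a given model's training set actually respected it, which is the enforcement gap Riaz (2026) identifies.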

Claims and Evidence

  • Claim: AI training constitutes fair use under US law. Evidence: Dornis & Stober (2025) find the analysis fact-specific and inherently unpredictable, with reasonable disagreement possible. Verdict: ⚠️ Uncertain (pending litigation).
  • Claim: The EU TDM opt-out mechanism adequately protects creators. Evidence: Thongmeensuk (2024) and Riaz (2026) identify enforcement challenges and power asymmetries. Verdict: ⚠️ Uncertain.
  • Claim: Existing copyright frameworks can accommodate generative AI. Evidence: Woo (2025) argues a fundamental mismatch with Berne Convention assumptions. Verdict: ❌ Refuted (as currently configured).
  • Claim: Technical understanding of AI training changes the legal analysis. Evidence: Pasetti et al. (2025) show model weights are neither copies nor independent creations. Verdict: ✅ Supported.
  • Claim: Jurisdictional harmonization on AI training data is emerging. Evidence: Riaz (2026) finds divergence increasing across the UK, EU, and US. Verdict: ❌ Refuted.

Open Questions

  • Will the pending US cases establish a clear precedent, or will they fragment the analysis further? NYT v. OpenAI focuses on memorization and market substitution; Authors Guild v. Meta focuses on transformative use. Different facts may produce different doctrinal outcomes.
  • Can technical measures substitute for legal solutions? Content provenance standards (C2PA), training data provenance tracking, and output watermarking offer technical infrastructure for accountability. But their effectiveness depends on universal adoption, which is voluntary.
  • Should AI training compensation be collective or individual? Collective licensing (analogous to music performing rights organizations) could provide scalable compensation. But who would represent the interests of the millions of creators whose works are used as training data?
  • What happens to works that are not opted out? Under the EU framework, works without an explicit opt-out declaration are available for TDM. Does this create a default that disadvantages individual creators who lack the technical knowledge or resources to opt out?
  • Is the distinction between input (training) and output (generation) legally coherent? Current analyses treat training and generation as separate legal events. But from a technical perspective, the output is a function of the input; separating them may be analytically convenient but practically misleading.
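As a sketch of what training-data provenance tracking could look like, the hypothetical registry below records a SHA-256 fingerprint of each ingested work so a later audit can test whether a specific work entered a training set. The function names and registry format are invented for illustration; real standards such as C2PA define richer, cryptographically signed manifests, and exact-hash matching cannot detect near-duplicates or transformed copies.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash used as a stable identifier for a work."""
    return hashlib.sha256(data).hexdigest()

registry = []  # append-only provenance log (hypothetical format)

def ingest(work_id: str, data: bytes) -> None:
    """Record that a work was added to the training corpus."""
    registry.append({"work_id": work_id, "sha256": fingerprint(data)})

def was_ingested(data: bytes) -> bool:
    """Audit check: did this exact content enter the corpus?"""
    h = fingerprint(data)
    return any(entry["sha256"] == h for entry in registry)

ingest("article-001", b"Some copyrighted text.")
print(was_ingested(b"Some copyrighted text."))  # True
print(was_ingested(b"A different text."))       # False
```

Even this trivial mechanism only works if ingestion pipelines actually write to the log, which is why the open question above turns on universal, and currently voluntary, adoption.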
Implications

The legal status of AI training data will determine how the economic value generated by generative AI is distributed between AI companies and content creators. If training is broadly permissible (under fair use or TDM exceptions), the value flows to AI developers and their users. If training requires licensing, the value is shared, but the transaction costs of licensing millions of works may be prohibitive without collective mechanisms.

The research reviewed here suggests that the current legal frameworks, designed for a world of identifiable copies, discrete uses, and national jurisdictions, are not adequate for a technology that compresses millions of works into statistical representations, deploys them globally, and generates outputs that blur the line between derivation and creation. What is needed is not marginal reform but conceptual innovation: new legal categories that account for the technical reality of AI training and the economic reality of generative AI markets.

References (5)

[1] Thongmeensuk, S. (2024). Rethinking Copyright Exceptions in the Era of Generative AI: Balancing Innovation and Intellectual Property Protection. Journal of World Intellectual Property, 27(4).
[2] Dornis, T.W. & Stober, S. (2025). Generative AI Training and Copyright Law. arXiv:2502.15858.
[3] Pasetti, M., Santos, J.W., Corrêa, N., de Oliveira, N., & Barbosa, C. (2025). Technical, Legal, and Ethical Challenges of Generative AI: An Analysis of the Governance of Training Data and Copyrights. Discover Artificial Intelligence, 5, 379.
[4] Riaz, C.H. (2026). The Legal Status of AI Training Data: A Cross-Jurisdictional Analysis of Copyright, Fair Use, and Text-and-Data Mining. International Journal of Science and Research Archive, 18(1), 166.
[5] Woo, M. (2025). Generative AI and Copyright Law: The De Facto End of the Berne Convention Era and the Need for a Shift in the Public Paradigm. Korean Digital Property Studies, 38(3), 41.
