Law & Policy

AI Training Data and Copyright: The Input Side of the Generative AI Legal Crisis

Generative AI models are trained on vast quantities of copyrighted material collected through web scraping. Whether this constitutes infringement depends on which jurisdiction you ask, and on legal doctrines (fair use, TDM exceptions) that were designed for a pre-generative world. Five papers map the legal landscape and its fractures.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Every large language model, every image generator, every music composition AI was trained on data that someone created. Much of that data is copyrighted. The legal question of whether using copyrighted works as AI training data constitutes infringement remains genuinely unsettled across every major jurisdiction. This is not a gap waiting to be filled by an obvious answer; it is a genuine conflict between two legitimate legal principles: the right of creators to control how their works are used, and the interest of society in fostering technological innovation.

The cases currently winding through the courts (New York Times v. OpenAI, Getty Images v. Stability AI, Authors Guild v. Meta) will produce precedents. But the scholarly literature suggests that no single case will resolve the underlying tension, because the legal frameworks being applied (fair use, text and data mining exceptions, the three-step test) were designed for a different technological reality.

The Jurisdictional Fracture

Dornis and Stober (2025) provide an interdisciplinary analysis that combines legal scholarship with a technical understanding of how generative AI models actually use training data. Their paper examines the two dominant legal frameworks, the US "fair use" doctrine and the EU "text and data mining" (TDM) exception, and argues that neither applies as straightforwardly as commonly assumed.

In the United States, AI developers rely on "fair use," which considers four factors: purpose and character of use, nature of the copyrighted work, amount used, and market effect. AI training arguably transforms the work (favoring fair use) but may substantially replicate it in outputs (disfavoring fair use). Fair use analysis is inherently unpredictable: each case requires fact-specific analysis, and reasonable courts can reach opposite conclusions on the same facts.

In the European Union, the prevailing view is that the DSM Directive's TDM exception (Articles 3 and 4) applies to AI training. However, Dornis and Stober challenge this prevailing view, arguing that generative AI training fundamentally differs from TDM as traditionally understood. Their analysis suggests that the TDM exception may not cover the kind of large-scale pattern extraction that generative models perform. They also discuss how training data memorization, where models reproduce substantial portions of training data in outputs, creates copyright issues independently of both the fair use doctrine and the TDM exception.

The Three-Step Test Under Pressure

Thongmeensuk (2024) provides what has become an influential analysis of how existing copyright exceptions interact with generative AI's data requirements. The paper examines how TDM practices challenge the Berne Convention's three-step test, the international standard that limits copyright exceptions to:

  • Certain special cases (the exception must be narrowly defined)
  • Not conflicting with normal exploitation (the exception must not substitute for the market for the work)
  • Not unreasonably prejudicing the legitimate interests of the rightsholder

The paper argues that generative AI creates multifaceted legal challenges at the intersection of data utilization and copyright law. The inherent reliance of AI on large quantities of data, often encompassing copyrighted materials, tests each prong of the three-step test in novel ways. When an AI system trained on millions of copyrighted images can generate new images that compete with the originals in the same markets, the second prong (non-conflict with normal exploitation) becomes particularly strained.

Beyond Fair Use and Opt-Out

Woo (2025) advances what is perhaps the most theoretically ambitious argument in this cohort: that generative AI represents the "de facto end of the Berne Convention era." The paper argues that existing copyright doctrines (fair use, TDM exceptions, the three-step test) are not merely inadequate patches on a basically sound framework but symptoms of a fundamental mismatch between the assumptions of international copyright law and the reality of generative AI.

The Berne Convention assumes that copying is detectable, attributable, and discrete: that you can identify when a work has been copied, who copied it, and what was copied. Generative AI violates all three assumptions. Training is a statistical process that extracts patterns from millions of works simultaneously, making attribution to any single source technically challenging. The "copies" that exist in model weights are not copies in any traditional sense; they are compressed statistical representations that may or may not be recoverable as recognizable reproductions.

Woo argues that measures currently under discussion (TDM exceptions, fair use, opt-out mechanisms) are palliative at best. What is needed is a fundamental shift in the public paradigm of copyright: from exclusive rights over copies to equitable participation in the value generated from data.

Pasetti et al. (2025) address the technical, legal, and ethical dimensions of AI training data governance simultaneously. Their contribution lies in bridging the gap between what computer scientists understand about model training and what legal scholars assume about it.

The technical reality is important for legal analysis: AI training does not "store" copyrighted works in the traditional sense. The training process compresses billions of data points into model parameters through gradient descent, creating a statistical representation that is neither a copy (in the legal sense) nor independent of the originals (in the practical sense). This intermediate status, not-a-copy-but-not-independent, is precisely what existing copyright frameworks are not equipped to handle.
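A toy sketch can make this intermediate status concrete. The word-bigram "model" below is a hypothetical illustration, far simpler than any real generative system: its parameters are transition counts, not stored text, yet greedy generation from those statistics reproduces the source sentence exactly. That is the memorization problem in miniature.

```python
from collections import defaultdict

# Toy illustration (not a real training pipeline): a word-bigram "model"
# whose parameters are transition statistics rather than stored text.
text = "generative models compress training data into statistical parameters"
words = text.split()

# "Training": count each word-to-next-word transition.
transitions = defaultdict(lambda: defaultdict(int))
for cur, nxt in zip(words, words[1:]):
    transitions[cur][nxt] += 1

# The parameters contain no contiguous copy of the source text...
params = {w: dict(nxts) for w, nxts in transitions.items()}
assert text not in str(params)

# ...yet greedy generation from the learned statistics reconstructs it.
out = [words[0]]
for _ in range(len(words) - 1):
    nxts = transitions[out[-1]]
    out.append(max(nxts, key=nxts.get))

print(" ".join(out) == text)  # True: reproduced despite no stored copy
```

The sentence is deliberately short and repetition-free, so one pass through the statistics regenerates it verbatim; at the scale of real models, the same dynamic surfaces only for data seen often enough during training.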

Cross-Jurisdictional Divergence

Riaz (2026) provides a systematic comparative analysis across the UK, EU, and US, using doctrinal methodology to analyze statutes, case law, and regulatory proposals. The analysis reveals that jurisdictional divergence is increasing rather than converging:

  • The UK initially proposed a broad TDM exception for commercial use but withdrew it after creator backlash, leaving the legal position uncertain.
  • The EU has its opt-out framework but faces enforcement challenges: how do rightsholders monitor whether their opt-out declarations are being respected?
  • The US relies on case-by-case fair use adjudication, with pending cases that could establish divergent precedents depending on whether courts emphasize transformation (favoring AI developers) or market substitution (favoring creators).

The practical consequence of divergence is regulatory arbitrage: AI companies can train models in jurisdictions with permissive frameworks and deploy them globally. This possibility limits the effectiveness of any single jurisdiction's regulatory choices and creates pressure for international harmonization, which the Berne Convention's existing machinery is not designed to provide.
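To make the enforcement question concrete, here is a minimal sketch of the kind of opt-out signal AI crawlers are expected to honor today: a robots.txt rule targeting a named AI user agent (GPTBot is OpenAI's documented crawler; the domain and the "SearchBot" agent are hypothetical). Whether such voluntary, crawler-specific signals satisfy the DSM Directive's requirement of a machine-readable rights reservation is precisely the contested point.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that opts out of one AI crawler while allowing others.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The named AI crawler is blocked; everything else may fetch.
print(rp.can_fetch("GPTBot", "https://example.com/article.html"))    # False
print(rp.can_fetch("SearchBot", "https://example.com/article.html"))  # True
```

Note what the sketch cannot do: it expresses the rightsholder's declaration but provides no way to verify, after the fact, whether a given model's training set actually respected it, which is the enforcement gap Riaz (2026) identifies.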

Claims and Evidence

  • Claim: AI training constitutes fair use under US law. Evidence: Dornis & Stober (2025) find the analysis fact-specific and inherently unpredictable, with reasonable disagreement possible. Verdict: ⚠️ Uncertain (pending litigation).
  • Claim: The EU TDM opt-out mechanism adequately protects creators. Evidence: Thongmeensuk (2024) and Riaz (2026) identify enforcement challenges and power asymmetries. Verdict: ⚠️ Uncertain.
  • Claim: Existing copyright frameworks can accommodate generative AI. Evidence: Woo (2025) argues a fundamental mismatch with Berne Convention assumptions. Verdict: ❌ Refuted (as currently configured).
  • Claim: Technical understanding of AI training changes the legal analysis. Evidence: Pasetti et al. (2025) show model weights are neither copies nor independent creations. Verdict: ✅ Supported.
  • Claim: Jurisdictional harmonization on AI training data is emerging. Evidence: Riaz (2026) finds divergence increasing across the UK, EU, and US. Verdict: ❌ Refuted.

Open Questions

  • Will the pending US cases establish a clear precedent, or will they fragment the analysis further? NYT v. OpenAI focuses on memorization and market substitution; Authors Guild v. Meta focuses on transformative use. Different facts may produce different doctrinal outcomes.
  • Can technical measures substitute for legal solutions? Content provenance standards (C2PA), training data provenance tracking, and output watermarking offer technical infrastructure for accountability. But their effectiveness depends on universal adoption, which is voluntary.
  • Should AI training compensation be collective or individual? Collective licensing (analogous to music performing rights organizations) could provide scalable compensation. But who would represent the interests of the millions of creators whose works are used as training data?
  • What happens to works that are not opted out? Under the EU framework, works without an explicit opt-out declaration are available for TDM. Does this create a default that disadvantages individual creators who lack the technical knowledge or resources to opt out?
  • Is the distinction between input (training) and output (generation) legally coherent? Current analyses treat training and generation as separate legal events. But from a technical perspective, the output is a function of the input; separating them may be analytically convenient but practically misleading.
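As a sketch of what training-data provenance tracking could look like, the hypothetical registry below records a SHA-256 fingerprint of each ingested work so a later audit can test whether a specific work entered a training set. The function names and registry format are invented for illustration; real standards such as C2PA define richer, cryptographically signed manifests, and exact-hash matching cannot detect near-duplicates or transformed copies.

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash used as a stable identifier for a work."""
    return hashlib.sha256(data).hexdigest()

registry = []  # append-only provenance log (hypothetical format)

def ingest(work_id: str, data: bytes) -> None:
    """Record that a work was added to the training corpus."""
    registry.append({"work_id": work_id, "sha256": fingerprint(data)})

def was_ingested(data: bytes) -> bool:
    """Audit check: did this exact content enter the corpus?"""
    h = fingerprint(data)
    return any(entry["sha256"] == h for entry in registry)

ingest("article-001", b"Some copyrighted text.")
print(was_ingested(b"Some copyrighted text."))  # True
print(was_ingested(b"A different text."))       # False
```

Even this trivial mechanism only works if ingestion pipelines actually write to the log, which is why the open question above turns on universal, and currently voluntary, adoption.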
Implications

The legal status of AI training data will determine how the economic value generated by generative AI is distributed between AI companies and content creators. If training is broadly permissible (under fair use or TDM exceptions), the value flows to AI developers and their users. If training requires licensing, the value is shared, but the transaction costs of licensing millions of works may be prohibitive without collective mechanisms.

The research reviewed here suggests that the current legal frameworks, designed for a world of identifiable copies, discrete uses, and national jurisdictions, are not adequate for a technology that compresses millions of works into statistical representations, deploys them globally, and generates outputs that blur the line between derivation and creation. What is needed is not marginal reform but conceptual innovation: new legal categories that account for the technical reality of AI training and the economic reality of generative AI markets.

References (5)

[1] Thongmeensuk, S. (2024). Rethinking Copyright Exceptions in the Era of Generative AI: Balancing Innovation and Intellectual Property Protection. Journal of World Intellectual Property, 27(4).
[2] Dornis, T.W. & Stober, S. (2025). Generative AI Training and Copyright Law. arXiv:2502.15858.
[3] Pasetti, M., Santos, J.W., Corrêa, N., de Oliveira, N., & Barbosa, C. (2025). Technical, Legal, and Ethical Challenges of Generative AI: An Analysis of the Governance of Training Data and Copyrights. Discover Artificial Intelligence, 5, 379.
[4] Riaz, C.H. (2026). The Legal Status of AI Training Data: A Cross-Jurisdictional Analysis of Copyright, Fair Use, and Text-and-Data Mining. International Journal of Science and Research Archive, 18(1), 166.
[5] Woo, M. (2025). Generative AI and Copyright Law: The De Facto End of the Berne Convention Era and the Need for a Shift in the Public Paradigm. Korean Digital Property Studies, 38(3), 41.
