AI Training Lawsuits Aren’t Really About “Fair Use” Anymore — They’re a Discovery War Over the One Dataset You’re Not Allowed to See
“Fair use” drives the headlines, but discovery drives leverage. The real fight is over what must be preserved, produced, and explained—then locked behind Attorneys’ Eyes Only.

Key Points
1. Track discovery, not just fair use: motions to compel datasets, retention logs, and pipeline documentation increasingly decide leverage and proof.
2. Expect secrecy even after “wins”: Highly Confidential – Attorneys’ Eyes Only designations and secure-room protocols can make produced datasets functionally unseen.
3. Watch MDL spillover: the April 3, 2025 consolidation can standardize discovery rules that scale across many plaintiffs and shape the industry dispute.
For years, the public story of AI copyright litigation has been a single, magnetic phrase: fair use. It’s the headline hook, the cable-news shorthand, the question people ask at dinner. Can a model learn from copyrighted work without permission?
Inside the courthouse, though, the fights that decide momentum often sound less philosophical and more procedural. They turn on what a company must preserve, produce, and explain about what went into a model—and what the model did after it shipped.
That shift matters because it changes who holds leverage. Plaintiffs argue they cannot prove copying, substantial similarity, or market harm without seeing what was ingested and how. Defendants argue plaintiffs are trying to force disclosure of the “secret sauce”: enormous training corpora, internal pipelines, and proprietary decision-making that are costly and risky to expose.
“In AI copyright cases, ‘fair use’ gets the headlines. Discovery gets the leverage.”
— TheMurrow
The result is an unusual kind of visibility. Courts can order production of a training dataset and still keep it effectively out of reach—sealed behind protective orders, available only under strict protocols, and often restricted to a narrow set of attorneys. The most important evidence in the case becomes legally visible yet operationally opaque.
The center of gravity moves from “fair use” to discovery
Debevoise’s analysis of recent disputes captures the structural tension. U.S. discovery is broad, and plaintiffs insist they need real evidence of what the model maker did—training data sources, retention practices, internal evaluations, and more. Defendants counter that training corpora are enormous, disclosure is burdensome, and producing data at scale risks revealing trade secrets. The burden argument isn’t theatrical; training datasets can be so large that even defining “the dataset” becomes a technical and legal question.
Why discovery is where cases are won or lost
Plaintiffs’ core problem is informational. Without access to the data pipeline, it’s difficult to show:
- whether copyrighted works were copied into a training corpus,
- how those copies were stored or retained,
- whether outputs can substitute for originals in a way that drives market harm.
Defendants’ core problem is exposure. Even if a company believes training is lawful, broad production can reveal:
- proprietary curation methods,
- internal risk assessments,
- the identities and composition of sources.
The MDL effect: one ruling can set the rules for many cases
Key statistic #1: The MDL Transfer Order is dated April 3, 2025—a concrete marker that the court system expects discovery to be sprawling and shared across cases.
“In an MDL, a single discovery protocol can become the de facto rulebook for an entire industry dispute.”
— TheMurrow
The “one dataset you’re not allowed to see”
Reporting on OpenAI-related litigation describes discovery protocols that read like a high-security exam. Inspection may occur in a secure room on a locked-down computer with no internet access. Limits can apply to devices, copying, and who can participate. The designation that drives this is familiar to litigators and alien to most readers: Highly Confidential – Attorneys’ Eyes Only.
What “Attorneys’ Eyes Only” really means
That restriction is not mere legal theater. Defendants argue that training data and curation logic are trade secrets. Courts often accept that premise while still requiring enough production to allow the case to proceed.
The upshot is an accountability paradox. The court can compel disclosure, yet the public record remains thin, and even the plaintiffs themselves may rely on filtered summaries from their legal teams.
Why this matters beyond the courtroom
Practical implication: as AI cases progress, expect fewer definitive public disclosures than the headlines suggest, even when plaintiffs “win” motions to compel.
The GPT‑4 training dataset order—and what it signals
Key statistic #2: The order date—January 27, 2025—matters because it marks a moment when a court required production of core training materials rather than accepting narrower substitutes.
OpenAI argued for a compromise approach, proposing what Debevoise describes as a limited “20,000-word solution” rather than full production. The dispute reflects a recurring pattern: defendants offer partial disclosures, plaintiffs argue partials won’t do.
Why “produce the dataset” is not a simple instruction
A training “dataset” is not a single object. Depending on the pipeline stage, it can mean:
- raw source lists,
- processed corpora,
- logs showing ingestion and filtering,
- documentation of how components were assembled.
Courts have to translate technical systems into discoverable categories. Defendants often frame full production as burdensome and risky. Plaintiffs frame it as the only path to proving what happened.
The public lesson: a big order doesn’t guarantee transparency
“A dataset can be ‘produced’ and still remain functionally unseen.”
— TheMurrow
The second dataset war: user chat logs and output evidence
In publisher litigation, including The New York Times v. OpenAI/Microsoft, preservation and retention have become flashpoints. OpenAI has publicly stated it was “no longer under a legal order to retain consumer ChatGPT and API content indefinitely,” and characterized The New York Times’ demand as seeking retention for a specific window—OpenAI cites April–September 2025.
Key statistic #3: The contested retention window described by OpenAI spans April–September 2025.
Why outputs matter to market-harm arguments
- Did users prompt models for passages that resemble copyrighted text?
- How often did the model respond with content that competes with the original?
- Were there guardrails, and did they work?
Plaintiffs often argue that without logs, the defendant can deny harm while holding the best evidence. Defendants reply that retaining large volumes of user content creates privacy, security, and compliance risks.
Preservation is not neutral—it shapes what can be proven
The tension is real. Broad preservation may protect evidence but expand the amount of sensitive user data held for longer than intended. Narrow preservation may reduce privacy exposure but limit what plaintiffs can prove.
Practical implication for readers: if you want to know whether AI tools are “replacing” certain kinds of content, the best empirical evidence may sit in retention and logging systems—and those systems are now litigation terrain.
Consolidation turns discovery into an industry-wide stress test
The JPML Transfer Order dated April 3, 2025 explicitly points to shared factual questions and the prospect of “overlapping, complex, and voluminous discovery.” That is the court system acknowledging that the hard part is not a single legal question—it’s the fact-finding across technical systems and corporate practices.
Key statistic #4: The MDL transfer consolidates “numerous” actions into one coordinated proceeding, elevating the impact of any discovery order across plaintiffs and claims. (The order emphasizes the discovery volume and complexity as a principal justification.)
What defendants fear in MDL discovery
- one plaintiff’s discovery theory becomes everyone’s,
- one protocol exposes a broader slice of proprietary infrastructure,
- one adverse preservation ruling increases operational burdens across products.
What plaintiffs gain—and what they still can’t get
Consolidation gives plaintiffs coordinated discovery and the weight of shared rulings—but protective orders still keep the underlying data out of their own hands. The larger implication is sobering: the most significant public debate about AI training may be litigated largely in private, with critical evidence reviewed behind sealed doors.
Fair use still matters—and still depends on facts plaintiffs seek in discovery
Rulings like the GPT‑4 dataset order show why discovery remains central even when the legal doctrine appears to favor model developers. Fair use is famously fact-dependent. Plaintiffs will try to show facts that push a court toward skepticism: improper acquisition, retention of copies, or outputs that compete with originals.
The facts that can tilt a fair-use analysis
- how data was acquired (licensed, scraped, or sourced from “shadow libraries”),
- whether copies were retained in ways that go beyond transient processing,
- whether internal documents suggest awareness of risk or willfulness,
- whether the system’s outputs can substitute for the works at issue.
Defendants, for their part, argue that overbroad discovery chills innovation, threatens trade secrets, and risks exposing sensitive security practices.
The uneasy truth: fair use is the doctrine, but discovery supplies the facts that determine how doctrine applies.
Practical takeaways: what to watch as these cases unfold
For creators and publishers
- Outputs may matter as much as inputs. If market harm is the claim, log retention and output testing become critical.
- Consolidation changes leverage. MDL proceedings can streamline plaintiffs’ efforts but also standardize restrictive protocols.
For AI companies and product teams
- Retention policies are legal strategy. How long logs exist—and why—may become contested evidence.
- Protective orders are not a shield against burden. Even “secure room” review imposes operational costs.
For policymakers and the public
- Discovery disputes hint at future regulation. If courts repeatedly struggle to evaluate training and outputs without access, lawmakers may consider standardized audit or documentation frameworks.
What to watch in future filings
- ✓ Motions to compel training datasets and pipeline documentation
- ✓ Protective-order terms (AEO, secure-room review, copying limits)
- ✓ Preservation fights over user chat logs and output evidence
- ✓ MDL-wide discovery protocols that become default “industry rules”
Conclusion: the real battle is over what can be known
Courts will eventually hand down fair-use rulings, and those rulings will dominate the coverage. Yet the deeper contest is epistemic. Who gets to know what happened inside the training pipeline? Who can test what the model outputs at scale? What evidence must be preserved, and for how long? Discovery determines the answers—or determines that the answers stay locked behind “Attorneys’ Eyes Only.”
In the coming years, many of the most consequential “AI transparency” decisions may arrive not as sweeping policy pronouncements, but as discovery orders: dates, protocols, retention windows, confidentiality designations. The public may see only the silhouettes. The parties will fight over the blueprints.
If you want to understand how these cases will end, follow the fair-use briefs. If you want to understand how they’ll be decided, follow the discovery.
Frequently Asked Questions
Why is discovery so important in AI copyright lawsuits?
Discovery is how plaintiffs obtain evidence about what data was used, how it was processed, and what the system produced. Without that evidence, plaintiffs argue they cannot prove copying, substantial similarity, or market harm. Defendants respond that the requests are burdensome and risk exposing trade secrets, especially because training corpora and pipelines are massive and proprietary.
What does “Highly Confidential – Attorneys’ Eyes Only” mean?
It’s a restrictive confidentiality designation used in litigation. Materials labeled this way can typically be reviewed only by certain lawyers and approved experts, not by the parties themselves. In AI training-data disputes, such designations can come with strict viewing rules—like secure-room inspection and limitations on copying—making access legally granted but practically constrained.
Did a court really order OpenAI to produce GPT‑4 training data?
According to Debevoise’s account of the Tremblay authors litigation, a federal judge in the Northern District of California ordered OpenAI on January 27, 2025 to produce a dataset used to train GPT‑4. OpenAI argued a narrower “20,000-word solution” would be a reasonable compromise, highlighting how contested the scope of production can be.
Why are user chat logs and output logs part of copyright discovery fights?
Plaintiffs may argue that outputs substitute for copyrighted works, causing market harm. Logs can help test how often users request copyrighted content and what the system returns. Defendants often argue that broad retention creates privacy and compliance risks. OpenAI has publicly discussed disputes over retention demands, including a referenced April–September 2025 window in the NYT matter.
What is an MDL, and why does it matter here?
An MDL (multidistrict litigation) centralizes related federal cases to coordinate pretrial proceedings, especially discovery. The JPML’s April 3, 2025 transfer order consolidating copyright actions involving OpenAI/Microsoft emphasized shared factual questions and “overlapping, complex, and voluminous discovery.” That means one judge’s discovery rulings can effectively set rules for many plaintiffs.
If training can be fair use, why do plaintiffs keep pushing for discovery?
Fair use depends on facts: how data was acquired, whether copies were retained, and whether outputs compete with originals. Even where training looks more defensible, plaintiffs may pursue evidence of pirated sourcing, retention practices, or substitution effects. Discovery is where those facts—if they exist—can be tested rather than assumed.















