AI Training Lawsuits Aren’t Really About “Fair Use” Anymore — They’re a Discovery War Over the One Dataset You’re Not Allowed to See
“Fair use” drives the headlines, but discovery drives leverage. The real fight is over what must be preserved, produced, and explained—then locked behind Attorneys’ Eyes Only.

Key Points
1. Track discovery, not just fair use: motions to compel datasets, retention logs, and pipeline documentation increasingly decide leverage and proof.
2. Expect secrecy even after “wins”: Highly Confidential – Attorneys’ Eyes Only designations and secure-room protocols can make produced datasets functionally unseen.
3. Watch MDL spillover: the April 3, 2025 consolidation can standardize discovery rules that scale across many plaintiffs and shape the industry dispute.
For years, the public story of AI copyright litigation has been a single, magnetic phrase: fair use. It’s the headline hook, the cable-news shorthand, the question people ask at dinner. Can a model learn from copyrighted work without permission?
Inside the courthouse, though, the fights that decide momentum often sound less philosophical and more procedural. They turn on what a company must preserve, produce, and explain about what went into a model—and what the model did after it shipped.
That shift matters because it changes who holds leverage. Plaintiffs argue they cannot prove copying, substantial similarity, or market harm without seeing what was ingested and how. Defendants argue plaintiffs are trying to force disclosure of the “secret sauce”: enormous training corpora, internal pipelines, and proprietary decision-making that are costly and risky to expose.
“In AI copyright cases, ‘fair use’ gets the headlines. Discovery gets the leverage.”
— TheMurrow
The result is an unusual kind of visibility. Courts can order production of a training dataset and still keep it effectively out of reach—sealed behind protective orders, available only under strict protocols, and often restricted to a narrow set of attorneys. The most important evidence in the case becomes legally visible yet operationally opaque.
The center of gravity moves from “fair use” to discovery
Debevoise’s analysis of recent disputes captures the structural tension. U.S. discovery is broad, and plaintiffs insist they need real evidence of what the model maker did—training data sources, retention practices, internal evaluations, and more. Defendants counter that training corpora are enormous, disclosure is burdensome, and producing data at scale risks revealing trade secrets. The burden argument isn’t theatrical; training datasets can be so large that even defining “the dataset” becomes a technical and legal question.
Why discovery is where cases are won or lost
Plaintiffs’ core problem is informational. Without access to the data pipeline, it’s difficult to show:
- whether copyrighted works were copied into a training corpus,
- how those copies were stored or retained,
- whether outputs can substitute for originals in a way that drives market harm.
Defendants’ core problem is exposure. Even if a company believes training is lawful, broad production can reveal:
- proprietary curation methods,
- internal risk assessments,
- the identities and composition of sources.
The MDL effect: one ruling can set the rules for many cases
Key statistic #1: The MDL Transfer Order is dated April 3, 2025—a concrete marker that the court system expects discovery to be sprawling and shared across cases.
“In an MDL, a single discovery protocol can become the de facto rulebook for an entire industry dispute.”
— TheMurrow
The “one dataset you’re not allowed to see”
Reporting on OpenAI-related litigation describes discovery protocols that read like a high-security exam. Inspection may occur in a secure room on a locked-down computer with no internet access. Limits can apply to devices, copying, and who can participate. The designation that drives this is familiar to litigators and alien to most readers: Highly Confidential – Attorneys’ Eyes Only.
What “Attorneys’ Eyes Only” really means
That restriction is not mere legal theater. Defendants argue that training data and curation logic are trade secrets. Courts often accept that premise while still requiring enough production to allow the case to proceed.
The upshot is an accountability paradox. The court can compel disclosure, yet the public record remains thin, and even the plaintiffs themselves may rely on filtered summaries from their legal teams.
Why this matters beyond the courtroom
Practical implication: as AI cases progress, expect fewer definitive public disclosures than the headlines suggest, even when plaintiffs “win” motions to compel.
The GPT‑4 training dataset order—and what it signals
Key statistic #2: The order date—January 27, 2025—matters because it marks a moment when a court required production of core training materials rather than accepting narrower substitutes.
OpenAI argued for a compromise approach, proposing what Debevoise describes as a limited “20,000-word solution” rather than full production. The dispute reflects a recurring pattern: defendants offer partial disclosures, plaintiffs argue partials won’t do.
Why “produce the dataset” is not a simple instruction
A training “dataset” is not a single object. Depending on the pipeline stage, it can mean:
- raw source lists,
- processed corpora,
- logs showing ingestion and filtering,
- documentation of how components were assembled.
Courts have to translate technical systems into discoverable categories. Defendants often frame full production as burdensome and risky. Plaintiffs frame it as the only path to proving what happened.
The public lesson: a big order doesn’t guarantee transparency
“A dataset can be ‘produced’ and still remain functionally unseen.”
— TheMurrow
The second dataset war: user chat logs and output evidence
In publisher litigation, including The New York Times v. OpenAI/Microsoft, preservation and retention have become flashpoints. OpenAI has publicly stated it was “no longer under a legal order to retain consumer ChatGPT and API content indefinitely,” and characterized The New York Times’ demand as seeking retention for a specific window—OpenAI cites April–September 2025.
Key statistic #3: The contested retention window described by OpenAI spans April–September 2025.
Why outputs matter to market-harm arguments
- Did users prompt models for passages that resemble copyrighted text?
- How often did the model respond with content that competes with the original?
- Were there guardrails, and did they work?
Plaintiffs often argue that without logs, the defendant can deny harm while holding the best evidence. Defendants reply that retaining large volumes of user content creates privacy, security, and compliance risks.
Preservation is not neutral—it shapes what can be proven
The tension is real. Broad preservation may protect evidence but expand the amount of sensitive user data held for longer than intended. Narrow preservation may reduce privacy exposure but limit what plaintiffs can prove.
Practical implication for readers: if you want to know whether AI tools are “replacing” certain kinds of content, the best empirical evidence may sit in retention and logging systems—and those systems are now litigation terrain.
Consolidation turns discovery into an industry-wide stress test
The JPML Transfer Order dated April 3, 2025 explicitly points to shared factual questions and the prospect of “overlapping, complex, and voluminous discovery.” That is the court system acknowledging that the hard part is not a single legal question—it’s the fact-finding across technical systems and corporate practices.
Key statistic #4: The MDL transfer consolidates “numerous” actions into one coordinated proceeding, elevating the impact of any discovery order across plaintiffs and claims. (The order emphasizes the discovery volume and complexity as a principal justification.)
What defendants fear in MDL discovery
- one plaintiff’s discovery theory becomes everyone’s,
- one protocol exposes a broader slice of proprietary infrastructure,
- one adverse preservation ruling increases operational burdens across products.
What plaintiffs gain—and what they still can’t get
Consolidation gives plaintiffs coordinated discovery and the weight of shared rulings—but protective orders still keep the underlying data out of their own hands. The larger implication is sobering: the most significant public debate about AI training may be litigated largely in private, with critical evidence reviewed behind sealed doors.
Fair use still matters—and still depends on facts plaintiffs seek in discovery
Rulings like the GPT‑4 dataset order show why discovery remains central even when the legal doctrine appears to favor model developers. Fair use is famously fact-dependent. Plaintiffs will try to show facts that push a court toward skepticism: improper acquisition, retention of copies, or outputs that compete with originals.
The facts that can tilt a fair-use analysis
- how data was acquired (licensed, scraped, or sourced from “shadow libraries”),
- whether copies were retained in ways that go beyond transient processing,
- whether internal documents suggest awareness of risk or willfulness,
- whether the system’s outputs can substitute for the works at issue.
Defendants, for their part, argue that overbroad discovery chills innovation, threatens trade secrets, and risks exposing sensitive security practices.
The uneasy truth: fair use is the doctrine, but discovery supplies the facts that determine how doctrine applies.
Practical takeaways: what to watch as these cases unfold
For creators and publishers
- Outputs may matter as much as inputs. If market harm is the claim, log retention and output testing become critical.
- Consolidation changes leverage. MDL proceedings can streamline plaintiffs’ efforts but also standardize restrictive protocols.
For AI companies and product teams
- Retention policies are legal strategy. How long logs exist—and why—may become contested evidence.
- Protective orders are not a shield against burden. Even “secure room” review imposes operational costs.
For policymakers and the public
- Discovery disputes hint at future regulation. If courts repeatedly struggle to evaluate training and outputs without access, lawmakers may consider standardized audit or documentation frameworks.
What to watch in future filings
- ✓ Motions to compel training datasets and pipeline documentation
- ✓ Protective-order terms (AEO, secure-room review, copying limits)
- ✓ Preservation fights over user chat logs and output evidence
- ✓ MDL-wide discovery protocols that become default “industry rules”
Conclusion: the real battle is over what can be known
Courts will eventually hand down fair-use rulings, and those rulings will dominate the coverage. Yet the deeper contest is epistemic. Who gets to know what happened inside the training pipeline? Who can test what the model outputs at scale? What evidence must be preserved, and for how long? Discovery determines the answers—or determines that the answers stay locked behind “Attorneys’ Eyes Only.”
In the coming years, many of the most consequential “AI transparency” decisions may arrive not as sweeping policy pronouncements, but as discovery orders: dates, protocols, retention windows, confidentiality designations. The public may see only the silhouettes. The parties will fight over the blueprints.
If you want to understand how these cases will end, follow the fair-use briefs. If you want to understand how they’ll be decided, follow the discovery.
Frequently Asked Questions
Why is discovery so important in AI copyright lawsuits?
Discovery is how plaintiffs obtain evidence about what data was used, how it was processed, and what the system produced. Without that evidence, plaintiffs argue they cannot prove copying, substantial similarity, or market harm. Defendants respond that the requests are burdensome and risk exposing trade secrets, especially because training corpora and pipelines are massive and proprietary.
What does “Highly Confidential – Attorneys’ Eyes Only” mean?
It’s a restrictive confidentiality designation used in litigation. Materials labeled this way can typically be reviewed only by certain lawyers and approved experts, not by the parties themselves. In AI training-data disputes, such designations can come with strict viewing rules—like secure-room inspection and limitations on copying—making access legally granted but practically constrained.
Did a court really order OpenAI to produce GPT‑4 training data?
According to Debevoise’s account of the Tremblay authors litigation, a federal judge in the Northern District of California ordered OpenAI on January 27, 2025 to produce a dataset used to train GPT‑4. OpenAI argued a narrower “20,000-word solution” would be a reasonable compromise, highlighting how contested the scope of production can be.
Why are user chat logs and output logs part of copyright discovery fights?
Plaintiffs may argue that outputs substitute for copyrighted works, causing market harm. Logs can help test how often users request copyrighted content and what the system returns. Defendants often argue that broad retention creates privacy and compliance risks. OpenAI has publicly discussed disputes over retention demands, including a referenced April–September 2025 window in the NYT matter.
What is an MDL, and why does it matter here?
An MDL (multidistrict litigation) centralizes related federal cases to coordinate pretrial proceedings, especially discovery. The JPML’s April 3, 2025 transfer order consolidating copyright actions involving OpenAI/Microsoft emphasized shared factual questions and “overlapping, complex, and voluminous discovery.” That means one judge’s discovery rulings can effectively set rules for many plaintiffs.
If training can be fair use, why do plaintiffs keep pushing for discovery?
Fair use depends on facts: how data was acquired, whether copies were retained, and whether outputs compete with originals. Even where training looks more defensible, plaintiffs may pursue evidence of pirated sourcing, retention practices, or substitution effects. Discovery is where those facts—if they exist—can be tested rather than assumed.















