Merriam-Webster and Britannica Sue OpenAI for 'Cannibalizing' Dictionary Data
Key Takeaways
- Merriam-Webster and its parent company Britannica have filed a lawsuit against OpenAI, alleging that ChatGPT was trained on their proprietary reference material without permission.
- The plaintiffs argue that the AI's ability to provide instant definitions has decimated their web traffic and threatens the economic viability of traditional lexicography.
Key Facts
- Lawsuit filed on March 17, 2026, by Merriam-Webster and Britannica against OpenAI.
- Plaintiffs allege ChatGPT was trained on proprietary definitions without a licensing agreement.
- The complaint highlights 'cannibalization' of web traffic as a primary economic harm.
- Merriam-Webster argues that its unique arrangement and expression of facts are copyright-protected.
- The suit follows similar high-profile legal actions from the New York Times and the Authors Guild.
Analysis
The legal battleground for generative AI has expanded from the realms of creative literature and visual arts into the foundational world of reference data. Merriam-Webster and its parent company, Britannica, filed a lawsuit on March 17, 2026, against OpenAI, claiming that the AI giant’s flagship product, ChatGPT, was built using vast quantities of their copyrighted dictionary and encyclopedia entries without authorization. This case represents a significant escalation in the ongoing conflict between content owners and AI developers, shifting the focus to the highly structured, factual data that gives Large Language Models (LLMs) their linguistic precision and semantic depth.
At the heart of the complaint is the allegation that OpenAI’s training process involved scraping millions of definitions, etymologies, and usage examples that Merriam-Webster and Britannica have spent decades—and in some cases, centuries—curating. While facts themselves are generally not copyrightable under U.S. law, the specific expression, arrangement, and comprehensive compilation of those facts in a dictionary are protected. The plaintiffs argue that ChatGPT does not merely provide information but replicates the unique voice and structural expertise of their reference works, effectively creating a derivative product that competes directly with the original sources.
The economic argument presented by the publishers is particularly compelling: the cannibalization of web traffic. For decades, the business model for digital dictionaries has relied on search engine traffic leading users to their websites, where ad revenue and premium subscriptions sustain the costly work of lexicography. By providing instant, high-quality definitions within its own interface, ChatGPT bypasses the need for users to visit external reference sites. Merriam-Webster and Britannica contend that this is not a case of transformative fair use, but rather a parasitic relationship where the AI uses the publishers' own data to render their primary distribution channels obsolete.
This lawsuit follows a pattern of litigation from other content-heavy industries, including suits brought by the New York Times and the Authors Guild. However, the Merriam-Webster case is unique because it targets the very building blocks of language that AI models require to function. If the courts side with the publishers, it could force a radical shift in how AI companies source their training data. We may see the emergence of a mandatory licensing regime for reference data, similar to how music streaming services pay royalties to labels and artists. For OpenAI, which has recently sought to strike licensing deals with news organizations like Axel Springer and the Associated Press, this lawsuit suggests that the fair use defense is becoming increasingly difficult to maintain as a blanket strategy.
What to Watch
Industry analysts suggest that the outcome of this case will hinge on whether the court views ChatGPT’s output as a substitute for a dictionary. If a user asks for a definition and receives a response that is substantially similar to a Merriam-Webster entry, the claim of market harm becomes much stronger. Conversely, OpenAI is expected to argue that its models learn the patterns of language rather than storing specific entries, and that the resulting definitions are synthesized on the fly. This technical distinction will be a central point of contention as the discovery process begins.
Looking ahead, the resolution of this dispute will likely dictate the future of the Open Web. If reference giants like Britannica can successfully gate their content behind paywalls or licensing fees, the era of free, high-quality information being easily accessible to AI scrapers may be coming to an end. This could lead to a bifurcated AI landscape where only the wealthiest tech companies can afford to train models on verified, high-authority data, while smaller players are left with lower-quality, unverified web-scraped content. The case is a stark reminder that in the AI economy, data is not just the new oil—it is the sovereign territory of the institutions that created it.
Timeline
ChatGPT Launch
OpenAI releases ChatGPT, utilizing vast amounts of web-scraped data for training.
NYT Lawsuit
The New York Times sues OpenAI for copyright infringement, an early high-profile test of how copyright law applies to AI training.
Dictionary Suit Filed
Merriam-Webster and Britannica file a joint lawsuit alleging data theft and traffic loss.