Is AI Training On News Content Fair Use?
Online news articles are an essential source of training data for generative AI language models. Should developers compensate publishers for this content, or is AI training an example of 'fair use'?
Artificial intelligence (AI) is only as good as the data it is trained on, and in the case of generative AI, much of that data comes from online news content. OpenAI says that content is essential for development, which may be true, but is it fair? Publishers are split on the answer to that question, so their reactions range from collaborative to litigious.
We look at the increasingly complicated relationship between generative AI companies and the news sites whose content supplies their training data.
How generative AI developers source and use news content
Training AI models requires vast amounts of data, and in the case of generative AI, that means virtually everything on the internet. Thus far, companies like OpenAI and Microsoft have used content from across the web to train their large language models (LLMs) and to answer a wide variety of questions, often without attributing the sources used.
“There are a number of factors driving concerns about the threat posed by AI,” according to the World Press Trends Outlook 2023-2024 from WAN-IFRA. “These include copyright and IP concerns, including remuneration or compensation from tech companies using media companies’ content to train AI tools and platforms.”
The WAN-IFRA report found that 79% of survey respondents said they were either “somewhat concerned” (59%) or “very concerned” (20%) about the threat of AI to their business. Still, training generative AI chatbots without using others’ copyrighted work may be impossible. As The Guardian reports, in a submission to the House of Lords communications and digital select committee, OpenAI argued that “Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.” Essentially, to answer questions about current affairs and recent history, AI needs access to original content created by news organizations.
Copyright infringement is just the tip of the iceberg for many news organizations. Gartner predicts traditional search engine volume will drop 25% by 2026 as users shift to AI chatbots and virtual agents for their answers. This could have major ramifications for search traffic, threatening the livelihood of the organizations providing the data the AI models need.
News content is not the only contested source of AI training data, either. It was recently reported that, in training its models, OpenAI made use of text transcribed from a million hours of YouTube video clips.
Protecting intellectual property: publishers take a legal stand against AI training
Publishers have responded to AI developers’ use of their content as training data in widely different ways, but for many, the first response was litigation. The New York Times is currently suing OpenAI and Microsoft for using its copyrighted work to train the AI models behind ChatGPT and Bing. The Washington Post reports, “It’s the latest, and some believe the strongest, in a bevy of active lawsuits alleging that various tech and artificial intelligence companies have violated the intellectual property of media companies, photography sites, book authors and artists.”
What exactly is 'fair use'?
Everyone from visual artists to major news organizations has brought claims against generative AI companies, and most of those claims hinge on the idea of “fair use.” Stanford defines fair use as “any copying of copyrighted material done for a limited and ‘transformative’ purpose, such as to comment upon, criticize, or parody a copyrighted work. Such uses can be done without permission from the copyright owner. In other words, fair use is a defense against a claim of copyright infringement.”
When a podcast plays a clip of a movie or a song before discussing it, that is a clear-cut example of acceptable “fair use.” As the Washington Post points out, generative AI exists in a strange, liminal space: “Broadly speaking, copyright law distinguishes between ripping off someone else’s work verbatim — which is generally illegal — and ‘remixing’ or putting it to a new, creative use. What is confounding about AI systems, said James Grimmelmann, a professor of digital and information law at Cornell University, is that in this case they seem to be doing both.”
While the courts determine whether using news sites’ content to train AI chatbots qualifies as fair use, some organizations are taking a different approach.
An alternative to litigation: agreements between publishers and AI developers
Not all publishers have elected to fight the AI developers in court. The New York Times reported that “after the lawsuit was filed, those companies [OpenAI and Microsoft] noted that they were in discussions with a number of news organizations on using their content — and, in the case of OpenAI, had begun to sign deals.”
For example, in July 2023 The Associated Press (AP) struck a deal licensing its archive of news stories to OpenAI for use in training ChatGPT.
In May 2024, News Corp signed an agreement with OpenAI allowing ChatGPT to use its content when replying to users “after a certain delay,” in a deal said to be worth $250 million over five years.
Axel Springer has inked a similar deal with OpenAI. According to Forbes, “ChatGPT will soon summarize news articles from Politico, Business Insider and other Axel Springer-owned publications — and could include content otherwise available only to paid subscribers — in an unprecedented new agreement that could shape the future of journalism’s relationship with artificial intelligence.”
This deal requires ChatGPT to provide attribution and links to full stories, many of which will sit behind a paywall. Whether the deal will drive new traffic and subscriptions remains to be seen; it is just one of the many unresolved questions facing AI and the news industry.
A legislative solution?
Ultimately, governments will likely have to intervene to decide whether the ability to train AI models outweighs the copyright, and the survival, of news organizations. In some jurisdictions, they will likely side with the media, as they have in cases against social media companies, and require AI companies to at least compensate publishers for their work.