All the News That’s Fit to Scrape

If you’re reading this post you likely know by now that the New York Times last week filed a massive copyright infringement lawsuit against OpenAI and Microsoft over the unlicensed use of Times content to train the GPT line of generative AI foundation models.

It’s tempting to view this as the Big One, the Battle of the Titans that will make it all the way to the Supreme Court for a definitive resolution of the most contentious question in the realm of AI and copyright. It’s the New York Times, after all, one of the premier names in journalism anywhere in the world, and one of the few publishers with the resources to take on the tech giants and pursue the case to the end.

There are reasons to temper those expectations, however.

The Times and OpenAI had been negotiating for months over a deal to license the Times archive for use in training GPT models. And there have been indications along the way that the paper wasn’t happy with how those discussions were progressing. OpenAI, meanwhile, struck a licensing deal with Axel Springer, publisher of Bild and one of the largest news publishers in Europe.

Sprinkled throughout the Times’ 69-page complaint, moreover, are references to the over-representation of Times content in GPT training datasets, as if to underscore the value of the Newspaper of Record’s content to OpenAI.

Those are all earmarks of the classic negotiation-by-litigation strategy. The goal is not necessarily to win in court but to force OpenAI into agreeing to a deal on the Times’ terms by raising the prospect of potentially billions in statutory and actual damages for copyright infringement.

The Times’ initial filing was accompanied by more than 100,000 pages of attachments and exhibits containing thousands of examples of near-verbatim versions of Times articles generated by ChatGPT in response to prompts, just in case the defendants didn’t take the hint.

The Times complaint also notably names Microsoft as a co-defendant, and it comes barely a month after OpenAI was rocked by internal turmoil that saw it abruptly fire, and then almost as abruptly re-hire, CEO Sam Altman. Microsoft, which has an unusual 49% non-voting stake in OpenAI, played a central role in that drama, first by immediately agreeing to hire Altman and his team after his initial defenestration and then by helping orchestrate his ultimate reinstatement.

Since then, however, Microsoft’s relationship with OpenAI has attracted scrutiny from antitrust and financial regulators in both the U.S. and U.K., and the cloud-computing giant has been at pains to downplay its role in OpenAI’s governance or internal operations. In naming Microsoft as a defendant, and emphasizing its central role in the GPT training process, the Times may be hoping to capitalize on the fallout from the Altman saga by forcing the software giant into an awkward legal posture. By aligning itself with OpenAI in defending the litigation, Microsoft risks highlighting its close ties to the AI company just as it’s working to minimize those ties in the eyes of regulators.

That puts significant pressure on Microsoft to settle the case, lest discovery yield additional ammunition for regulators.

For its part, OpenAI doesn’t appear to be looking for a fight, either.

“We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from AI technology and new revenue models,” OpenAI said in a statement. “Our ongoing conversations with the New York Times have been productive and moving forward constructively, so we are surprised and disappointed with this development. We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.”

Should those efforts fail, however, and the lawsuit proceed to trial, it would not be a slam-dunk case for the Times. It was filed in the U.S. District Court for the Southern District of New York, which sits in the federal Second Circuit, where the controlling law is the Second Circuit’s own rulings in the two Google Books cases. Those rulings held that copying millions of texts to create a searchable index was sufficiently transformative to qualify as a fair use of the copyrighted works.

Given that training a Large Language Model like GPT arguably involves less actual copying of the texts in its training data than Google Books engaged in, the Times will have to clear a high precedential bar to prevail. And, while OpenAI may not be looking for a fight, it has reason to be cautious about any de facto precedent it might set in any deal with the Times.

Times content, like that of Axel Springer, is particularly valuable for training LLMs, both because of its high, fact-checked quality and because its timeliness helps keep models up to speed on current events. But no generative AI model developer, and certainly not OpenAI, wants to concede, or even appear to concede, that licenses are required as a matter of copyright law to include publishers’ content in training datasets. That could lead to a flood of claims from publishers large and small, which could scare off investors wary of the liability, just as OpenAI is in discussions with Microsoft and others about new investments that could raise the AI company’s valuation close to $100 billion.

In comments submitted to the U.S. Copyright Office for its inquiry into AI and copyright, and filed while the company was still in discussions with the Times, OpenAI wrote, “OpenAI believes that the training of AI models qualifies as a fair use, falling squarely in line with established precedents recognizing that the use of copyrighted materials by technology innovators in transformative ways is entirely consistent with copyright law… The factual metadata and fundamental information that AI models learn from training data are not protected by copyright law… And when technical realities require that copyrighted works be reproduced in order to extract and learn from these unprotectable aspects of a work, courts have routinely found those reproductions to be permissible under the fair use doctrine.”

So, to avoid an enormously costly and bruising court battle, OpenAI must find a way to reach an agreement with the New York Times without conceding that it would be committing copyright infringement if it can’t.

All of that leaves the parties staring down a set of very complex, delicate, and now high-stakes calculations. Neither can be completely confident of how the courts would rule on the legal issues raised in the dispute, but neither wants to give up ground it might have been able to keep on a question of existential importance to the future of its industry. And all the while, every copyright owner, technology company and AI user, along with large swaths of the legal and financial professions, the academic world, and policymakers in the U.S. and elsewhere, will be following every twist and turn in the story.

Welcome to 2024.
