AI and the News: Deal, or No Deal?

Reddit, the self-anointed “front page of the internet,” sits atop a huge archive of original content. It contains more than a billion posts created by its 73 million average daily unique users, self-organized into more than 100,000 interest-based communities, or subreddits, ranging from sports to politics, technology, pets, movies, music & TV, health & nutrition, business, philosophy and home & garden. You name it, there’s likely to be a subreddit for it.

The scale and diversity of the Reddit archive, replete with uncounted links to all corners of the World Wide Web and made freely accessible via API, has long been a highly valued resource for researchers, academics and developers building third-party applications for accessing Reddit communities. More recently, it has also been eagerly mined by developers of generative AI tools in need of large troves of natural language texts on which to train their models.

Last spring, in response to the new breed of large-scale commercial users of its API, including several major technology companies, Reddit decided to put up a toll booth. It changed its API policies to begin charging commercial enterprise users for access to its archive and imposed limits on the number of API calls such developers can make. Its new terms of service singled out AI training as one of those commercial purposes to which the new terms will apply.

Except as expressly permitted by this section, no other rights or licenses are granted or implied, including any right to use User Content for other purposes, such as for training a machine learning or AI model, without the express permission of rightsholders in the applicable User Content.

“Reddit’s corpus of data is really valuable,” CEO Steve Huffman said at the time. “We don’t need to give all of that value to some of the largest companies in the world for free.”

Last week, Reddit announced a deal with just such a commercial user, Google, under which the search giant will pay $60 million a year to access the Reddit archive for various purposes, including for AI training, marking the first agreement of its kind to be publicly revealed.

Reddit is not the only large publisher to have recently tightened access to its archive, however, in response to the rise of generative AI. The New York Times updated its terms of service in August to prohibit the use of its content to develop “any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.” It also warned that the use of bots to crawl and scrape its website without written permission from the paper could result in fines or other unspecified penalties.

The Times subsequently entered into protracted discussions with OpenAI about a deal for access. Those talks broke down, however, and the relationship devolved into the Times’ massive copyright infringement lawsuit against the ChatGPT maker.

The failure of those negotiations stands in sharp contrast to the successful outcome of Reddit’s engagement with Google. And that difference could prove revealing about future negotiations between publishers and AI developers over access to and the use of publishers’ content for training.

Without having been in the room, it’s impossible to know precisely what was said between the Times and OpenAI. But as discussed here in a previous post, all indications are that the dispute is fundamentally about price. The Times was demanding more than OpenAI was willing to pay. And what the Times was offering in exchange appears to have been simply non-exclusive access to its archive and a promise not to sue.

Reddit’s deal with Google, by contrast, is not a simple pay-for-play affair. While the $60 million a year Google reportedly will pay Reddit sounds like a lot of money, for a company of Google’s size it’s barely more than a rounding error, and it’s not even clear that it will be entirely in cash.

Notably, and likely not coincidentally, the deal was announced in the same week that Reddit registered with the SEC to pursue an initial public offering (IPO) of stock. According to its S-1 filing, Reddit has never turned a profit in its 18 years of operation, including the 13 years since it was spun out as a standalone entity by former owner Advance Media, parent of Condé Nast, which had acquired Reddit in 2006. It posted a net loss of $91 million in 2023. Nearly all of its revenue is generated by advertising on its platform amid the intensely competitive market for digital ad dollars.

What Reddit is really getting from the Google deal, other than cash, is evidence of a monetizable asset it can tout to prospective shareholders independent of ad revenue. It also will get access to Vertex AI, Google’s cloud-based AI development platform. According to Google’s announcement of the deal, Reddit intends to use Vertex AI to enhance search and other capabilities on the Reddit platform.

What Google gets is efficient access to real-time, structured data via Reddit’s API without the need to crawl, scrape and normalize the data itself.

In other words, while Google is technically licensing Reddit’s content for use as AI training data, as many publishers are demanding, that license is embedded in a broader commercial agreement between the companies in which each party is getting something of value beyond the cash involved.

It’s a model of successful deal making. But it’s a model that may not be widely replicable beyond the small number of large publishers and AI developers with enough assets and interests to be viable counterparties in a broad commercial arrangement.

For most publishers, simply opening the cash window and waiting to take orders is not likely to be enough.
