Facing a potentially ruinous lawsuit from the New York Times over the unlicensed use of the newspaper’s reporting to train its GPT large language models, OpenAI is putting out the word that it is not opposed to paying publishers for access to their content, as it recently did with Axel Springer.
“We are in the middle of many negotiations and discussions with many publishers. They are active. They are very positive,” Tom Rubin, OpenAI’s chief of intellectual property and content, told Bloomberg News. “You’ve seen deals announced, and there will be more in the future.”
It’s apparently just opposed to paying very much. According to reporting by The Information, OpenAI has been offering publishers between $1 million and $5 million a year to license their copyrighted news articles.
The Times was no doubt looking for much more than token payments. And OpenAI might even have been willing to up its offer a bit for the Times’s prestigious content. But the fact that the two sides were unable to close the gap is another indication that their dispute is as much about price as it is about copyright principles.
Yet it could also point to a potentially unbridgeable gap in how AI companies and publishers (as well as other rights owners) value any particular tranche of content.
Once upon a time, the expensive, labor-intensive process of gathering and reporting the news that the Times and other publishers undertook was subsidized primarily by high-margin advertising sales. While unfettered independent news reporting plays a vital role in democracy, from a purely business perspective the reporting that process yielded was valuable mainly for the advertiser-favored demographics of the readership it attracted. Some of those readers paid to subscribe to their favorite newspaper; others purchased single copies from newsstands. For the most part, though, they were paying for the cost of delivery, not for the content per se.
These days, that print-era economic model has evaporated, done in by the triumph of low-margin, programmatic digital advertising and the Google-Meta duopoly on the sell-side of ad placement.
It’s taken more than a decade, but at least some publishers have begun to figure out how to make their content pay more of the freight for its production via digital subscriptions, now that the marginal cost of delivery is effectively zero.
The Times, for instance, had 9.4 million digital-only subscribers worldwide as of the end of Q3 2023, far more than it ever had in its print-only, big-city heyday and far more than the 643,000 print subs it still has.
Rather than mostly a cost center, in other words, the Times’s content is now its bread and butter. It’s what its readers are paying for. The idea that anyone should be allowed to simply scrape it all up and mash it into a generative AI soup that can then substitute for the real thing is both ideologically anathema and economically disastrous for publishers. And five million bucks a year just ain’t gonna cut it.
What that perspective doesn’t reckon with, however, is the gargantuan amount of content an LLM like GPT needs to be fed to learn how syntax, grammar and vocabulary work. OpenAI’s GPT-4, for instance, is reported to have ingested something on the order of 13 trillion tokens, each token roughly the equivalent of a word or part of a word, in its training data, amounting to tens of terabytes of text. That’s more words than the Times has ever published.
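For a rough sense of scale, here is a back-of-envelope calculation in Python. The 13-trillion-token figure comes from the reporting above; the words-per-token ratio and the Times output numbers are loose illustrative assumptions, not reported figures.

```python
# Back-of-envelope scale comparison: a GPT-4-sized training corpus versus a
# rough guess at the Times's lifetime published output.
# All Times figures below are illustrative assumptions, not reported data.

TRAINING_TOKENS = 13e12   # ~13 trillion tokens (reported estimate)
WORDS_PER_TOKEN = 0.75    # common rule of thumb for English text

training_words = TRAINING_TOKENS * WORDS_PER_TOKEN   # ~9.75 trillion words

# Assumed: ~170 years of daily publication, ~150,000 words per issue.
YEARS_PUBLISHING = 170
WORDS_PER_ISSUE = 150_000
times_words = YEARS_PUBLISHING * 365 * WORDS_PER_ISSUE   # ~9.3 billion words

print(f"Training corpus: ~{training_words:.2e} words")
print(f"Times archive (rough guess): ~{times_words:.2e} words")
print(f"Ratio: roughly {training_words / times_words:,.0f}x")
```

Even with a generous guess at the newspaper’s lifetime output, the training corpus comes out roughly a thousand times larger.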
While Times content may be marginally more valuable to OpenAI for training than that of other publishers due to its high quality and broad range of subjects, the sheer amount of data its models are trained on means no single rights owner’s content is so essential that an LLM couldn’t adequately be trained without it. And the idea of paying big bucks for all of it is a non-starter with AI companies.
News and news reporting undeniably have value. They have democratic and cultural value, and they have economic value. Their economic value is established and protected by copyright. Their democratic and cultural value has no such statutory or institutional support, and is thus harder to quantify. To the Times, that value rests on preserving the institutional infrastructure for news reporting and production, however it is configured. But that doesn’t mean it should be left out of the market price for copyrighted news content, at least not in the view of the Times.
Computers, however, work on absolutes. They need numbers to perform their calculations, like the strings of numbers OpenAI’s computers assign to the word tokens they encode. Abstract quantities like the democratic and cultural value of news do not compute. From OpenAI’s point of view, it makes no sense to pay for what it can’t compute.
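As a concrete illustration of that encoding step, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer with the cl100k_base encoding used by GPT-4-era models; the sample sentence is arbitrary.

```python
# Minimal sketch: how text becomes the integer tokens a model computes on.
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "All the news that's fit to print."
tokens = enc.encode(text)

print(tokens)              # a list of integers, one per token
print(len(tokens), "tokens")
print(enc.decode(tokens))  # round-trips back to the original text
```

Each integer indexes an entry in the model’s vocabulary; everything the model subsequently does with a text is computed from those numbers, which is exactly where abstractions like “democratic value” fall out of the picture.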
That fundamental difference in what news content is “worth” to either side will be a challenge to resolve.