Revealing Sources: The News on AI

For news publishers, AI can giveth, and AI can taketh away. On the latter side of the ledger, publishers are in a cold sweat over Google’s “Search Generative Experience” (SGE) product, which the search giant has been testing for the past several months. The tool, trained in part on publishers’ content, uses AI to generate fulsome responses to users’ search queries, rather than merely providing links to websites where answers might be found.

Last week, the Arkansas-based publisher Helena World Chronicle filed a prospective class-action lawsuit against Google, accusing the search giant of anti-competitive practices and specifically citing Search Generative Experience.

According to a study by an internal team at The Atlantic, SGE responses would likely be sufficient to answer the searcher’s query without the need to click through to another web page about 75% of the time. News publishers in general receive about 40% of their online traffic from referrals, according to measurement firm SimilarWeb, by far the largest share of which comes from Google. Losing that traffic to AI-powered search tools could be devastating.

In the other column, some publishers have decided they might as well get with the program. Last week, Axel Springer, one of the largest publishers in Europe and the owner of several major properties in the U.S., including Politico and Business Insider, announced a global deal with OpenAI to allow ChatGPT to provide summaries of select Axel Springer content, including content otherwise residing behind a paywall.

OpenAI will pay Axel Springer for the use of its content under the deal, including its use in “advanced training” of ChatGPT.

“We want to explore the opportunities of AI empowered journalism – to bring quality, societal relevance and the business model of journalism to the next level,” Axel Springer CEO Mathias Döpfner said in the news release announcing the deal.

OpenAI has a similar, albeit more limited, commercial relationship with the Associated Press.

That apparent tension, however, between publishers’ fear of losing readers and revenue to AI’s ability to quickly summarize their content on the one hand, and the prospect of getting paid for the right to enable an AI to summarize their content on the other, may not be quite as tense as it seems. Or at least it need not be.

One of the biggest hurdles confronting rights owners across the board in trying to devise a workable remuneration model for the use of their content by generative AI systems is the problem of provenance. A large language model like GPT is just that: a model of language. It is not, contrary to how many rights owners may perceive it to be, a model of texts, let alone of the expressive elements in texts. The output LLMs generate is entirely and exclusively a function of statistical probabilities derived from the models’ analysis of uncounted billions of examples of language use, whether embodied in formal texts or captured from informal dialog or other formats.

It is not possible, therefore, and never will be, to work backwards from any particular output to discover its source material because no such material ever resides in the model. There is no provenance to account for or to attribute.

Direct licensing, such as Axel Springer’s deal with OpenAI, gets around that problem by essentially waving it away. It’s a mutual agreement to ignore the lack of demonstrable provenance with respect to Axel Springer content for purposes of remuneration.

That sort of deal is only available to select rights owners, however. Axel Springer’s content is valuable to OpenAI because high-quality content is particularly useful for training a model to generate accurate responses. So, too, is content from other first-tier news publishers with global footprints. They, too, could presumably negotiate terms with OpenAI if they were willing.

For publishers without such leverage, however, ignoring the question of provenance could mean writing off the prospect of remuneration.

Ironically, however, marrying a generative AI model to a search engine, as with Google’s Search Generative Experience, could provide publishers with at least one path to avoiding the provenance pitfall by leveraging the sophistication and efficiency of Google’s search algorithm.

SGE uses the same crawler as Google’s regular search engine: Googlebot. While it’s not possible to know how, why or from where a generative AI model came up with a particular output, it is possible, and in fact intentional, to know the results of a search. If an SGE-like product were designed to conduct a search in response to a user’s query, and then feed the results of that search into a generative model to produce a summary of those results, you would at least have a starting point to establish provenance for the purposes of attribution and remuneration.
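To make the idea concrete, here is a minimal sketch in Python of a search-then-summarize pipeline that keeps a record of which publishers fed each answer. Everything in it is an assumption made for illustration: the Document type, the keyword-overlap search, the placeholder summarize step and the example.com publishers are all hypothetical, and none of it describes how Google's SGE or any real product actually works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A crawled page (hypothetical structure, for illustration only)."""
    publisher: str
    url: str
    text: str


def search(query, index, k=2):
    """Rank documents by naive keyword overlap, standing in for a real search engine."""
    terms = set(query.lower().split())
    scored = []
    for doc in index:
        overlap = len(terms & set(doc.text.lower().split()))
        if overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def summarize(query, sources):
    """Placeholder for the generative step; a real system would hand the
    retrieved text to a language model instead of concatenating it."""
    return " ".join(doc.text for doc in sources)


def answer_with_provenance(query, index, k=2):
    """Retrieve first, then generate, and return the sources alongside the answer."""
    sources = search(query, index, k)
    return {
        "answer": summarize(query, sources),
        # The search results are known, so attribution is available by construction.
        "sources": [(doc.publisher, doc.url) for doc in sources],
    }


# Hypothetical index of three invented publishers.
INDEX = [
    Document("Example Gazette", "https://example.com/gazette/levee",
             "city council approves levee repair budget"),
    Document("Example Tribune", "https://example.com/tribune/school",
             "school board delays vote on budget"),
    Document("Example Herald", "https://example.com/herald/fair",
             "county fair opens saturday"),
]

result = answer_with_provenance("levee repair budget", INDEX)
for publisher, url in result["sources"]:
    print(publisher, url)
```

The point of the sketch is the shape of the return value: because the generative step only sees what the search step retrieved, the list of sources can travel with the answer and could, in principle, serve as the basis for attribution or payment.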

Google, of course, like most AI companies, insists that anything it can search it can use to train its models, without license or consideration to any rights owner whose content it uses. And it may have the law, narrowly construed, on its side. So realizing the promise of what researchers have taken to calling retrieval-augmented generation (RAG) might still require a major boost from policymakers.

If that boost were to come, however, it could be a RAG to riches story for publishers.
