The New York Times apparently has withdrawn its support for a nascent effort, otherwise involving several leading news organizations, aimed at filing one or more lawsuits against AI developers for scraping publishers’ archives without permission. Instead, the Paper of Record is reportedly considering unilateral litigation against OpenAI, after negotiations collapsed between the Times and the ChatGPT developer over terms for the use of Times content to train the large language model (LLM) AI system.
The existence of the multi-publisher initiative was revealed by IAC chairman Barry Diller in an interview last month with Semafor’s Ben Smith. Semafor also reported the Times’ withdrawal from the coalition.
Whether or not the Times decides to bring a lawsuit against OpenAI, the Gray Lady took other steps last week to try to limit the unauthorized use of its content to train generative AI models. The paper issued an update to its terms of service purporting to prohibit the use of “robots, spiders, scripts, service, software or any manual or automatic device, tool, or process designed to data mine or scrape” its content, “including, but not limited to text, photographs, images, illustrations, designs, audio clips, video clips… metadata, data, or compilations.” The new terms further prohibit the use of its content for “any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system.”
The Times’s new policy echoes recent steps taken by Reddit to cut off access to its Data API by AI developers looking to scrape its vast archive of user-created content. Per Reddit’s own update to its terms of use issued in June, “Except as expressly permitted by this section, no other rights or licenses are granted or implied, including any right to use User Content for other purposes, such as for training a machine learning or AI model, without the express permission of rightsholders in the applicable User Content.”
OpenAI, meanwhile, appears to be trying to head off any litigation (or, worse yet, regulation) without actually agreeing to stop hoovering up as much web content as it can get its hands on, or conceding any liability for doing so. After releasing a new version of its web crawler, GPTBot, to expand its training dataset in preparation for a new iteration of its GPT foundation model, OpenAI quietly issued documentation on how to prevent the bot from scraping a website by modifying the site’s “robots.txt” file.
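Per OpenAI’s documentation, excluding the crawler takes only two lines in a site’s robots.txt file, addressed to the GPTBot user agent:

```
# Tell OpenAI's GPTBot crawler to stay off the entire site
User-agent: GPTBot
Disallow: /
```

The same mechanism can be narrowed to particular directories with Allow and Disallow directives, but it is only advisory: robots.txt binds nothing, and it restrains only those crawlers that choose to honor it.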
As word of the mod spread, publishers, including CNN, Reuters, and the Australian Broadcasting Corporation, quickly scrambled to protect their sites before the bot could reach them. But the message was clear: It’s up to publishers to opt out if they don’t want their content used to train the next ChatGPT.
Or the next Google Bard. In a submission last month to the Australian government’s review of its AI regulatory framework, and in an accompanying blog post, the search giant alluded to a planned update to its robots.txt standard that would allow publishers to opt out of AI training. As Google told the government in its submission, policymakers should promote “copyright systems that enable appropriate and fair use of copyrighted content to enable the training of AI models in Australia on a broad and diverse range of data, while supporting workable opt-outs for entities that prefer their data not to be trained in using AI systems.”
All sides of the debate over the use of publishers’ content to train generative AI models, then, appear to be converging on an opt-out model for managing such use, albeit for conflicting reasons. Publishers are moving to protect what they see as the value of their content and the integrity of their copyrights by denying access to unauthorized users. AI developers are looking to reinforce their view that what they’re doing does not infringe anyone’s copyright, and to ensure they retain access to any content whose publisher has not taken affirmative steps to prevent it.
Their respective positions look more like a temporary armistice than a formal end to hostilities, however. It’s clear from the statement of principles for AI put out by the News Media Alliance (of which the Times is a member), as well as those of the broader coalition of rights owners behind the Human Artistry Campaign, that publishers want to entrench a licensing system for AI training that puts the burden firmly on the user to secure permission before a publisher’s content is included. But as discussed here in previous posts, and as alluded to last month by the judge in one of the first infringement cases brought against Stability AI, it’s not clear what exclusive authors’ right, if any, is actually being exploited in the process of training large AI models.
The growing reliance on terms of use and API restrictions, then, represents something of a workaround that may not be robust enough to support a generalized licensing system.
AI developers’ position may be even more muddled. On the one hand, they maintain that their use of copyrighted material to train their models falls squarely within the four corners of fair use. But fair use is a limitation on the exclusive right of reproduction, so relying on fair use is, ipso facto, a concession that the works are being reproduced. Yet on the other hand, developers insist the training process does not actually involve reproducing the training material.
For now, both sides seem to be retreating to their respective opt-out bunkers. But the battle is still on.