Deals Hint at Emerging Market for AI Training Data

For all the sturm und drang around the unlicensed use of copyrighted works to train AI models there sure seem to be a lot of licensing deals being discussed all of a sudden, for access to copyrighted works to train AI models. In just the past two weeks, OpenAI has signed licensing deals with Wall Street Journal publisher News Corp., Vox Media and The Atlantic for access to their archives for training, and we’ve seen reports that Meta, Alphabet and OpenAI are all in conversations with the Hollywood studios about licensing movie and television footage.

In the few months before that, we saw deals between OpenAI and the Financial Times, and Reddit sign with both Google and OpenAI. Photo agencies and archives have also been actively striking deals with AI companies over the past few months, including Shutterstock, Photobucket and Flickr.

What gives? Have AI companies suddenly gotten religion that they’re now willing to license what they previously just scraped?

More likely, incentives have simply changed, on both sides of the table. The flood of litigation brought by creators and rights owners against AI companies is one factor behind the seeming change or heart. But so, too, is a shift in ordinary market forces and conditions.

The biggest shift is the depletion of easy fodder for training. The internet has been all but scraped clean of the most easily accessible content. AI companies now need access to content that is behind paywalls and bot blockers, handing a degree of leverage to publishers.

Related to that depletion of scrapable content is a decreasing return on scale. For the first few iterations of models, simply ingesting more training data brought measurable improvement to their quality and capability. But there are signs that sheer scale no longer brings competitively meaningful improvement to models, or that the increase in scale required to make a difference is effectively unachievable.

Achieving meaningful improvements in the usefulness of models is now more sensitive to the quality of training data used. That has handed leverage to rights owners sitting on the choicest caches of content, as evident by the roster of respected mastheads represented in the deals so far, and the absence of more problematic brands.

On the flip side of that coin is a measure of price capitulation on the part of publishers, at least for now. As discussed in previous posts, the current dispute between the New York Times and OpenAI feels very much like a dispute over price rather than a refusal by either party to deal. But all publishers are anxious to see a market develop for access to their data for training and many leading brands are willing to sign deals now for relatively modest sums to prime that development.

As Reddit COO Jen Wong explained of the UGC platform’s initial agreements, “I’d say, these are midterm deals is how we think about them because it’s such a nascent and early market that we want to see how things unfold. So, not forever, but long enough to understand value.”

Another shift has been rights owners’ growing interest in leveraging AI technology for their own purposes. Anxious for access to the latest AI technology and expertise they may be more willing to trade access to their content to get it.

The Hollywood studios are particularly keen to leverage the technology.

“We are very focused on AI. The biggest problem with making films today is the expense,” Sony Pictures CEO Tony Vinciquerra told an investor event last week. “We will be looking at ways to…produce both films for theaters and television in a more efficient way, using AI primarily.”

Photo agencies are also trading content for technology. Shutterstock has developed its own AI image generator using OpenAI’s DALL-E technology but has also struck training deals with Meta, Google, Amazon and Apple. Getty Images and Adobe have also developed their own image generators trained on their own content but relying on technology provided by others.

There have been other shifts also helping drive the trend toward more selective use of content to train models. Greater regulatory scrutiny of how AI companies access and use data in training, including new mandatory disclosure rules regarding training datasets, is clearly a factor. So, too, has growing interest among potential enterprise customers in models that are more narrowly tuned to their particular use cases rather than relying on large, generically trained models.

While encouraging, the market that has emerged thus far is a narrow one. The licensing agreements reached to date have all been direct deals, between individual AI companies and individual rights owners, for access to discreet, more or less finite datasets. It does nothing for the broad swath of creators and rights owners too small to treat directly with large AI companies who might benefit more from the development of collective rights management structures.

But it’s a start.

Share this: