One of the biggest challenges facing copyright owners in grappling with the rapid development of generative AI technology, apart from a murky legal status, has been market failure, as discussed here in previous posts. The amount of existing material needed to train gen-AI models is so great, and so varied, that gauging the value of any one piece of it to establish a market price for licensing purposes is often effectively impossible.
One group of rights owners is finding willing buyers among AI companies, however. An active, albeit for various reasons mostly sotto voce, market has begun to emerge for the use of images held in large photo archives and by photo agencies and social media platforms, complete with per-unit pricing and at least a nod toward creator attribution.
PetaPixel last week published a helpful list of nine owners of photo and video collections known or believed to have licensed their content to one or more AI companies for training purposes. Also last week, Bloomberg reported that Adobe has been paying third parties for image archives to supplement its own stock-photo collection to train its Firefly image generator. According to the report, Adobe is paying between 6 cents and 16 cents per photo and an average of $2.62 per minute of video.
Among the archives making their photos available to AI companies, according to a report earlier this month by Reuters, is Photobucket, once a leading image-hosting platform but now eclipsed by Instagram and other services. While its registered users have fallen from 70 million at its peak to only 2 million today, the company holds roughly 19 billion photos. It told the news agency it has discussed rates with multiple AI companies of between 5 cents and $1.00 per photo and more than $1.00 per video.
Many of the deals have not been publicly disclosed. In some cases, the photo archives involved have not yet worked out whether and how to compensate the creators of the images being licensed. In other cases, the AI companies in question have not wanted to advertise any payments to rights owners while they’re fighting multiple lawsuits over the unlicensed use of copyrighted works. Yet there are commonalities among the deals that hold lessons for other rights owners hoping to establish licensing markets, as well as for courts and policymakers looking to encourage market formation.
- At a time when AI companies are increasingly desperate for more high-quality content to train their next-generation models, the archives owned by photo agencies and image-hosting platforms include many images captured by professional photographers and skilled hobbyists, making them particularly valuable for training. Moreover, in some cases, as with Photobucket, large portions of the archives are no longer easily available online, making scraping difficult. In other cases, as with Meta, Shutterstock and Getty Images, platform owners plan to use their photo archives to train their own image generators and are unwilling to leave them open for unpaid scraping by third parties. For better or worse, depending on your perspective, that points to a future with more content moved behind paywalls and greater reliance on APIs to control access.
- The archives, though large, are finite and their content is enumerable, making it suitable for per-unit pricing. In the case of photo agencies and hosting platforms with registered users, moreover, the provenance of the images in the archive is traceable, their metadata is included, and their creators are generally identified or identifiable. That creates at least a foundation for compensating the creators of works that are licensed for use in AI training. It also highlights the critical importance of comprehensive and consistent metadata to any plausibly viable licensing market for AI training data.
- Knowledge is power. In most of the deals known to have been done so far, the archive has at least disclosed or intends to disclose its licensing plans to contributors, even if only by a unilateral change to its terms of service. In some of those cases, the archive is also allowing, or has plans to allow, contributors to opt out of having their images included in the licensed datasets. The training disclosure requirement in the EU AI Act and the Generative AI Disclosure Act in the U.S., if enacted, could encourage a similar dynamic to emerge in other creative sectors.
- None of the licensing deals known or believed to have been done to date appears to be exclusive with any one AI company. Collectively, however, they have involved a fairly small set of known or suspected buyers. As noted here before, that has already raised eyebrows at the Federal Trade Commission and prompted at least one formal investigation into possible unfair trade practices, involving Reddit and one or more AI companies. As much as rights owners and many policymakers would like to see a licensing market develop around AI training data, the contours and conduct of that market are likely to draw further scrutiny from competition watchdogs in both the U.S. and EU.
Notably, the nascent market in photo archives for AI training has emerged before courts have had a chance to resolve the copyright controversies that have arisen around AI training. Many of the buyers and sellers engaging in licensing deals, in fact, are themselves involved in some of the most contentious of those legal disputes.
Resolution of those disputes, or changes to the law, will no doubt shape how such markets evolve in the future. But it was competitive pressures and existing business arrangements that provided the spark.