Generative AI Scales Up

The Ides of March this year fell in the middle of what turned out to be another busy week on the generative AI front. And, while perhaps not holding quite the same dire portent as for Caesar, the fallout from the events around that date could prove dramatic nonetheless. On Tuesday, OpenAI released its Generative Pre-trained Transformer version four (GPT-4) and an enhanced, GPT-4-powered edition of its ChatGPT bot. On Wednesday, the U.S. Copyright Office issued a new policy statement regarding the registration of works created in whole or in part by generative AI tools and announced plans to launch a formal inquiry into “a wide range” of issues arising from generative AI.

And on Friday, a new coalition of more than 30 creator and rights owner organizations, calling itself the Human Artistry Campaign, unveiled a set of seven “core principles” it would like to see guide the development of generative AI technology, and in particular the use of copyrighted works in training AI models.

And those were just the biggest headlines.

Given how long it generally takes to get 30 or more signoffs on a press release, it’s a fair bet the rollout of the HAC has been in the works for some time and that the unveiling of its seven principles last week was not directly tied to the GPT-4 announcement. But that’s not to say the announcements were unrelated.

After some anodyne throat clearing in the first two items on its list, the HAC statement takes direct aim at large, pre-trained models like GPT-4:


We fully recognize the immense potential of AI to push the boundaries for knowledge and scientific progress. However, as with predecessor technologies, the use of copyrighted works requires permission from the copyright owner. AI must be subject to free-market licensing for the use of works in the development and training of AI models. Creators and copyright owners must retain exclusive control over determining how their content is used. AI developers must ensure any content used for training purposes is approved and licensed from the copyright owner, including content previously used by any pre-trained AIs they may adopt [emphasis added]

As discussed here in previous posts, the question of whether the training of an AI model actually implicates any licensable right as defined in the Copyright Act or case law is an open one. But even if courts were to determine that it does, the GPT-4 announcement illustrates another major challenge confronting creators and rights owners: the sheer scale at which any such licensing system would need to operate to be equitable.

Models like OpenAI’s GPT tool are called Large Language Models because the amount of training data they process is immense, and growing by orders of magnitude with each iteration.

GPT-3, for instance, is said to have processed around 45 Terabytes of textual data, roughly the entire textual content of the public web. That huge cache was compiled not by selecting the material to be included but by use of an automated spider to crawl the web and scrape whatever it found, the same way Google crawls basically every public website in order to index them for its search engine.

Even that volume of data, however, is dwarfed by what went into GPT-4, which was trained on both text and images. The total amount of data processed in its training is said to be 571 times larger than that of its predecessor. For comparison, that multiple is about 51 times the roughly 11-to-1 ratio of Jupiter’s diameter to the Earth’s.

[Figure: Estimated number of items processed]

OpenAI says the data for GPT-4 came from “a variety of licensed, created, and publicly available data sources, which may include publicly available personal information.” Given how much data we’re talking about, it’s fair to assume the amount of “publicly available” information scraped from the open web is likely many times larger than any amount of licensed content that may have been included.

The creative industries have dealt with the challenges of licensing large bodies of works before, of course. The music industry developed its collective management and blanket licensing system; many photo agencies offer subscription access and re-use rights to their full collections. But as far from fool-proof as even those systems are, the amount of content being ingested by so-called foundation AI models like GPT-4 is almost unimaginably larger, much of it lacking any organized registration of ownership or provenance. And there’s no reason to think that difference in scale won’t continue to grow.

Given that scale, and the difficulty, if not impossibility, of attributing any particular piece of generative AI output to any particular input source or sources, it will take a feat of imagination on the part of many stakeholders to devise an equitable licensing system.

One possible approach might be a blanket prohibition on the use of copyrighted works in AI training without express permission from the rights owners. But given the ambiguity around the application of the existing copyright statute to the realities of machine learning, it might take an act of Congress to impose such a prohibition. And, given the geopolitical competitive concerns around the development of AI technology, persuading Congress to impose a potentially significant constraint on research and development of the technology would be a heavy lift.

Another possibility would be an opt-out system for rights owners, in which AI developers were required to recognize and respond to some sort of flag, along the lines of the robots.txt flag, preventing their content from being scraped and fed into an AI without permission. That would put the burden on rights owners, however, to which many would likely object. It’s also unclear that the content owned by any given publisher, if excluded from training, would meaningfully affect the quality of the generative model, leaving publishers with little leverage against AI developers.
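To make the robots.txt analogy concrete, here is a minimal sketch in Python of how a crawler that chooses to honor such a flag might check it before scraping a page. The crawler names (ExampleAIBot, ExampleSearchBot) and the publisher URL are hypothetical, invented purely for illustration. Note, too, that robots.txt is a voluntary convention; nothing in this sketch enforces compliance, which is why the scenario above imagines developers being required to respect the flag.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: it opts out of one
# (invented) AI training crawler while leaving all other crawlers alone.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical article URL on the publisher's site.
url = "https://publisher.example/articles/some-story"
print(may_crawl(ROBOTS_TXT, "ExampleAIBot", url))      # False: opted out
print(may_crawl(ROBOTS_TXT, "ExampleSearchBot", url))  # True: still allowed
```

The check itself is trivial. The hard part is the one described above: each rights owner has to publish and maintain the flag, and each developer has to look for it and respect it.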

As for the Copyright Office’s statement, it focused primarily on whether works made by or with the aid of generative AI systems are eligible for copyright protection, rather than on policy issues related to training data. But the two sets of issues are not so easily disentangled.

In discussing how it will apply the human authorship requirement for registration, the policy statement notes:

In the case of works containing AI-generated material, the Office will consider whether the AI contributions are the result of “mechanical reproduction” or instead of an author’s “own original mental conception, to which [the author] gave visible form.” The answer will depend on the circumstances, particularly how the AI tool operates and how it was used to create the final work. This is necessarily a case-by-case inquiry.

How an AI tool operates, of course, depends on how you interpret what happens in the training of the tool. And conducting a case-by-case inquiry is likely to quickly run up against the same challenges of scale as developing a licensing system for AI training.

In short, actually implementing anything like the policy demands made by the Human Artistry Campaign is going to take a lot more hard thinking about how generative AI tools are actually developed, the nature and substance of human vs. machine creativity, and the broader public policy issues at stake than we got last week.
