The U.S. Copyright Office last week held the first of its planned series of “listening sessions” on artificial intelligence and copyright, which focused on AI and literary works. It featured representatives from authors’ groups and speakers from the technology and academic worlds talking past each other for three hours.
I highly recommend watching the replay when it’s posted on the Copyright Office website.
I say that not to be facetious. Rather, it’s because the communication breakdown on display was itself illustrative of where the policy debate around AI and copyright currently stands: We lack a critical vocabulary, let alone a shared one, for the debate we’re trying to have.
Case in point: Matthew Sag, a professor of law and artificial intelligence at Emory University, at one point in the discussion, referenced the Google Books case (Authors Guild v. Google, 2nd Cir. 2015) as precedent for permitting the use of copyrighted works in training AI models as a fair use under §107 of the Copyright Act.
“This technology is new and exciting, but many of the legal issues are not,” Sag said. “The test for infringement is copying in fact and substantial similarity, and that remains the same, no matter how a work is created. The copying required to collect the training data for these large language models is a classic form of non-expressive use that was upheld as fair use in Google Books, and of course, HathiTrust (Authors Guild v. HathiTrust, 2nd Cir. 2014).”
While the Google Books case has often been cited by AI proponents, Copyright Alliance CEO Keith Kupferschmid was having none of it.
“Let me address the Google Books case and some of these other cases that we’re talking about, because I think if there’s one thing the Copyright Office takes away from this listening session, it should be this: The Google Books case could not be more different from what we have going on here,” he said. “Google did not copy books to make new books. That’s what AI does. They’re copying it, works of expression and copying copyright works, to make new copyrighted works that compete with the works that they are copying. In Google Books, Google used the works for informational purposes. They use it for the information in the works, not the expressive content of the works. That is exactly what AI is doing. They’re using the expressive content to produce new works.”
Leaving aside that this may be the first time I’ve heard a representative of copyright owners say anything positive about Google Books, at least in public, Kupferschmid, without intending to, actually makes the case for generative AI.
What a large language model does with its training data is precisely what he describes Google Books as doing: It extracts non-expressive information about the works, not their expressive content, and uses that data to construct a mathematical model of how and where words appear in texts. Whatever copying occurs happens only when the model loads a text into RAM in order to analyze it, just as your laptop copies your browser application into RAM when you surf the web. The text itself is not retained.
That sort of transient, operationally necessary copying has long been held to be a fair use, so long as the temporary copy in RAM is not made permanent.
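The distinction between retaining a text and retaining statistics about a text can be made concrete with a toy sketch. This is not how production LLMs actually work (they train billions of neural-network parameters, not word-pair counts), and the function name here is purely illustrative, but it captures the point: the text exists in memory only while it is being analyzed, and what survives is aggregate, non-expressive data.

```python
from collections import Counter

def extract_statistics(text):
    """Tally adjacent word-pair frequencies from a text.

    The text is held in memory only for the duration of this
    function; what is returned (and retained) is aggregate
    statistical data, not the expressive content itself.
    """
    words = text.lower().split()
    pair_counts = Counter(zip(words, words[1:]))
    # `text` and `words` go out of scope when the function
    # returns -- the transient in-memory copy is discarded.
    return pair_counts

stats = extract_statistics("the cat sat on the mat")
print(stats[("the", "cat")])  # 1
print(stats[("on", "the")])   # 1
```

From the returned counts alone, the original sentence cannot be reconstructed with certainty; that gap between the statistics and the source text is the essence of the “non-expressive use” argument.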
To rest the case on that, however, is to elevate form over substance. A generative AI model may not literally be copying its training data as a technical matter. But clearly, the data is adding value to the model, and right now we lack the statutory or common-law precepts to put an equitable price on that added value.
So if not copying, with all that implies from a liability perspective, what is it? Right now, the debate is hung up on the black-letter language of the law and precedent, as happens when you put a lot of lawyers in a (virtual) room. But we’re in a very gray area.
Two other broad themes that emerged during last week’s discussion:
1) Several speakers suggested or proposed a collective licensing regime for the use of copyrighted works in training AI models. Others referenced the licensing systems used for text and data mining (TDM) applications as a model. But those suggestions may not have adequately reckoned with the scale of the challenge.
OpenAI’s GPT-3 model was trained on roughly 45 terabytes of data, equivalent to nearly the entire textual content of the publicly accessible World Wide Web, far larger than the datasets used in most TDM programs. And GPT-4, the latest iteration, was reportedly trained on more than 500 times the volume of data as its predecessor. Establishing a collective licensing system that could reliably and equitably distribute royalties from use on that scale would be quite a feat.
Any collective licensing system for AI, moreover, would quickly run up against the problem of attribution. Looking only at the output of a generative AI model, one might conclude that it is clearly based on or derived from an identifiable, preexisting work or works. But that apparent similarity simply reflects the predictive power of a large language model constructed from an incomprehensibly large number of parameters — 175 billion in the case of GPT-3 — derived from its training. It does not reflect a repurposing of the expressive content of any preexisting work or works.
Generative AI does not work from discrete sources. It works from a probabilistic model of language or images from which new works are algorithmically generated. It is not possible to establish, for the purpose of attribution, that 1.7% of a generated work came from Input A, and 2.4% came from Input B.
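The attribution problem can be illustrated with a miniature probabilistic model. Again, this is a deliberately simplified sketch, not the architecture of any real system: once observations from many training texts are pooled into shared counts, the model carries no record of which source contributed which observation, so no generated word can be traced back to a particular input.

```python
import random
from collections import Counter, defaultdict

def train(texts):
    """Pool word-pair counts across all training texts.

    The merged counts retain no per-source bookkeeping --
    every text's contributions are blended together.
    """
    model = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def generate(model, start, length, seed=0):
    """Sample a new word sequence from the pooled statistics."""
    rng = random.Random(seed)
    word, out = start, [start]
    for _ in range(length):
        followers = model.get(word)
        if not followers:
            break  # dead end: no observed continuation
        word = rng.choices(list(followers),
                           weights=followers.values())[0]
        out.append(word)
    return " ".join(out)

model = train(["the cat sat on the mat",
               "the dog sat on the rug"])
print(generate(model, "the", 5))
```

Each step of the generated output draws on counts that both training sentences contributed to; asking what fraction of the result “came from” either one has no well-defined answer, which is the attribution problem in miniature.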
Any sort of blanket licensing system would need to figure out a new way to allocate the fees it collects.
2) There was one broad area of consensus among the participants in last week’s session. It was generally agreed that the guidance issued by the Copyright Office last month on registering works created in whole or in part by generative AI models, while helpful, is not really workable in its current form.
In particular, it was generally recognized that generative AI technology is becoming so deeply embedded in the workflows of authors and creators of all types that the line between those elements of a work created by a human and those created by machine will be increasingly difficult to draw. The Copyright Office’s proposed case-by-case parsing of works, or a requirement that registrants proactively identify which elements in a submission were created by a human, and therefore eligible for copyright, and which by machine, and therefore ineligible, could quickly devolve into a free-for-all of appeals and litigation.
It would also raise a host of new questions regarding the licensing and re-use of hybrid works that would likely be disruptive to existing licensing markets.
The next session, on visual arts, is scheduled for May 2nd. The full schedule of upcoming listening sessions can be found on the Copyright Office website.
UPDATE: The Copyright Office has posted the list of designated speakers for its May 2nd visual arts listening session on AI and copyright.