Finetuning AI Copyright Infringement Claims

Stop me if you’ve heard this one, but a group of authors has filed a prospective class action lawsuit against the developer of a generative AI model alleging copyright infringement. Filed Friday (March 8) in the Northern District of California, the suit targets Nvidia, the chipmaker whose GPUs are widely used in data centers to handle the massive computing work required to train and run generative AI models, but which also provides its own large language models (LLMs) as part of its NeMo Megatron AI development toolkit.

The complaint names three plaintiffs, authors Abdi Nazemian, Brian Keene and Stewart O’Nan, but seeks money damages on behalf of “All persons or entities domiciled in the United States that own a United States copyright in any work” used in training the Nvidia LLM, known as NeMo Megatron.

If it sounds familiar, it’s because the pair of lawyers who filed the complaint, Joseph R. Saveri and Matthew Butterick, also filed the Sarah Silverman et al. lawsuit against OpenAI, likewise alleging copyright infringement, in the same court last year.

Nvidia describes NeMo Megatron as “an end-to-end platform that delivers high training efficiency across thousands of GPUs and makes it practical for enterprises to deploy large-scale [Natural Language Processing]. It provides capabilities to curate training data, train large-scale models up to trillions of parameters and deploy them in inference.”

Released in late 2022, the platform includes a family of LLMs comprising NeMo Megatron-GPT 1.3B, GPT 5B and GPT 20B, along with NeMo Megatron-T5 3B, which are hosted on the Hugging Face website. The “B” in each name refers to the number of parameters, in billions, the model contains.
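For readers who want to poke at the models themselves, here is a minimal Python sketch, using the huggingface_hub client, that lists the checkpoint files in the relevant Hugging Face repositories. The repo IDs are my best guess, inferred from the model names above rather than taken from the complaint, and the repositories may since have been renamed, gated or removed.

```python
# Hypothetical sketch: locate the NeMo Megatron checkpoints on the Hugging Face Hub.
# The repo IDs below are inferred from the model names cited in the article and
# are assumptions; verify them on the Hub before relying on them.
from huggingface_hub import HfApi

api = HfApi()

candidate_repos = [
    "nvidia/nemo-megatron-gpt-1.3B",
    "nvidia/nemo-megatron-gpt-5B",
    "nvidia/nemo-megatron-gpt-20B",
    "nvidia/nemo-megatron-t5-3B",
]

for repo_id in candidate_repos:
    try:
        files = api.list_repo_files(repo_id)
        # NeMo models are typically distributed as .nemo archives.
        nemo_files = [f for f in files if f.endswith(".nemo")]
        print(repo_id, "->", nemo_files or files)
    except Exception as exc:  # repo may be gated, renamed, or removed
        print(repo_id, "-> not accessible:", exc)
```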

According to the complaint, the NeMo LLMs relied for their training data on The Pile, an 825 gigabyte amalgam of diverse data sets compiled by EleutherAI and designed for training language models. Among the data sets it contained at the time it was compiled was the notorious Books3 collection, which is alleged to include numerous pirated texts among its 196,640 books, the plaintiffs’ works among them.
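To put those numbers in rough perspective, here is a quick back-of-envelope calculation in Python, using only the round figures reported here and in the complaint excerpt below (825 GB for The Pile, 108 GB for Books3, 196,640 titles); it is illustrative arithmetic, not a measurement of the actual data.

```python
# Back-of-envelope check using figures cited in the article and the complaint;
# illustrative only, not measured from the data itself.
PILE_GB = 825            # total size of The Pile, per the EleutherAI paper
BOOKS3_GB = 108          # Books3 size as quoted in the complaint
BOOKS3_TITLES = 196_640  # number of books reported in Books3

share = BOOKS3_GB / PILE_GB
avg_mb_per_title = BOOKS3_GB * 1024 / BOOKS3_TITLES

print(f"Books3 share of The Pile: {share:.1%}")                     # ~13.1%
print(f"Average size per Books3 title: {avg_mb_per_title:.2f} MB")  # ~0.56 MB
```

That lands in the same ballpark as the “approximately 12%” figure the complaint quotes from the EleutherAI Paper, and works out to roughly half a megabyte of plain text per title, about the size of a full-length book.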

In a sign that we may be moving beyond the spaghetti-at-the-wall phase of AI copyright lawsuits, however, the Nvidia lawsuit is more narrowly focused than the more than two dozen previous such cases.

The initial batch of lawsuits filed against AI systems floated a cornucopia of copyright infringement theories — direct infringement, contributory infringement, vicarious infringement, that’s-no-fair infringement, etc. — often accompanied by various trademark and unfair competition claims, and a variety of state law torts.

By contrast, the complaint against Nvidia weighs in at a lean 10 pages, tightly focused on a single count of direct copyright infringement arising from the use of the plaintiffs’ works to train the Megatron models.

Like an AI model itself, plaintiffs and their lawyers in the case seem to be refining the weights in their legal strategy based on previous inputs. The Silverman case, for instance, like many of the other lawsuits brought against OpenAI, Stability AI, Meta, Midjourney and others, has seen most of the charges dismissed, as courts have whittled the cases down primarily to the question of direct infringement during training.

The Nvidia complaint is also more refined than its predecessors in how it frames the allegations. Here’s how it describes the allegedly infringing conduct:

24. The Pile is a training dataset curated by a research organization called EleutherAI. In December 2020, EleutherAI introduced this dataset in a paper called “The Pile: An 800GB Dataset of Diverse Text for Language Modeling” (the “EleutherAI Paper”).

25. According to the EleutherAI Paper, one of the components of The Pile is a collection of books called Books3. The EleutherAI Paper reveals that the Books3 dataset comprises 108 gigabytes of data, or approximately 12% of the dataset, making it the third largest component of The Pile by size…

30. Until October 2023, the Books3 dataset was available from Hugging Face. At that time, the Books3 dataset was removed with a message that it “is defunct and no longer accessible due to reported copyright infringement.”

31. In sum, NVIDIA has admitted training its NeMo Megatron models on a copy of The Pile dataset. Therefore, NVIDIA necessarily also trained its NeMo Megatron models on a copy of Books3, because Books3 is part of The Pile. Certain books written by Plaintiffs are part of Books3—including the Infringed Works—and thus NVIDIA necessarily trained its NeMo Megatron models on one or more copies of the Infringed Works, thereby directly infringing the copyrights of the Plaintiffs.

In many of the earlier cases, courts have dismissed most of the non-direct infringement claims as a matter of law for failing to allege a plausible theory of infringement. Here, the plaintiffs attempt to connect a series of documented dots leading directly from the defendant to an act of direct infringement.

I will leave it to actual lawyers to pass on whether that attempt is likely to meet the bar for properly and adequately pleading an infringement by the defendant, or whether it, too, will be barred as a matter of law. If the court finds the pleading sufficient, however, dispensing with extraneous charges would also have the advantage of allowing the plaintiffs (and the court) to skip over a lot of pre-trial maneuvering and motions to dismiss, and to more quickly get the critical question before a jury: Did Nvidia directly infringe the plaintiffs’ copyrights by knowingly relying on pirated copies of their works to train the NeMo Megatron LLMs?

It’s the question on which all other theories of infringement are likely to turn, whether they involve the output of a generative model, the apportionment of liability among the developer, marketer and end user of the model, or whether the models themselves are infringing: Is what happens during the training of a generative AI model a copyright infringement?

Worth keeping an eye on this one.
