As it did with data protection and privacy through its General Data Protection Regulation (GDPR), the European Union is again moving to establish a de facto global standard, this time for the design and operation of generative artificial intelligence models and applications. Votes by two key committees of the European Parliament last week put the E.U.'s long-gestating AI Act on a glide path to adoption by the full legislature at a plenary session scheduled for mid-June. Once adopted, the proposed law would move to the final stage of the legislative process: three-way negotiations, known as a trilogue, among the Parliament, the E.U. Commission and the E.U. Council.
From the start, the AI Act focused primarily on data protection, privacy issues and risk factors around the use of artificial intelligence. But the version now poised for adoption was revised in response to the explosive growth of generative AI technology to include provisions addressing the training of generative AI models and the handling of their output.
Those provisions do not explicitly refer to the intellectual property questions that have roiled the creative industries over generative AI. But they overlap with many aspects of the U.S. discussion around transparency into AI training datasets, and they could make themselves felt in the operations of U.S.-based AI developers, much as GDPR did for U.S.-based website publishers.
The AI Act distinguishes among three categories, imposing different requirements on each: "foundation" models, "general purpose" models and "providers."
The most stringent requirements apply to so-called foundation models, defined as “an AI model that is trained on broad data at scale, is designed for generality of output, and can be adapted to a wide range of distinctive tasks.”
Developers of AI models that fall under that category, including GPT and Stable Diffusion, would be required to publish detailed summaries of any and all copyrighted works used to train the models to generate text, images, video or music that "resembles human work." They would also be required to disclose that works generated by the model were produced by AI.
General purpose models, defined as “AI system[s] that can be used in and adapted to a wide range of applications for which [they were] not intentionally and specifically designed,” would not be directly regulated in the same manner as foundation models under the plan. But general purpose systems would have to support downstream operators’ compliance with disclosure requirements by providing all the relevant information and documentation on the AI model.
Those downstream operators are classified as “providers” of AI, a category that would cover most application-layer developers, and defined as “a natural or legal person, public authority, agency or other body that develops an AI system or that has an AI system developed with a view to placing it on the market or putting it into service under its own name or trademark, whether for payment or free of charge.”
If the AI Act becomes law, it would be the first substantial regulatory regime for artificial intelligence in a major Western economy, and among the first to specifically address the design and operation of generative AI systems. But its provisions have echoes outside the E.U.
In the first two of the U.S. Copyright Office’s four scheduled listening sessions on AI and copyright (see here and here), transparency into AI training datasets was a major theme of the discussions (the third in the series, dealing with audiovisual works, is scheduled for Wednesday, May 17).
The Human Artistry Campaign, a coalition of more than three dozen artist and rights owner groups, listed "complete recordkeeping of copyrighted works, performances, and likenesses" used in training AI models among its seven core principles for developing artificial intelligence applications.
While some U.S. AI developers have disclosed at least some of the sources of their training data, others, including OpenAI and Stability.ai, have been more coy, presumably to avoid alerting potential litigants. Some of their sources have leaked out, however. OpenAI's GPT models, for instance, were revealed to have been trained on the open-source Common Crawl dataset, as well as a few other open resources, such as Wikipedia and Reddit archives. But its training data also included repositories identified only as Books1 and Books2, without any further information.
Stability.ai’s Stable Diffusion model was trained primarily on the LAION repository of images mostly scraped from the internet, but likely on other archives as well.
In testimony before the Senate Judiciary Committee this week, OpenAI CEO Sam Altman agreed on the need for AI regulation, primarily around risk factors, privacy and disinformation concerns. But he did not venture into the issues of transparency and disclosure of training materials.
Disclosing, or even merely summarizing, all of the potentially copyrighted material in the petabytes of data in the archives used for training will be a monumental task. But it is viewed by many artists and rights owners, including members of the Human Artistry coalition, as an essential first step toward establishing some sort of licensing system for the use of that material in training AI models.
The requirement in the AI Act that works produced by AI be disclosed as such could also become a factor in the debate around the Copyright Office’s recent guidance on registering works created partly by humans and partly by AI.
As was the case with privacy regulation, U.S. policymakers are far behind their E.U. counterparts on regulating AI. Moreover, while Congress and the White House have begun discussing various approaches, any new rules on this side of the Atlantic are likely to be piecemeal rather than comprehensive, and slow in coming.
“A.I. is developing so quickly, and we are behind the curve. My approach and my expectation is that we will find smaller pieces to get done this year, and then build on it next year,” Rep. Don Beyer (D-Va.) told The New Republic. While the E.U. approach may go “too far,” he said, “it’s fun to look and see, what can we copy from it that would be accessible in this culture, and these politics.”
Once disclosure of training material is mandated in the E.U., however, U.S. Congressional attitudes may not matter very much. Disclosure in Europe is effectively disclosure everywhere. And once armed with the information, the clamor from rights owners to create a licensing system for the use of copyrighted works to train AI here, whether by legislation or simply as a condition of AI developers’ compliance with E.U. rules, could drown out American squeamishness.