European Union officials made incremental progress last week in negotiations over the final text of the EU AI Act, which could establish a de facto global benchmark for regulating artificial intelligence technology and companies, according to local reports. Known as a trilogue, the negotiations among the EU Parliament, Commission and Council of Ministers are the final legislative stage before formal adoption of the framework, which will then be sent to EU member states for implementation in national laws. A fifth and potentially final round of discussions is set for December, but the trilogue could continue if final agreement is not reached.
Among the issues that do appear settled, according to the reports, is the adoption of transparency rules for so-called foundation generative AI models like GPT, DALL-E, LLaMA and Stable Diffusion, which are trained on huge amounts of text and images mostly scraped from the internet. The rules would require their developers to document and disclose details of the modeling and training process, including the contents of training datasets. Negotiators have apparently agreed on a tiered approach to the regulation, in which all foundation and general-purpose models would be subject to basic transparency rules while certain “very capable” models would face additional requirements. Discussions are still ongoing over how much detail must be disclosed about training datasets and the precise benchmarks that define the tiers.
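For illustration only: the final text has not been published and the Act does not prescribe a disclosure format, so the short Python sketch below simply shows what a machine-readable training-data disclosure of the kind negotiators are discussing might look like. Every field name and value in it is an assumption made for the example, not a requirement drawn from the Act.

# Hypothetical sketch of a machine-readable training-data disclosure.
# No field here comes from the AI Act; the schema is invented for illustration.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DatasetDisclosure:
    name: str                         # dataset identifier
    source: str                       # where the data was obtained
    license: str                      # license terms, if known
    contains_copyrighted_works: bool  # whether copyrighted material is included

@dataclass
class ModelDisclosure:
    model_name: str
    tier: str                         # e.g. basic vs. "very capable" (benchmarks still under negotiation)
    training_datasets: list[DatasetDisclosure] = field(default_factory=list)

disclosure = ModelDisclosure(
    model_name="example-foundation-model",
    tier="basic",
    training_datasets=[
        DatasetDisclosure(
            name="web-crawl-2023",
            source="public web crawl",
            license="unknown",
            contains_copyrighted_works=True,
        )
    ],
)

# Emit the disclosure as JSON, one plausible interchange format.
print(json.dumps(asdict(disclosure), indent=2))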
Artists and rights owners on both sides of the Atlantic have been demanding such transparency requirements in comments to regulators and policymakers. If adopted, the disclosure requirements in the AI Act could go a long way toward satisfying those demands, as disclosure in Europe is effectively disclosure everywhere.
According to a recent study by researchers at Stanford, however, AI companies have lately been getting less transparent about most aspects of their foundation models, not more. On the study’s 100-point index, based on the researchers’ evaluation of 100 different aspects of transparency, the highest score was 54 and the lowest was 12.
Although not addressed by the AI Act, many datasets commonly used to fine-tune foundation models for specific tasks, particularly open-source aggregations widely available on sites like GitHub and Hugging Face, also fall short on transparency. A recent audit of 1,800 fine-tuning datasets conducted by researchers behind the Data Provenance Initiative found that 70% did not specify what license terms applied to the data. Many also contained data that had been mislabeled with more permissive license terms than the original creators intended, such as allowing commercial use of data originally compiled for academic research purposes.
The researchers, drawn from a wide range of academic and industry organizations, have developed an online tool called the Data Provenance Explorer that allows anyone to trace the sources of the data, as best they could be determined, for all 1,800 audited datasets.
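To give a concrete sense of what such an audit involves, here is a minimal Python sketch, assuming a simple ranking of common licenses by permissiveness. It is not the Data Provenance Initiative’s tooling; the dataset records and license ordering are hypothetical, and the check it performs, flagging missing licenses and declared licenses more permissive than the original source’s, only mirrors the kind of comparison the researchers made at scale.

# Illustrative sketch (not the Data Provenance Initiative's code): flag datasets
# whose declared license is missing or more permissive than the source license.

# Rough, assumed ordering of a few common licenses from least to most permissive.
PERMISSIVENESS = {
    "unspecified": 0,
    "cc-by-nc-4.0": 1,   # non-commercial
    "cc-by-sa-4.0": 2,
    "cc-by-4.0": 3,
    "mit": 4,
}

def audit(dataset):
    """Return a list of problems found in one dataset record (a dict)."""
    problems = []
    declared = dataset.get("declared_license")
    original = dataset.get("source_license")

    if not declared:
        problems.append("no license specified")
    elif original and PERMISSIVENESS.get(declared, 0) > PERMISSIVENESS.get(original, 0):
        problems.append(
            f"declared license ({declared}) is more permissive than the source ({original})"
        )
    return problems

# Hypothetical records, loosely modeled on the audit's two headline findings.
datasets = [
    {"name": "qa-pairs-v1", "declared_license": None, "source_license": "cc-by-nc-4.0"},
    {"name": "dialogue-set", "declared_license": "mit", "source_license": "cc-by-nc-4.0"},
]

for d in datasets:
    for problem in audit(d):
        print(f"{d['name']}: {problem}")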
Many artists and rights owners also see full disclosure of training datasets as a critical step toward establishing clear liability for the unlicensed use of their works to train AI models and the foundation for an eventual licensing system. Evading that liability is also likely one reason AI companies are getting more reluctant to disclose how they’re training their models.
Under the AI Act, AI models would also be required to comply with the terms of the EU Copyright Directive, which generally permit the use of publicly available data for purposes of research and analysis, unless the rights owner has requested their work not be included.
While that could make opt-out the de facto international standard for the use of copyrighted works in AI training, the AI Act itself is silent on how such a system should be managed, how rights owners should assert their opt-out rights, or how compliance by AI companies should be enforced.
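One frequently discussed possibility, not prescribed by the Act or the Copyright Directive, is a machine-readable reservation that training crawlers check before ingesting a page, much as search crawlers consult robots.txt today. The Python sketch below illustrates that idea under the assumption that a robots.txt rule is the opt-out signal; the crawler name and the fallback behavior are invented for the example.

# Illustrative sketch of one possible opt-out check, assuming a robots.txt-style
# convention. Nothing here is prescribed by the AI Act or the Copyright Directive;
# the crawler name is hypothetical.
from urllib import robotparser
from urllib.parse import urlsplit

def may_use_for_training(page_url: str, crawler_name: str = "ExampleTrainingBot") -> bool:
    """Return True only if the site's robots.txt does not disallow this crawler."""
    parts = urlsplit(page_url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    try:
        rp.read()
    except OSError:
        # If the opt-out signal cannot be retrieved, the cautious choice under an
        # opt-out regime is unsettled; this sketch simply skips the page.
        return False
    return rp.can_fetch(crawler_name, page_url)

if __name__ == "__main__":
    print(may_use_for_training("https://example.com/article"))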
If those details are left to individual countries’ implementations of the AI Act in national law, rights owners could be left to navigate a patchwork of varying opt-out systems, and AI companies could face higher compliance costs.
That can only complicate efforts to establish a comprehensive licensing system for AI training, no matter how transparent the process.