We knew that OpenAI, Google and Meta relied on copyrighted material to train their generative AI models. The companies themselves have acknowledged as much by raising a fair use defense in the myriad lawsuits brought against them by copyright owners, including in the New York Times Co.’s copyright infringement lawsuit against OpenAI and Microsoft.
We also know that AI developers are increasingly desperate for new sources of high-quality data to train on as they rapidly exhaust the published contents of the World Wide Web, and are pushing the envelope in the pursuit of untapped resources.
So I’m not sure the revelation, in the Times’ big 3,200-word, five-reporter “examination” published over the weekend, that employees at the three companies internally “debated” or “discussed” the legal and ethical niceties involved quite cuts it as news. (I’m not a lawyer, but I also wonder how the story and its sourcing might play in discovery in the Times Co. suit against OpenAI, should the case get that far.)
The only identifiable sources referenced in the piece, moreover, are some internal messages the Times “viewed,” some recordings of internal discussions it “obtained,” and some pro forma, unrevealing statements from the company spokespeople. The only direct quotes from any senior executives are culled from earnings calls and industry conferences. The only named sources who appear to have spoken directly to the Times are a couple of unaffiliated lawyers, a Johns Hopkins University professor, a former researcher at OpenAI now at Anthropic, and Justine Bateman.
There is nothing inherently wrong with relying on unnamed sources, especially in investigative pieces where going on the record poses a real risk to those sources. But it’s not clear from the reporting how close the “people with knowledge” of conversations referenced in the piece were to the actual conversations. And the resort to months-old quotes from public forums suggests the porridge the five reporters on the story were able to serve up came out thinner than the recipe promised.
All that said, the piece is still useful for putting a spotlight on the increasingly contentious debate over data transparency, which is likely to play a central role in the legal and legislative sparring over AI over the next year.
Rights owners have been clamoring, in congressional hearings and other official forums, for greater transparency into the datasets AI companies use to train their models. Those demands are also likely to feature prominently in the discovery phase of any lawsuits against AI companies that get that far.
For their part, AI companies have grown increasingly parsimonious with the information they release about their training data, precisely because it has been the focus of much of the litigation they are facing.
Over the rest of 2024, however, the main action is likely to shift to Europe. The EU AI Act, adopted by the European Parliament last month, will require developers of “General Purpose AI Models,” such as those from OpenAI, Google and Meta, to “draw up and make publicly available a sufficiently detailed summary about the content used for training…according to a template provided by the AI Office.”
That and other provisions of the law will not start to take effect until early 2025. Between now and then, though, the newly formed EU AI Office is charged with developing a template for such summaries “which should be broadly comprehensive in its scope, rather than technically detailed, to facilitate parties with legitimate interests in exercising their rights. The training data content summaries may list the main data collections or sets used, such as large private or public databases or data archives, while providing a narrative explanation about other data sources used” (emphasis added).
That language (which could still be tweaked a bit by the lawyers before official publication in the EU’s Official Journal) is sufficiently open-ended as to all but guarantee a fierce battle over the final text and format of the template. As of now, at least, there’s a big, fat “may” in that last sentence quoted above, rather than a “shall,” which leaves plenty of wiggle room for lawyers and lobbyists to fight over.
Whatever the AI Office comes up with is also likely to inform any action Congress might take here regarding disclosure.
On Tuesday (4/9), in fact, Rep. Adam Schiff (D-Calif.) introduced the Generative AI Copyright Disclosure Act. The bill would require AI companies to submit notice to the U.S. Copyright Office, prior to the release of any new generative AI system, disclosing all copyrighted works used in building or altering the training dataset for that model. The law would apply both to new models and retroactively to models already on the market.
“We must balance the immense potential of AI with the crucial need for ethical guidelines and protections,” Schiff said in a statement. “My Generative AI Copyright Disclosure Act is a pivotal step in this direction. It champions innovation while safeguarding the rights and contributions of creators, ensuring they are aware when their work contributes to AI training datasets.”
AI’s implications for intellectual property are both profound and complex. But data transparency is quickly emerging as a critical threshold issue to be resolved before those other implications can be meaningfully addressed.