Spotlight On Data Transparency

We knew that OpenAI, Google and Meta relied on copyrighted material to train their generative AI models. The companies themselves have acknowledged as much by raising a fair use defense in the myriad lawsuits brought against them by copyright owners, including in the New York Times Co.’s copyright infringement lawsuit against OpenAI and Microsoft.

We also know that AI developers are increasingly desperate for new sources of high-quality data to train on as they rapidly exhaust the published contents of the World Wide Web, and are pushing the envelope in the pursuit of untapped resources.

Fixing AI’s Market Failure

Large Language Models (LLMs) require large amounts of data for training. Very large. Like the entire textual content of the World Wide Web large. In the case of the largest such models (OpenAI’s GPT, Google’s Gemini, Meta’s LLaMA, France’s Mistral), most of the data used is simply vacuumed up from the internet, if not by the companies themselves then by third-party bot-jockeys like Common Crawl, which provides structured subsets of the data suitable for AI training. Other tranches come from digitized archives like the Books 1, 2 and 3 collections and Z-Library.

In nearly all cases, the hoovering and archive-compiling has been done without the permission or even the knowledge of the creators or rights owners of the vacuumed-up haul.

Dancing With the AI Devil

Fresh off scaring the bejeezus out of many in Hollywood with demos of its text-to-video generator Sora, OpenAI now wants in. According to Bloomberg, top executives at the generative AI developer will hold a round of meetings this week with a number of film studios and Hollywood honchos to discuss what Sora can do for them.

We’ve discussed here before why everything that can be made with AI, will be in Hollywood. So it is no great surprise that studio folks would take the meetings. But appearing to get cozy with Sora right now carries significant risk for the studios.

Generative AI was recently at the center of extensive labor unrest in Hollywood that cost the studios the better part of a year’s worth of production. As a result of that unrest, they are also now bound by collective bargaining agreements with writers and actors that circumscribe what they can do unilaterally with tools like Sora.

Will the Price Be Right for AI Training Rights?

We’ve said it before, and now we can say it again: Don’t sleep on the Federal Trade Commission when it comes to a regulatory response to the rise of generative AI. On Friday, Reddit filed an amended S-1 registration statement for its planned IPO in which it disclosed that the FTC has begun investigating its data licensing program for AI training.

“[O]n March 14, 2024, we received a letter from the FTC advising us that the FTC’s staff is conducting a non-public inquiry focused on our sale, licensing, or sharing of user-generated content with third parties to train AI models,” the amended S-1 said. “Given the novel nature of these technologies and commercial arrangements, we are not surprised that the FTC has expressed interest in this area. We do not believe that we have engaged in any unfair or deceptive trade practice.”

Finetuning AI Copyright Infringement Claims

Stop me if you’ve heard this one, but a group of authors has filed a prospective class action lawsuit against the developer of a generative AI model alleging copyright infringement. Filed Friday (March 8) in the Northern District of California, the suit targets Nvidia, the chipmaker whose GPUs are widely used in data centers to handle the massive computing work required to train and manage generative AI models, but which also provides its own Large Language Models as part of its NeMo Megatron AI development tool kit.

The complaint names three plaintiffs, authors Abdi Nazemian, Brian Keene and Stewart O’Nan, but seeks money damages on behalf of “All persons or entities domiciled in the United States that own a United States copyright in any work” used in training the Nvidia LLM, known as NeMo Megatron.

Anything That Can Be Made With AI Will Be, In Hollywood

At the risk of belaboring the obvious, generative AI is now everywhere in the media and rights-based industries. It’s writing news articles and fan-fic e-books, it’s making music, it’s creating artwork. But no creative industry will be transformed by AI quite as much as movie and television production. The reason has as much to do with economics as technology.

Warner Bros.’ “Dune: Part Two” opened to a whopping $81.5 million domestically over the weekend, and $97 million internationally. It brought a welcome boost to theaters, which had seen the number of butts in seats come crashing down from the summer’s “Barbenheimer” high. And it showed that big-budget, effects-driven spectacles can still deliver for a studio, especially if they’re spectacular enough to justify release on large-format screens, like IMAX, which carry a premium ticket price and accounted for 48% of “Dune’s” domestic tally.

AI and the News: Deal, or No Deal?

Reddit, the self-anointed “front page of the internet,” sits atop a huge archive of original content. It contains more than a billion posts created by its 73 million average daily unique users self-organized into more than 100,000 interest-based communities, or subreddits, ranging from sports to politics, technology, pets, movies, music & TV, health & nutrition, business, philosophy and home & garden. You name it, there’s likely to be a subreddit for it.

The scale and diversity of the Reddit archive, replete with uncounted links to all corners of the World Wide Web and made freely accessible via API, has long been a highly valued resource for researchers, academics and developers building third-party applications for accessing Reddit communities. More recently, it has also been eagerly mined by developers of generative AI tools in need of large troves of natural language texts on which to train their models.

Fighting Deep Fakes: IP, or Antitrust? (Updated)

The Federal Trade Commission last week elbowed its way into the increasingly urgent discussion around how to respond to the flood of AI-generated deep fakes plaguing celebrities, politicians, and ordinary citizens. As noted in our previous post, the agency issued a Supplemental Notice of Proposed Rulemaking (SNPRM) seeking comment on whether its recently published rule prohibiting business or government impersonation should be extended to cover the impersonation of individuals as well.

The impersonation rule bars the unauthorized use of government seals or business logos when communicating to consumers by mail or online. It also bans the spoofing of email addresses, such as .gov addresses, or falsely implying an affiliation with a business or government agency.

Suddenly, Everyone is Adding Watermarks to AI Generated Media

With election season in full swing in the U.S. and European Union, and concern growing over deep-fake and AI-manipulated images and video targeting politicians as well as celebrities, AI heavyweights are starting to come around to supporting industry initiatives to develop and adopt technical standards for identifying AI-produced content.

At last month’s World Economic Forum in Davos, Meta president of global affairs Nick Clegg called efforts to identify and detect AI content “the most urgent task” facing the industry. The Facebook and Instagram parent began requiring political advertisers using its platforms to disclose whether they used AI tools to create their posts late last year. But it has also now gotten behind the technical standard developed by the Coalition for Content Provenance and Authenticity (C2PA) for certifying the source and history of digital content.

Copyright and AI: Where’s the Harm?

Berkeley law professor Pamela Samuelson has ruffled more than a few feathers among creators and rights owners over the years. In her role as co-founder and chair of the Authors Alliance, her seats on the boards of the Electronic Frontier Foundation and Public Knowledge, and in spearheading the American Law Institute’s controversial restatement of copyright law, she has been a high-profile and vocal skeptic of expansive views of copyright protections, particularly in the realm of digital platforms and technologies.
