The distinguishing characteristic of large language models (LLMs) is, as the name implies, their sheer size. Models like Meta’s LLaMA-2 and OpenAI’s GPT-4 comprise tens to hundreds of billions of parameters — the individual weights and variables they derive from their training data and use to process prompt inputs. Scale is also the defining characteristic of the training process LLMs undergo. The datasets they ingest are almost incomprehensibly large — comparable in scope to the entire World Wide Web — and require immense amounts of computing capacity and energy to analyze.
Generative AI models don’t need to be that large to be useful, however. Researchers at Microsoft, for instance, recently published a technical report on a language model they call phi-1.5, which comprises a mere 1.3 billion parameters, or about one one-hundredth the size of GPT-3.5, the model that powered the original ChatGPT. A newer version, phi-2, contains 2.7 billion parameters.
Those small language models were trained on correspondingly smaller, more selective datasets. And while not as capable as LLMs, the phi models displayed capabilities in benchmarking tests comparable to models five to ten times their size, according to the researchers. They even exhibited multi-modality, or the ability to process images as well as text.
If assigned to the right tasks, such small language models could prove to be perfectly suitable alternatives to large foundation models like GPT and LLaMA. And they could be developed and deployed at a fraction of the cost, and with a fraction of the computing capacity and energy, that large models require.
They could also help address other challenges posed by the rise of generative AI technology.
The scale of LLMs is a major contributing factor to the controversy over the use of copyrighted material in training. Even if it were established that any such use requires authorization from the rights owner, the sheer volume and diversity of the data involved would make it extremely difficult to devise and implement a fair and manageable system to administer those authorizations.
Scale is also likely to defeat any attempt to attribute the output of LLMs to particular inputs for purposes of remuneration. With hundreds of billions of discrete parameters within a model, it is effectively impossible to know what any one parameter is accomplishing or to trace the calibration of that parameter to a particular input or set of inputs.
Scaling down the size of the models could help on both scores. If smaller but still capable models can be trained on smaller, more tightly curated datasets, the process would more readily lend itself to licensing. A few large archives licensed from a small number of individual rights owners might be sufficient to train models for specific applications.
The contributions of each archive to the model would also be less opaque, making attribution more plausible.
Stability AI’s Stable Audio model, for instance, was trained on a mere 800,000 licensed tracks, for which rights owners were remunerated.
Even a single archive, if sufficiently comprehensive, could be adequate to train a usable, task-specific model.
Getty Images, for instance, recently unveiled an image generator in partnership with Nvidia that it says was trained exclusively on Getty’s own library of licensed photos.
Adobe’s Firefly image generator was trained entirely on its own archive of photos and other images, according to the company.
The Big 3 record companies are all exploring music generation models that could be trained exclusively on their own internal catalogs of tracks.
In addition to being cheaper for developers to create, small models trained on limited datasets offer benefits to users compared with relying on large foundation language and diffusion models. Foremost among those benefits is protection against the potential copyright liability that can come with using applications built on Stable Diffusion, GPT or other large foundation models.
While most of the copyright litigation against generative AI developers has so far targeted foundation models, the companies behind those models are working feverishly to shift liability onto downstream application developers and closer to the end user (see below).
Models trained on smaller, more selectively curated datasets might also prove better suited to specific use cases.
In short, while the training and use of large foundation models may continue to pose difficult legal and copyright policy challenges due to their scale, traditional market forces could increasingly shift much of the ordinary business and consumer use of generative AI technology to smaller, task-specific models trained on liability-free datasets. That could help move the discussion around AI from arguments over fair use and derivative works to the more mundane concerns of cost, time-to-market and product design.