Large Language Models (LLMs) require large amounts of data for training. Very large. Like the entire textual content of the World Wide Web large. In the case of the largest such models — OpenAI’s GPT, Google’s Gemini, Meta’s LLaMA, France’s Mistral — most of the data used is simply vacuumed up from the internet , if not by the companies themselves then by third-party bot-jockeys like Common Crawl, which provides structured subsets of the data suitable for AI training. Other tranches come from digitized archives like the Books 1, 2 and 3 collections and Z-Library.
In nearly all cases, the hoovering and archive-compiling has been done without the permission or even the knowledge of the creators or rights owners of the vacuumed-up haul.