AI’s appetite for scraped content, without returning readers, is leaving site owners and content creators fighting for survival. Both search and AI use the results of absolutely ginormous scraping and spidering operations, but one provides benefits to the scrapees, while the other profits enormously from the work of others while simultaneously destroying their motivation to keep doing the work.
Source: How web scraping actually works – and why AI changes everything

The prospect of relying heavily on synthetic data hasn’t gone unnoticed by the creative industries. “I believe the main reason companies like OpenAI are having to rely more on synthetic data now is that they’ve run out of high-quality human-created data to mine from the public-facing internet,” says Reid Southern, a film concept artist and illustrator, adding, “It further distances them from any copyrighted materials they’ve trained on that could land them in hot water.”


On Monday, Cloudflare published research saying it observed the AI startup Perplexity ignoring blocks and hiding its crawling and scraping activities. The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s preferences,” Cloudflare’s researchers wrote.

An estimated 5.6% of U.S. search traffic on desktop browsers last month went to an AI-powered large language model like ChatGPT or Perplexity, according to Datos, a market intelligence firm that tracks web users’ behavior. That pales beside the 94.4% that still went to traditional search engines like Alphabet’s Google or Microsoft’s Bing. But the percentage of traffic that went to browser-based AI search has more than doubled since June 2024.