Data

How web scraping actually works – and why AI changes everything

AI’s appetite for scraped content, without returning readers, is leaving site owners and content creators fighting for survival. Both search and AI use the results of absolutely ginormous scraping and spidering operations, but one provides benefits to the scrapees, while the other profits enormously from the work of others while simultaneously destroying their motivation to keep doing the work.

Source: How web scraping actually works – and why AI changes everything

Publisher traffic sources: Google steady but social and direct referrals are down

New data from Chartbeat suggests that “search” as a source of total traffic to major news publishers has remained stable over the last year. This appears to chime with a Google statement earlier this month downplaying the impact of AI Overviews and AI Mode on publisher referrals. However, this includes Google Discover – which has replaced search as the main source of Google traffic. Social media has, however, sharply declined as a source of publisher traffic in recent years, as has direct traffic.

Source: Publisher traffic sources: Google steady but social and direct referrals are down

Synthetic data is the new AI gold rush, but critics call it ‘data laundering’

The prospect of relying heavily on synthetic data hasn’t gone unnoticed by the creative industries. “I believe the main reason companies like OpenAI are having to rely more on synthetic data now is that they’ve run out of high-quality human-created data to mine from the public-facing internet,” says Reid Southern, a film concept artist and illustrator, adding, “It further distances them from any copyrighted materials they’ve trained on that could land them in hot water.”

Source: Synthetic data is the new AI gold rush, but critics call it ‘data laundering’

Publishing Giants Escalate War on ‘Shadow Libraries’ With Broad Cloudflare Subpoena 

Major academic publishers, including Elsevier and Springer Nature, are trying to unmask the operators of several shadow libraries including Anna’s Archive, Z-Library and Libgen. They’re also targeting SLUM, a third-party uptime monitor for these unofficial libraries. A DMCA subpoena, issued by a D.C. federal court, requires Cloudflare to hand over identifying user data for possible legal action.

Source: Publishing Giants Escalate War on ‘Shadow Libraries’ With Broad Cloudflare Subpoena * TorrentFreak

Introducing the Authenticity & Content Provenance Maturity Model

From AI‑generated product shots slipping into online catalogs, to manipulated photos influencing legal disputes, the trustworthiness of visual content has never been more in question. In an environment where seeing is no longer believing, brands, media organizations, and cultural institutions can no longer afford to ignore the issue of content authenticity.

Source: Introducing the Authenticity & Content Provenance Maturity Model – Kaptur

Perplexity Says Cloudflare Is Blocking Legitimate AI Assistants

Perplexity published a response to Cloudflare’s claims that it disrespects robots.txt and engages in stealth crawling. Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants. According to Perplexity, its system does not store or index content ahead of time. Instead, it fetches webpages only in response to specific user questions.

Source: Perplexity Says Cloudflare Is Blocking Legitimate AI Assistants

Perplexity accused of scraping websites that explicitly blocked AI scraping

On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities. The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages “in an attempt to circumvent the website’s preferences,” Cloudflare’s researchers wrote.

Source: Perplexity accused of scraping websites that explicitly blocked AI scraping | TechCrunch

Copyright Lawsuit Accuses Meta of Pirating Adult Films for AI Training

Adult film producers Strike 3 Holdings and Counterlife Media have filed a significant copyright infringement lawsuit against tech giant Meta. A complaint filed at a California federal court alleges that their films were downloaded via BitTorrent for AI training purposes. With at least 2,396 movies at stake, potential damages could exceed 350 million dollars.

Source: Copyright Lawsuit Accuses Meta of Pirating Adult Films for AI Training * TorrentFreak

Artists rage over changes to WeTransfer’s new terms of service

If you have ever needed to send a file larger than 20 MB, you have probably used or at least heard of the online file-sending service WeTransfer. You may have also heard, earlier this month, a chorus of uproar on social media led by artists sharing screenshots of WeTransfer’s updated terms of service agreement that granted the company the right to use all materials transferred via their service, without any remuneration to the uploader or regard for their privacy.

Source: Comment | As artists rage over changes to WeTransfer’s terms of service, here’s why the company is now in its villain era

AI Search Is Growing More Quickly Than Expected

An estimated 5.6% of U.S. search traffic on desktop browsers last month went to an AI-powered large language model like ChatGPT or Perplexity, according to Datos, a market intelligence firm that tracks web users’ behavior. That pales beside the 94.4% that still went to traditional search engines like Alphabet’s Google or Microsoft’s Bing. But the percentage of traffic that went to browser-based AI search has more than doubled since June 2024...

Source: AI Search Is Growing More Quickly Than Expected
