Websites are blocking outdated Anthropic scrapers, missing new ones

Hundreds of websites are attempting to prevent Anthropic from scraping their content. They’re blocking outdated scrapers, specifically “ANTHROPIC-AI” and “CLAUDE-WEB.” The root cause lies in outdated instructions being copied into robots.txt files.

Because AI companies keep launching new crawlers under different names, website owners need to update their robots.txt files frequently for their blocks to stay effective.

Popular websites, including Reuters.com and those in the Condé Nast family, are blocking these obsolete bots but failing to block the currently active “CLAUDEBOT.”
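For illustration, a minimal robots.txt sketch that covers both the legacy names and the crawler Anthropic currently operates might look like this; the legacy entries are harmless to keep, but the rule naming CLAUDEBOT is the one that matters today (robots.txt agent names are conventionally matched case-insensitively, so capitalization differences across blocklists are not the issue):

    # Legacy Anthropic agent names still found on many blocklists
    User-agent: anthropic-ai
    Disallow: /

    User-agent: Claude-Web
    Disallow: /

    # The crawler Anthropic currently operates
    User-agent: ClaudeBot
    Disallow: /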

Dark Visitors tracks and assists in updating robots.txt files to block AI scrapers

The anonymous operator of Dark Visitors describes the current robots.txt situation as chaotic. Dark Visitors tracks hundreds of web crawlers and scrapers, giving website owners tools to regularly update their robots.txt files.

This service has seen a surge in popularity as more site owners seek to prevent AI companies from scraping their content. Recently, new scrapers from Apple (Applebot-Extended) and Meta (Meta-ExternalAgent) were introduced, adding to the complexity.

Some sites, like Reddit, have resorted to blocking all crawlers except a select few, which often unintentionally affects search engines, internet archiving tools, and academic research.
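In robots.txt terms, that “allow a select few, block everyone else” approach typically looks something like the sketch below; the allowed agents are illustrative examples, not a reproduction of Reddit’s actual file:

    # Explicitly allow a handful of named crawlers (an empty Disallow means full access)
    User-agent: Googlebot
    Disallow:

    User-agent: Bingbot
    Disallow:

    # Turn away every other crawler by default
    User-agent: *
    Disallow: /

The wildcard rule at the end is what catches archiving tools and research crawlers that were never the intended target.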

Real-world examples show the true cost of unchecked AI scraping

iFixit, a repair guide website, reported nearly a million hits from Anthropic’s crawlers in one day, illustrating the sheer scale of the issue.

Similarly, Read the Docs, a service that hosts documentation for software projects, reported a crawler accessing 10 TB of files in a single day, leading to $5,000 in bandwidth costs and prompting the service to block the offending crawler.

Both companies urge AI firms to treat the sites they crawl with respect, warning that perceived abuse pushes site owners toward blocking crawlers indiscriminately.

Data Provenance Initiative highlights confusion in blocking AI scrapers

Blocking AI scrapers places the responsibility squarely on website owners, a burden made heavier by the rapidly growing number of scrapers.

The Data Provenance Initiative’s findings show that some bots listed in robots.txt files lack clear origins or connections to their supposed companies, muddying the waters.

For instance, the origin of “ANTHROPIC-AI” remains unclear, with no public evidence of its existence beyond its appearance on many different blocklists. Anthropic acknowledges that both “ANTHROPIC-AI” and “CLAUDE-WEB” are no longer in use but did not initially clarify whether the active “CLAUDEBOT” respects robots.txt directives meant for the older bots.

Comments from experts on AI scrapers

Experts from the community have shared their views on the issues many popular websites face as they try to block AI crawlers from accessing their content:

  • Shayne Longpre: Points out the prevalent issue of websites blocking outdated Anthropic agents while missing the active CLAUDEBOT.
  • Robb Knight: Highlights the difficulty in verifying user agents, using the example of “Perplexity-ai,” which many sites block despite the real scraper being named “PerplexityBot.”
  • Walter Haydock: Recommends an aggressive blocking strategy for suspected AI crawlers, citing the inherent lack of transparency and uncertainty in AI training processes.
  • Cory Dransfeldt: Echoes Haydock’s sentiment, supporting aggressive blocking practices and maintaining an AI bot blocklist on GitHub.

The ripple effect of AI scrapers: What it means for content creators

The growing confusion and difficulty of managing AI scrapers have broader implications for content creators. Many may choose to move their content behind paywalls to prevent unrestrained scraping.

Anthropic has stated a commitment to respecting outdated robots.txt preferences in a bid to reduce friction and align with website owners’ intentions, though this remains a complex and evolving issue.

Anthropic’s response to the blocking confusion

Anthropic’s spokesperson confirmed that “ANTHROPIC-AI” and “CLAUDE-WEB” are deprecated. Despite this, CLAUDE-WEB appears to have remained operational until recently: it was seen as late as July 12 on Dark Visitors’ test website.

Anthropic has reconfigured “CLAUDEBOT” to respect robots.txt directives set for the old agents, attempting to align with website owners’ preferences even if their robots.txt files are outdated.
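In practice, that means a robots.txt file that still names only a legacy agent, as in the one-rule sketch below, should now also be honored by CLAUDEBOT according to Anthropic, though explicitly naming the current crawler remains the safer option:

    # Legacy rule that Anthropic says its current crawler will now honor
    User-agent: anthropic-ai
    Disallow: /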

Final thoughts

Outdated defenses leave websites vulnerable to potentially large costs and major disruptions. To stay protected, website owners must keep their blocking rules up to date, adopt more aggressive strategies where warranted, and even consider moving content behind paywalls or account-based access.

Companies must adopt proactive measures and stay informed about new scraper technologies if they are to protect their digital assets.

Tim Boesen

August 5, 2024

4 Min