
Cloudflare, which powers around 20% of the internet, recently published a detailed blog post accusing Perplexity AI of deliberately bypassing website no-crawl directives, a behaviour that undermines the foundational norms of the web.
Perplexity, the rapidly rising AI-powered answer engine, was found to:
- Ignore robots.txt directives completely, even on websites actively blocking its known crawler agents. That is a fundamental breach of web courtesy and protocol (a sample of the kind of file being ignored follows this list).
- When blocked, pivot to undisclosed IP addresses and entirely new ASNs (autonomous system numbers), effectively masking its identity and circumventing restrictions.
- Resort to spoofed user agents, impersonating Chrome on macOS to disguise its crawling behaviour and evade browser-based detection or blocking.
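For concreteness, here is the kind of directive at issue. The sample below uses the agent names Perplexity publicly documents for its crawlers (PerplexityBot and Perplexity-User); a site serving this file is explicitly refusing consent to crawl any path.

```
# robots.txt: refuse Perplexity's declared crawlers site-wide
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```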
These stealth tactics were uncovered via controlled tests: Cloudflare created brand-new domains with prohibitive robots.txt files and blocked Perplexity's official agents, yet Perplexity still delivered content from those sites when prompted, proving access had occurred despite the blocks.
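Cloudflare's write-up does not include its test harness, but the logic it describes can be sketched as a simple canary check: plant an unguessable marker on a fresh, fully blocked domain, then ask the answer engine about that URL. Both `publish_page` and `ask_answer_engine` below are hypothetical stand-ins, since the real steps (registering a domain, querying Perplexity) are manual or service-specific.

```python
import secrets

def run_canary_test(publish_page, ask_answer_engine) -> bool:
    """Return True if content from a robots.txt-blocked page leaks into an answer."""
    marker = secrets.token_hex(16)  # unguessable token unique to this test run
    # publish_page (hypothetical) serves this body on a fresh domain whose
    # robots.txt disallows all crawling, and returns the page's URL.
    url = publish_page(body=f"Canary marker: {marker}")
    # ask_answer_engine (hypothetical) submits a prompt to the service under test.
    answer = ask_answer_engine(f"What does the page at {url} say?")
    # If the marker surfaces in the answer, the page was fetched despite the block.
    return marker in answer
```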
As a result, Cloudflare delisted Perplexity from its Verified Bots program and deployed new heuristics across its managed rules to block the stealth crawler variant. CEO Matthew Prince likened Perplexity's methods to “North Korean hacker tactics” and stated it was “Time to name, shame, and hard block” those violating web norms. This incident followed Cloudflare's broader move to restrict AI crawlers by default, including its “Pay Per Crawl” model, which lets publishers charge for access to their content even where robots.txt would otherwise prohibit scraping.
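Cloudflare has not published the exact signals its new managed rules key on, so the sketch below is purely conceptual, not Cloudflare's rule logic: it flags traffic that claims to be Chrome on macOS while showing network traits real browsers rarely do. The ASN set (private-use numbers) and rate threshold are invented placeholders.

```python
# Conceptual heuristic only; not Cloudflare's actual managed-rule logic.
SUSPECT_ASNS = {64512, 64513}      # placeholder private-use ASNs, not real networks
BROWSER_MARKERS = ("Macintosh", "Chrome/")
MAX_REQS_PER_MIN = 120             # placeholder: far beyond human browsing rates

def looks_like_stealth_crawler(user_agent: str, asn: int, reqs_per_min: float) -> bool:
    """Flag requests that claim to be desktop Chrome on macOS but behave like bots."""
    claims_browser = all(marker in user_agent for marker in BROWSER_MARKERS)
    bot_like_network = asn in SUSPECT_ASNS or reqs_per_min > MAX_REQS_PER_MIN
    return claims_browser and bot_like_network
```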
A Crisis of Consent in the AI Age
We are increasingly reliant on AI services trained on live, uncached web content. When AI firms bypass fundamental web standards to gather data, they jeopardize the ethical framework of content usage and publisher rights. Cloudflare's findings contrast sharply with the behaviour of OpenAI and other players, whose crawlers reportedly respect blocks and fall back when barred, proving that ethical compliance is feasible even at scale.
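In practice, “respecting blocks” is straightforward to implement: consult robots.txt before fetching, and fall back (skip the page) when disallowed. Here is a minimal sketch using Python's standard-library robotparser; ExampleBot is a placeholder agent name, not any vendor's real crawler.

```python
from urllib import request, robotparser
from urllib.parse import urlsplit

USER_AGENT = "ExampleBot/1.0"  # placeholder crawler identity

def polite_fetch(url: str) -> str | None:
    """Return the page body, or None when robots.txt disallows this agent."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    if not rp.can_fetch(USER_AGENT, url):
        return None  # blocked: fall back instead of evading
    req = request.Request(url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```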
Current web defenses like robots.txt have historically relied on trust; the Perplexity incident proves that without enforcement, bad actors can—and will—ignore rules. That erodes trust across the ecosystem.
Publishers and platform providers must insist on legitimizing AI crawlers through verifiable governance frameworks—not just technical allowances. Industry standards must evolve to include stronger auditability and transparency, ensuring that crawlers can be identified, purpose-validated, and revoked if they misbehave. AI companies need to internalize the importance of trust—especially when training models on copyrighted or restricted content. Ignoring blocks not only risks legal scrutiny but damages brand credibility.
This is more than a technical spat—it’s a trust crisis at the heart of AI ethics. If firms like Perplexity choose stealth over respect, they risk isolating publishers—and losing their legitimacy. AI’s future must not bypass consent or break foundational web rules.
Without adherence to clearly defined web norms, unchecked AI crawling threatens publishers’ rights and undermines content credibility. If AI startups want sustainable scale, they must earn trust—not simply bypass defenses.