Blocking AI bots used to be one line in robots.txt. In 2026, one line is how you either disappear from AI search or feed every training pipeline you wanted to opt out of.
Most websites still treat "AI bots" as one category. There are at least five functionally distinct categories of AI user agents hitting websites in 2026, and the access rules, identity mechanisms, and consequences differ sharply across them. Treating them as one bucket either over-blocks (and disappears from AI answers) or under-blocks (and feeds training pipelines the operator never consented to). Understanding which bots are visiting and what each one does is the infrastructure question that sits underneath every pillar of machine-first architecture: you cannot optimize your website for machines if you do not know which machines are showing up.
The scale is already meaningful. In March 2025, Cloudflare reported that AI crawlers were generating more than 50 billion requests per day across its network, or just under 1% of all web requests Cloudflare observed. By January 2026, Cloudflare's two-month analysis of Googlebot against other AI crawlers showed Googlebot reaching 1.70× more unique URLs than ClaudeBot, 1.76× more than GPTBot, 2.99× more than Meta-ExternalAgent, 3.26× more than Bingbot, 167× more than PerplexityBot, and 714× more than CCBot. The volume distribution is heavily skewed toward a few players, and the skew is shifting fast.
This article is the reference version of that taxonomy. Every user agent string, robots.txt stance, published IP range URL, and Web Bot Auth status below was verified against the primary vendor documentation cited in the entry. Where primary documentation does not exist, the entry says so explicitly. Two of the most commonly mentioned AI crawlers (Bytespider and xAI's Grok crawler) have no official vendor documentation page at all, which is itself useful information for operators making policy decisions.
This article complements two existing references on this website: The Agentic Browser Landscape in 2026 (the tools human users ride inside to reach the web through AI) and Google-Agent: The Web's New Visitor Just Got an Identity (a deep dive on one specific user-triggered fetcher and the Web Bot Auth concept it introduced).
Initial publication: April 13, 2026. Updated monthly as vendor documentation changes, new crawlers are introduced, and the IETF Web Bot Auth working group publishes new drafts.
Contents
- Why "AI bots" is not one category
- The five categories at a glance
- Category 1: Training Crawlers
- Category 2: Search and Retrieval Crawlers
- Category 3: User-Triggered Fetchers
- Category 4: Opt-Out Tokens (Not Crawlers)
- Category 5: Undeclared and Masquerading Traffic
- The Identity Layer: Beyond User-Agent Strings
- A Note on llms.txt
- How to Identify AI Bots in Your Server Logs
- The Access Policy Matrix
- What This Means for Your Website
Why "AI bots" is not one category
A training crawler fetching content to improve GPT-6 has nothing in common with a user-triggered fetcher completing a one-off research task for a specific human who just typed a query into Gemini. They use different user agent strings, follow different rules, are operated by different teams, and have different consequences for the website owner. An operator who blocks both with the same directive is either losing AI search visibility they wanted (by blocking retrieval) or feeding training pipelines they wanted to opt out of (by allowing training crawlers that ignore the generic directive they relied on).
The root problem is that robots.txt was designed in 1994 for a world with one kind of crawler. AI traffic in 2026 has at least five, each with its own rules of engagement.
The five categories at a glance
| Category | Purpose | Example bots | Respects robots.txt | Visible in logs |
|---|---|---|---|---|
| 1. Training Crawlers | Fetch content to train LLMs | GPTBot, ClaudeBot, Amazonbot, Meta-ExternalAgent, CCBot | Vendor-dependent | Yes |
| 2. Search and Retrieval Crawlers | Fetch to build AI retrieval indexes | OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot | Yes (vendor-documented) | Yes |
| 3. User-Triggered Fetchers | Fetch when a specific human asks | Google-Agent, ChatGPT-User, Claude-User, Perplexity-User | Vendor-dependent | Yes |
| 4. Opt-Out Tokens | Robots.txt control directives | Google-Extended, Applebot-Extended | N/A (not crawlers) | No |
| 5. Undeclared and Masquerading | Scrape without identifying | Bytespider, xAI Grok, Copilot Actions | No (or unverifiable) | Only with active detection |
Two notes to read the table with:
- Respecting `robots.txt` is a vendor-level property, not a category-level one. In the User-Triggered Fetchers row, Google's fetchers (including Google-Agent) ignore `robots.txt`, while OpenAI's and Anthropic's fetchers respect it. The detail matters because the policy decision follows the vendor, not the category.
- Opt-out tokens never appear in access logs. `Google-Extended` and `Applebot-Extended` are directives for existing crawlers, not new crawlers themselves. They make no HTTP requests. The distinction between opt-out tokens and actual crawlers is widely misreported by SEO listicles.
Category 1: Training Crawlers
Training crawlers fetch content to train large language models. Blocking them is how a website opts out of being part of future model capabilities. Allowing them contributes content to the pretraining corpora that power LLMs. Either decision has consequences, and the decision is separate from the question of whether the website should appear in AI search answers today.
The opt-out has become a mass movement. In August 2025, Cloudflare reported that over 2.5 million websites had chosen to completely disallow AI training through its managed robots.txt feature or its managed rule blocking AI crawlers, and the number has grown since.
GPTBot (OpenAI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)`
- Purpose: Training data collection for future OpenAI models
- Respects `robots.txt`: Yes
- Published IP ranges: `https://openai.com/gptbot.json`
- Primary source: `https://platform.openai.com/docs/bots`
OpenAI runs three distinct bots (GPTBot, OAI-SearchBot, ChatGPT-User), each with its own IP JSON file and its own robots.txt control token. An operator who wants to allow search but block training can Disallow GPTBot only, and still appear in ChatGPT Search results served by OAI-SearchBot.
ClaudeBot (Anthropic)
- User-agent: `Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected])`
- Purpose: Training data collection for Claude
- Respects `robots.txt`: Yes. Anthropic's documentation: "Anthropic's Bots respect 'do not crawl' signals by honoring industry standard directives in robots.txt." Also supports the non-standard `Crawl-delay`.
- Published IP ranges: None published. Anthropic recommends `robots.txt` control rather than IP allow-listing.
- Primary source: `https://support.claude.com/en/articles/8896518`
Anthropic consolidated its crawler fleet in 2024 to three bots: ClaudeBot (training), Claude-User (user-triggered), and Claude-SearchBot (retrieval indexing). The older anthropic-ai and claude-web user agent tokens are deprecated. robots.txt rules targeting those legacy strings are ineffective against current Claude traffic. Many SEO listicles still recommend blocking them.
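As a concrete illustration, a robots.txt that opts out of Claude training while leaving retrieval and user-triggered access intact targets the current token, not the deprecated ones:

```
# Opt out of Claude training only; Claude-User and Claude-SearchBot
# remain unaffected unless targeted separately
User-agent: ClaudeBot
Disallow: /

# Deprecated tokens -- rules against these no longer match Anthropic traffic:
# User-agent: anthropic-ai
# User-agent: claude-web
```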
Amazonbot
- User-agent: `Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36`
- Purpose: Indexing for Amazon services (Alexa, product intelligence) and LLM training for Amazon Nova
- Respects `robots.txt`: Yes
- Published IP ranges: Not published as JSON. Amazon documents reverse DNS verification as the recommended method.
- Primary source: `https://developer.amazon.com/amazonbot`
Amazonbot uses a mechanism no other major AI crawler uses: a page-level noarchive meta tag as the training opt-out signal. An operator who allows Amazonbot but does not want content used for Nova training can add <meta name="robots" content="noarchive"> to individual pages. This approach has the advantage of working per page rather than per domain, and the disadvantage of being unique to Amazon, which means it cannot be part of a generic opt-out policy.
Meta-ExternalAgent
- User-agent: `meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)`
- Purpose: Training and AI indexing for Meta's AI products across Facebook, Instagram, WhatsApp, and Messenger
- Respects `robots.txt`: Yes
- Published IP ranges: None published
- Primary source: `https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/`
Do not confuse Meta-ExternalAgent with facebookexternalhit. The latter is Meta's long-running link-preview fetcher for Open Graph metadata when URLs are shared in Facebook posts or WhatsApp messages. It is not an AI bot, does not feed training, and blocking it breaks link previews across Meta's products. Many AI bot blocklists conflate them and break link sharing as a side effect.
CCBot (Common Crawl)
- User-agent: `CCBot/2.0 (https://commoncrawl.org/faq/)`
- Purpose: Open web corpus collection. Downstream consumers include most major LLM training datasets.
- Respects `robots.txt`: Yes
- Published IP ranges: `https://index.commoncrawl.org/ccbot.json`
- Primary source: `https://commoncrawl.org/ccbot`
CCBot is worth understanding because blocking it has outsized consequences. Common Crawl is the upstream source for nearly every open LLM training dataset, including (historically) OpenAI's GPT-3 training corpus. A website that blocks CCBot opts out of being in the training data for most open-source and open-weight model releases, not just one vendor's. That is a larger opt-out than most operators realize.
Applebot
- User-agent: `Mozilla/5.0 ... (Applebot/Applebot_version; +http://www.apple.com/go/applebot)`
- Purpose: Indexing for Siri, Spotlight, Safari Suggestions, and Apple Intelligence generative features (the latter controlled via Applebot-Extended)
- Respects `robots.txt`: Yes. Falls back to Googlebot directives when no Applebot-specific rules exist.
- Published IP ranges: Published via a CIDR file linked from Apple's documentation. Reverse DNS of `*.applebot.apple.com` is also valid for verification.
- Primary source: `https://support.apple.com/en-us/119829`
Applebot is a search/retrieval crawler by function, but its AI training role is controlled by the separate Applebot-Extended opt-out token (documented in Category 4). Blocking Applebot removes the website from Siri, Spotlight, and Safari Suggestions. Blocking Applebot-Extended only opts out of Apple Intelligence training while keeping search inclusion.
Other training crawlers worth knowing
- PerplexityBot (Perplexity) is documented by Perplexity as an indexing crawler, not a training crawler. Perplexity does not build foundation models. It is listed in Category 2.
- DuckAssistBot (DuckDuckGo) is documented as retrieval for DuckDuckGo's AI-assisted answers, explicitly not training. See Category 2.
- Diffbot is a commercial structured-data extraction service that respects `robots.txt` by default but can be configured per customer. Used for Knowledge Graph construction and some training pipelines.
- MistralAI-User (Mistral AI) is documented as a user-triggered retrieval fetcher, not a training crawler. See Category 3.
- ImagesiftBot (Hive) is an image-focused crawler whose documented purpose is image discovery for ImageSift reverse-image search. Its relationship to Hive's visual AI model training is not documented by the vendor.
- Bytespider (ByteDance) is widely reported as a training crawler but publishes no vendor documentation page. ByteDance has not confirmed its purpose, `robots.txt` stance, or IP ranges. Treat as undocumented (Category 5).
- Cohere's training crawler similarly has no vendor documentation page. Behavioral reports describe standard `robots.txt` compliance, but the vendor has not confirmed it.
Category 2: Search and Retrieval Crawlers
Search and retrieval crawlers fetch content to build an AI system's real-time answer index. Blocking them removes the website from the search results and cited answers those systems produce. Allowing them is how a website appears in ChatGPT Search, Claude retrieval, Perplexity answers, and other AI search surfaces.
Unlike training crawlers, retrieval crawlers are visibility infrastructure. Blocking them is a direct trade-off with AI search presence today.
OAI-SearchBot (OpenAI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)`
- Purpose: Building the search index used in ChatGPT's search features. Not used for training.
- Respects `robots.txt`: Yes
- Published IP ranges: `https://openai.com/searchbot.json`
- Primary source: `https://platform.openai.com/docs/bots`
OpenAI's documentation explicitly states that operators can allow OAI-SearchBot while disallowing GPTBot. That is the canonical pattern for an operator who wants ChatGPT Search presence without contributing content to foundation model training.
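Expressed as a robots.txt, the canonical search-without-training pattern for OpenAI looks like:

```
# Stay visible in ChatGPT Search, opt out of model training
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```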
Claude-SearchBot (Anthropic)
- User-agent: `Claude-SearchBot`
- Purpose: Systematic indexing of publicly available content to improve Claude search quality
- Respects `robots.txt`: Yes
- Published IP ranges: None published
- Primary source: `https://support.claude.com/en/articles/8896518`
Claude-SearchBot is independently controllable from ClaudeBot. An operator can allow Claude-SearchBot to appear in Claude's retrieval while disallowing ClaudeBot to keep content out of training data.
PerplexityBot (Perplexity)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`
- Purpose: Indexing for Perplexity answers and citations. Perplexity does not build foundation models.
- Respects `robots.txt`: Per Perplexity's documentation, yes. Caveat: Cloudflare published independent evidence in August 2025 documenting that Perplexity uses undeclared crawlers that circumvent `robots.txt` directives. Cloudflare subsequently de-listed Perplexity from its Verified Bots program over the findings.
- Published IP ranges: `https://www.perplexity.com/perplexitybot.json`
- Primary source: `https://docs.perplexity.ai/guides/bots`
Perplexity is the clearest case of published vendor documentation and observed behavior diverging. Operators who care about a hard block should not rely on robots.txt alone for Perplexity traffic.
Google-CloudVertexBot
- User-agent substring: `Google-CloudVertexBot`
- Purpose: Fetches websites on site owners' request when building Vertex AI Agents for enterprise retrieval and grounding
- Respects `robots.txt`: Yes
- Published IP ranges: Covered by `https://developers.google.com/static/search/apis/ipranges/googlebot.json`
- Primary source: `https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers`
Google-CloudVertexBot is an enterprise crawler, separate from Googlebot and separate from Google-Agent. It fetches content for Vertex AI Agents a Google Cloud customer is building, typically when the site owner has opted in. Blocking it does not affect Google Search inclusion.
Bingbot
- User-agent (desktop): `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)`
- Purpose: Indexing for Bing Search and Microsoft Copilot retrieval
- Respects `robots.txt`: Yes
- Published IP ranges: `https://www.bing.com/toolbox/bingbot.json`
- Primary source: `https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0`
Microsoft has not published a distinct Copilot-specific user agent. Bingbot feeds both Bing Search and Copilot's retrieval layer. A separate category is Microsoft's Copilot Actions (agentic automation), which uses standard Edge/Chromium browser user agents and no bot signal, making it invisible at the user-agent layer. That traffic belongs in Category 5.
DuckAssistBot (DuckDuckGo)
- User-agent: `Mozilla/5.0 (compatible; DuckAssistBot/1.0; +https://duckduckgo.com/duckassistbot)`
- Purpose: Retrieval for DuckDuckGo's AI-assisted answers. Explicitly not training.
- Respects `robots.txt`: Yes. DuckDuckGo documentation states data "is not used in any way to train AI models."
- Published IP ranges: Documented alongside DuckDuckBot
- Primary source: `https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot`
Search and retrieval comparison
| Bot | Operator | Respects robots.txt | IP ranges published |
|---|---|---|---|
| OAI-SearchBot | OpenAI | Yes | Yes (/searchbot.json) |
| Claude-SearchBot | Anthropic | Yes | No |
| PerplexityBot | Perplexity | Documented yes, observed no | Yes (/perplexitybot.json) |
| Google-CloudVertexBot | Google | Yes | Yes (under googlebot.json) |
| Bingbot | Microsoft | Yes | Yes (bingbot.json) |
| DuckAssistBot | DuckDuckGo | Yes | Yes |
Category 3: User-Triggered Fetchers
A user-triggered fetcher is a bot that visits a specific URL because a specific human asked an AI system to visit it. The fetch is a proxy request on behalf of the person who typed a query or clicked a link inside an AI product, not an autonomous crawl.
This is the category where vendor policies diverge the most. The robots.txt position is vendor-level, not category-level. Read the subsections carefully before assuming that blocking one fetcher blocks them all.
Google-Agent and Google's full fetcher list
- Google-Agent user-agent: Standard Chrome desktop and mobile strings, with fetches originating from `user-triggered-agents.json` IP ranges
- Purpose: Agents on Google infrastructure navigate the web and perform actions on the user's behalf, including Project Mariner
- Respects `robots.txt`: No. Google's documentation states verbatim: "Because the fetch was requested by a user, these fetchers generally ignore robots.txt rules."
- Web Bot Auth: Experimental. Vendor quote: "Google is also experimenting with the `web-bot-auth` protocol, using the `https://agent.bot.goog` identity."
- Published IP ranges: `https://developers.google.com/static/crawling/ipranges/user-triggered-agents.json` (a dedicated file, separate from other Google user-triggered fetchers)
- Primary source: `https://developers.google.com/crawling/docs/crawlers-fetchers/google-user-triggered-fetchers`
Google-Agent is the headline case in this category. It is the only Google fetcher with its own dedicated IP range file, the only Google bot with documented Web Bot Auth experimentation, and the archetypal example of a vendor that treats user-triggered fetches as browser-like requests that robots.txt does not apply to. I wrote about Google-Agent in depth in Google-Agent: The Web's New Visitor Just Got an Identity.
Google's user-triggered fetchers list is not limited to Google-Agent. The same robots.txt-ignoring policy covers a dozen other bots at the time of writing, all documented on the same Google page. The ones most commonly seen in server logs include:
| Bot | Purpose |
|---|---|
| Google-NotebookLM | Fetches URLs provided as NotebookLM project sources |
| Google-Read-Aloud | Fetches pages for text-to-speech playback |
| FeedFetcher-Google | Fetches RSS and Atom feeds for Google News and WebSub |
| Google-CWS | Fetches URLs from Chrome Web Store listings |
| GoogleMessages | Generates link previews in Google Messages |
| Google-Pinpoint | Fetches URLs for Pinpoint document collections |
| GoogleProducer | Processes publisher feeds for Google News landing pages |
| Google-Site-Verification | Fetches Search Console verification tokens |
All of them share the blanket rule: user-initiated, therefore robots.txt does not apply.
ChatGPT-User (OpenAI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)`
- Purpose: Visits URLs when users ask ChatGPT or a CustomGPT to browse a specific page. Not used for training.
- Respects `robots.txt`: Yes. OpenAI treats ChatGPT-User as an independently controllable bot. Disallowing it in `robots.txt` prevents ChatGPT from browsing the website when a user asks.
- Published IP ranges: `https://openai.com/chatgpt-user.json`
- Primary source: `https://platform.openai.com/docs/bots`
Claude-User (Anthropic)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; [email protected])`
- Purpose: Fetches web pages when a Claude user asks a question that requires current information. Not used for training.
- Respects `robots.txt`: Yes
- Published IP ranges: None published
- Primary source: `https://support.claude.com/en/articles/8896518`
Anthropic's position stands in direct contrast to Google's. Claude-User and Google-Agent perform functionally identical jobs, but Anthropic honors robots.txt and Google does not. An operator who wants to uniformly block user-triggered fetchers through robots.txt will succeed against Claude-User and fail against Google-Agent.
Perplexity-User
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)`
- Purpose: Real-time retrieval during Perplexity user queries
- Respects `robots.txt`: Per Perplexity's documentation, yes. Caveat: the same August 2025 Cloudflare investigation that flagged PerplexityBot also named Perplexity-User behavior as inconsistent with the stated policy.
- Published IP ranges: `https://www.perplexity.com/perplexity-user.json`
- Primary source: `https://docs.perplexity.ai/guides/bots`
MistralAI-User (Mistral AI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots)`
- Purpose: User-triggered retrieval when a Le Chat user issues a query requiring real-time data
- Respects `robots.txt`: Yes, per Mistral documentation
- Published IP ranges: None published
- Primary source: `https://docs.mistral.ai/robots`
The user-triggered fetcher takeaway
The key finding for this category, worth stating as a standalone sentence for future reference:
Respecting robots.txt in user-triggered fetchers is a vendor-level policy, not a category-level one. Google ignores robots.txt across its user-triggered fetchers as a blanket rule. Anthropic and OpenAI respect it. Perplexity claims to and has been documented as inconsistent. Mistral respects it. The practical consequence for an operator is that a uniform "block user-triggered fetchers" policy through robots.txt only partially works, and the parts it does not cover require server-side controls.
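For the fetchers that do publish range files, the server-side control reduces to a range-membership check. The sketch below assumes the Google-style JSON shape with a `prefixes` array of `ipv4Prefix`/`ipv6Prefix` entries; verify the shape of each vendor's file before relying on it. The range document here is invented for illustration:

```python
import ipaddress

def ip_in_published_ranges(ip: str, ranges_doc: dict) -> bool:
    """Check a visitor IP against a vendor's published IP range file.

    Assumes the Google-style shape:
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    Adjust the keys for vendors whose files differ.
    """
    addr = ipaddress.ip_address(ip)
    for entry in ranges_doc.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # Membership across mismatched IP versions is simply False,
        # so mixed v4/v6 entries are safe to test in one pass.
        if prefix and addr in ipaddress.ip_network(prefix):
            return True
    return False

# Made-up range document; in production, fetch and cache the vendor JSON.
doc = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
                    {"ipv6Prefix": "2001:db8::/32"}]}
print(ip_in_published_ranges("192.0.2.17", doc))    # True
print(ip_in_published_ranges("198.51.100.1", doc))  # False
```

Paired with a user-agent match, this is the pattern that blocks (or allows) Google-Agent traffic that robots.txt cannot reach.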
Category 4: Opt-Out Tokens (Not Crawlers)
Two commonly mentioned "AI bots" are not crawlers at all. They are robots.txt control tokens, which means they never issue HTTP requests, never appear in access logs, and cannot be blocked at the firewall or CDN level. They exist only as User-agent strings inside robots.txt files.
The distinction between opt-out tokens and actual crawlers matters because most SEO listicles treat Google-Extended and Applebot-Extended as if they were crawlers, and operators who go looking for those names in their logs will never find them.
Google-Extended
- What it is: A `robots.txt` control token, not a crawler.
- Vendor quote: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity."
- What it controls: Whether content crawled by Google's existing crawlers may be used for training Gemini models and for grounding in Gemini Apps and Vertex AI generative features.
- Does not affect: Google Search inclusion or ranking. Google states explicitly that `Google-Extended` has no search impact.
- Launch: September 28, 2023
- Primary source: `https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers`
Example robots.txt entry:
```
User-agent: Google-Extended
Disallow: /
```
Applebot-Extended
- What it is: A `robots.txt` control token, not a crawler.
- Vendor quote: "Applebot-Extended does not crawl webpages. Instead, Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent."
- What it controls: Whether content crawled by Applebot may be used to train Apple's foundation models powering Apple Intelligence.
- Does not affect: Siri, Spotlight, or Safari Suggestions inclusion. Apple states that webpages disallowing `Applebot-Extended` still appear in search results.
- Launch: 2024, alongside the Apple Intelligence announcement
- Primary source: `https://support.apple.com/en-us/119829`
Example robots.txt entry:
```
User-agent: Applebot-Extended
Disallow: /
```
The critical distinction
Google-Extended and Applebot-Extended are the AI training opt-out layer for the two largest search-and-AI platforms. They are the mechanism for an operator who wants to stay visible in search (Google Search, Apple Spotlight) while opting out of training the generative AI features those same vendors ship. They are architecturally different from every other entry in this article because they do not describe traffic, they describe intent.
A common operator mistake is adding a Disallow: / for these tokens and then assuming the website is protected from all Google and Apple AI crawling. It is not. The tokens only control the training use of content crawled by the underlying search crawlers. Traffic from Google-Agent, Google-CloudVertexBot, or any Google fetcher category is governed by different rules in different places.
Category 5: Undeclared and Masquerading Traffic
The first four categories describe bots that identify themselves. Category 5 describes traffic that does not, or that claims to be something it is not. This is the hardest category to manage because the operator cannot make a policy decision about a bot they cannot reliably identify.
Bytespider (ByteDance)
ByteDance operates a crawler widely reported to train Doubao and other ByteDance LLMs. The commonly cited user agent is Mozilla/5.0 (compatible; Bytespider; [email protected]), but ByteDance has not published a vendor documentation page describing Bytespider's purpose, robots.txt position, or IP ranges. Behavioral reports from CDN operators describe robots.txt non-compliance and user agent spoofing.
The operational reality: an operator who wants to block Bytespider cannot rely on vendor cooperation. Blocks typically happen at the WAF or CDN layer using behavioral signals rather than robots.txt directives.
xAI's Grok crawler
xAI publishes no crawler documentation page. Multiple behavioral reports describe Grok's retrieval traffic as using residential IP rotation and spoofed Safari or Chrome user agents, making it functionally indistinguishable from a human visitor at the UA layer. No JSON IP range file exists. No robots.txt contract exists. From the website operator's perspective, Grok fetches look like unidentified browser traffic, with no AI bot signal to act on.
xAI's Grok operation is a high-profile example of why Category 5 needs to exist as a category at all. A comprehensive AI bot policy that relies on user-agent classification will have a blind spot the size of xAI's entire crawling operation.
Microsoft Copilot Actions
Microsoft's Copilot Actions (the agentic automation feature in Microsoft 365 Copilot) uses standard Edge or Chromium browser user agents with no bot signal. Microsoft has not published a distinct Copilot-specific user agent for agentic traffic. From the operator's log perspective, Copilot Actions traffic looks like a regular Edge user.
Microsoft is not alone in this. Every agentic browser covered in The Agentic Browser Landscape in 2026 that operates by driving a real browser session (Claude for Chrome, Perplexity Comet, ChatGPT Atlas, Opera Neon, and others) produces traffic that looks like the underlying browser's user agent. The AI agent is present, but the HTTP layer is silent about it.
Firecrawl and multi-tenant scraping services
Firecrawl is a scraper-as-a-service platform. Its default FirecrawlAgent user agent respects robots.txt, but Firecrawl is multi-tenant: many different customers' agents and applications drive the same crawler. Blocking FirecrawlAgent blocks every downstream customer simultaneously, which may include legitimate agentic tools an operator actually wants to reach. The category is a policy trap more than a technical one.
Category 5 is about detection, not blocking
An operator dealing with Category 5 traffic is no longer in the land of robots.txt directives. The tools are IP reputation, behavioral detection, rate limiting, CAPTCHAs, server-side authentication, and (when the operator wants the traffic) cryptographic identity verification via Web Bot Auth. That last one is where the identity layer is heading.
The Identity Layer: Beyond User-Agent Strings
User agent strings can be spoofed by anyone. For the Categories 1 through 4 that publish documentation, operators can combine the declared UA string with reverse DNS verification or IP allow-listing against published JSON files. For Category 5 traffic, that method fails.
Before a website can optimize for machines (which is the whole premise of machine-first architecture), it needs to know which machines are visiting. Three mechanisms are converging to solve that identification problem in a way that survives spoofing.
Web Bot Auth
Web Bot Auth is an IETF draft protocol (draft-meunier-web-bot-auth-architecture, currently at version -05 published March 2, 2026, authored by Thibault Meunier of Cloudflare and Sandor Major of Google) that applies HTTP Message Signatures (RFC 9421) to automated traffic. Each bot operator generates an Ed25519 keypair, publishes the public key as a JWKS at /.well-known/http-message-signatures-directory on a domain the operator controls, and signs every outbound HTTP request. Websites verify the signature and know, with cryptographic certainty, that the visitor came from the claimed operator.
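To make the mechanism concrete, here is a minimal sketch of the sign-and-verify round trip, using the third-party `cryptography` package. The signature base construction is simplified for illustration (RFC 9421 specifies the exact component serialization rules), and a real origin would fetch the public key from the operator's published JWKS rather than holding both keys in one process:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

def signature_base(components: list[tuple[str, str]], params: str) -> bytes:
    # Simplified RFC 9421 shape: each covered component on its own line,
    # followed by the @signature-params line.
    lines = [f'"{name}": {value}' for name, value in components]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines).encode()

# Bot side: sign the outbound request with the operator's private key.
key = Ed25519PrivateKey.generate()
params = '("@authority" "signature-agent");created=1735689600;keyid="demo-key"'
base = signature_base(
    [("@authority", "example.com"),
     ("signature-agent", '"https://agent.bot.goog"')],
    params,
)
signature = key.sign(base)

# Origin side: verify with the operator's public key (in deployment, fetched
# from /.well-known/http-message-signatures-directory on the claimed domain).
public_key = key.public_key()
try:
    public_key.verify(signature, base)
    verified = True
except InvalidSignature:
    verified = False
print(verified)  # True
```

The point of the design is that the verification step depends only on the published public key, so a spoofer who copies the user-agent string still fails the signature check.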
An IETF Working Group was chartered in early 2026 specifically for this work. The webbotauth WG has published milestones targeting standards-track specifications for authentication techniques and bot information mechanisms by April 2026, and a Best Current Practice operational document by August 2026. The protocol is moving from individual draft to standards-track faster than most IETF work.
Vendor support verified from primary sources:
- Google: Experimenting. Vendor quote: "Google is also experimenting with the `web-bot-auth` protocol, using the `https://agent.bot.goog` identity." Applied specifically to Google-Agent at the time of writing.
- Cloudflare: Fully implemented. Web Bot Auth is integrated into Cloudflare's Verified Bots program, with signature verification happening at the Cloudflare edge. Cloudflare publishes the reference implementation at `github.com/cloudflare/web-bot-auth` in Rust and TypeScript.
- Akamai: Supports Web Bot Auth signature verification on the Akamai edge.
- Amazon Bedrock AgentCore Browser: Preview support announced October 2025. AgentCore Browser signs outbound HTTP requests with a private key when Web Bot Auth is enabled.
- AWS + Cloudflare joint key registry: Announced February 2026. A collaborative open format for publishing agent public keys at scale.
OpenAI has been named as an ecosystem partner in Akamai and AWS documentation describing Web Bot Auth, but has not published its own vendor documentation confirming adoption at the time of writing. Anthropic, Perplexity, and Mistral have not documented Web Bot Auth support.
Verified bot directories
For operators who do not run their own signature verification, CDN-level bot directories are the practical fallback.
- Cloudflare Verified Bots is a curated list of bots Cloudflare approves and attaches identity to at the edge. Available as the `cf.verified_bot` boolean and `cf.verified_bot_category` string for use in WAF and rate limiting rules. Historically IP-based; as of 2025, Web Bot Auth signatures are a first-class verification method.
- Akamai Bot Manager classifies bots in the "agentic era" with intent and identity tags that go beyond good/bad, available to Akamai customers.
- Known Agents (the service formerly known as Dark Visitors, at knownagents.com) is a community-curated catalog of AI bot user agents with auto-generated `robots.txt` blocks. Not an authoritative source in the way Cloudflare and Akamai are, but the most comprehensive public catalog for operators who want a starting list.
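The two Cloudflare fields named above combine directly in a WAF custom rule expression. A sketch only; the exact category string value is an assumption, so check the values Cloudflare documents for your zone:

```
(cf.verified_bot and cf.verified_bot_category eq "AI Crawler")
```

An operator could attach this expression to a skip rule that exempts verified AI crawlers from a rate limit, or invert the `cf.verified_bot` check to challenge traffic that claims a bot UA but fails verification.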
Reverse DNS and IP verification (the old way, still useful)
For bots that publish IP ranges, reverse DNS with forward confirmation remains the baseline verification method. The pattern is: take the visitor's IP, look up the PTR record, confirm it resolves to a vendor-controlled domain (e.g., *.googlebot.com or *.applebot.apple.com), then do a forward A or AAAA lookup on that hostname and confirm it matches the original IP. This protects against UA spoofing and has been the standard verification technique for over a decade.
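The forward-confirmed pattern described above can be sketched in a few lines. The IP and domain suffixes in the usage comment are illustrative; take the real suffixes from each vendor's documentation:

```python
import socket

def verify_bot_ip(ip: str, allowed_suffixes: tuple) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, vendor-domain
    suffix check, then a forward lookup that must round-trip
    back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # PTR record
    except OSError:
        return False
    if not hostname.endswith(allowed_suffixes):    # vendor-controlled domain?
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward A lookup
    except OSError:
        return False
    return ip in forward_ips                       # must match the original IP

# Usage (requires DNS; suffixes per the vendor's docs):
# verify_bot_ip("66.249.66.1", (".googlebot.com", ".google.com"))
```

Because the final forward lookup must return the original IP, a spoofer who controls only the PTR record of their own address space cannot pass the check.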
A Note on llms.txt
Discussion of AI user agents eventually reaches llms.txt, the proposed convention for Markdown files at /llms.txt that describe a website's content to AI systems. This section is deliberately honest about where llms.txt stands in April 2026.
What llms.txt is
llms.txt is a proposal by Jeremy Howard of Answer.AI, published September 3, 2024. The specification lives at llmstxt.org and in the AnswerDotAI/llms-txt GitHub repository. The format is a Markdown file at the root of a domain, with an H1 site name (required), a blockquote summary, and H2 sections containing link lists that describe the website's key resources. A companion llms-full.txt convention is a single consolidated Markdown dump of an entire documentation set intended for direct ingestion into an LLM context window.
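The format is small enough to show whole. A minimal illustrative file following the structure at llmstxt.org (all names and URLs here are placeholders):

```
# Example Docs

> Short summary of what this site covers and who it is for.

## Documentation

- [Getting started](https://example.com/docs/start.md): first steps for new users
- [API reference](https://example.com/docs/api.md): full endpoint list

## Optional

- [Changelog](https://example.com/changelog.md)
```

Per the proposal, the H1 is the only required element, and links under a section named "Optional" may be skipped by consumers working within a tight context window.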
What llms.txt is not
llms.txt is not an IETF draft. It is not a W3C recommendation. It is not endorsed by any standards body. It is an informal community proposal that has remained in the same status for nineteen months.
Who publishes llms.txt
Adoption on the publishing side is real. Anthropic publishes llms.txt and llms-full.txt at docs.claude.com. OpenAI publishes at platform.openai.com/docs/llms.txt. Perplexity publishes at docs.perplexity.ai/llms-full.txt. Cloudflare publishes a large product-segmented llms.txt. Mintlify, the documentation platform, generates llms.txt automatically for all hosted documentation sites, which is the single largest driver of publish-side adoption. Google added an llms.txt to its own developer documentation in December 2025.
Who reads llms.txt
As of April 2026, no major LLM vendor has documented that their crawlers or retrieval systems consume llms.txt files from external websites. Publishing llms.txt is currently a symbolic gesture, not an operational control.
Google has gone further and said so explicitly. Both Gary Illyes and John Mueller, Googlers who speak publicly about crawler behavior, stated during 2025 that Google does not support llms.txt and that no AI system currently uses it as an operational signal. Those statements are the most-cited official Google position on the format and have not been walked back.
OpenAI, Anthropic, Perplexity, Meta, and Mistral have all published llms.txt files for their own documentation but have not committed in vendor documentation to reading the format from other websites. Observational data from CDN providers shows minimal crawler traffic to llms.txt files on third-party websites.
Should you publish llms.txt anyway?
The honest answer is that publishing llms.txt costs almost nothing and protects against the possibility that the format does eventually get picked up by one or more vendors. Treat it the same way you treated early sitemap.xml: a cheap signal that may or may not matter, worth deploying as a hedge rather than as a load-bearing SEO tactic.
Do not rely on llms.txt as an AI access control mechanism. It is not one, and the vendors that operate the crawlers have either said nothing or said no.
This website publishes an llms.txt and an llms-full.txt. Neither is believed to have material effect on AI retrieval in 2026.
How to Identify AI Bots in Your Server Logs
Three practical techniques for operators who want to start measuring AI agent traffic today.
1. Grep the user agent string
Most declared AI bots include a recognizable substring. Searching access logs for any of the following catches most traffic in Categories 1 through 3:
GPTBot|OAI-SearchBot|ChatGPT-User
ClaudeBot|Claude-User|Claude-SearchBot
PerplexityBot|Perplexity-User
Google-Agent|Google-NotebookLM|Google-Read-Aloud|Google-CloudVertexBot
Amazonbot|CCBot|Applebot
meta-externalagent
DuckAssistBot
MistralAI-User
bingbot
The pattern captures the declared AI user agents. It does not catch Category 5 traffic.
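The same alternation can run as a regex scan over access-log lines, which makes it easy to tally hits per bot rather than just find them. A sketch (the sample log line is invented):

```python
import re

# Alternation assembled from the substrings listed above
AI_BOT_RE = re.compile(
    r"GPTBot|OAI-SearchBot|ChatGPT-User"
    r"|ClaudeBot|Claude-User|Claude-SearchBot"
    r"|PerplexityBot|Perplexity-User"
    r"|Google-Agent|Google-NotebookLM|Google-Read-Aloud|Google-CloudVertexBot"
    r"|Amazonbot|CCBot|Applebot"
    r"|meta-externalagent|DuckAssistBot|MistralAI-User|bingbot"
)

def ai_bot_hits(log_lines):
    """Yield (bot_name, line) for every log line whose UA matches."""
    for line in log_lines:
        m = AI_BOT_RE.search(line)
        if m:
            yield m.group(0), line

# Illustrative log line, not a real capture:
sample = ['203.0.113.7 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 ... GPTBot/1.2"']
for bot, _ in ai_bot_hits(sample):
    print(bot)  # GPTBot
```

Feeding the generator into `collections.Counter` gives a per-bot request count, which is the measurement baseline the policy section below assumes you have.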
2. Cross-reference IP ranges
For higher-confidence identification, compare the visitor's IP against the vendor's published JSON IP range file. OpenAI, Google, Common Crawl, Perplexity, Bing, and others publish machine-readable ranges. This is the canonical verification technique for operators who want to be certain a request claiming to be GPTBot actually came from OpenAI.
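Membership testing against a published range file is a few lines with the standard library. The JSON layout below (`prefixes` entries with `ipv4Prefix`/`ipv6Prefix` keys) is an assumption modelled on common vendor formats, and the URL in the usage comment is illustrative; check each vendor's documentation for the real file location and shape:

```python
import ipaddress
import json
from urllib.request import urlopen

def load_ranges(url):
    """Fetch a vendor's published JSON range file and parse it into
    network objects. Assumes a {"prefixes": [{"ipv4Prefix": ...}]}
    layout; adjust per vendor."""
    with urlopen(url) as resp:
        data = json.load(resp)
    nets = []
    for p in data.get("prefixes", []):
        cidr = p.get("ipv4Prefix") or p.get("ipv6Prefix")
        if cidr:
            nets.append(ipaddress.ip_network(cidr))
    return nets

def ip_in_ranges(ip, nets):
    """True if the visitor IP falls inside any published network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in n for n in nets)

# Usage sketch (URL illustrative -- see the vendor's crawler docs):
# nets = load_ranges("https://example.com/gptbot-ranges.json")
# ip_in_ranges("203.0.113.9", nets)
```

Caching the parsed ranges and refreshing them daily is usually enough; the vendor files change infrequently.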
3. Reverse DNS with forward confirmation
For vendors that do not publish IP range files (Anthropic is the notable case), reverse DNS is the fallback. dig -x <ip> returns a PTR record. If the hostname belongs to the claimed vendor (e.g., *.googlebot.com or *.applebot.apple.com), a forward lookup on the hostname should return the original IP. If it does not, the request is a spoof.
The Access Policy Matrix
This section does not tell you what to allow or block. Operators have different threat models, business models, and competitive positions. It gives the framework for making an informed decision.
Four questions to ask for every category:
- Does this traffic bring business value? Retrieval crawlers bring AI search visibility, which may be directly measurable through referral traffic from ChatGPT, Perplexity, and others. Training crawlers bring no immediate value but contribute to long-term model capability. User-triggered fetchers bring one-off task completion value from specific users who sent an AI to the website on purpose. Undeclared traffic brings neither value nor accountability.
- Does the vendor respect your stated wishes? Most declared training and retrieval crawlers respect `robots.txt`. User-triggered fetchers uniformly do not. Category 5 traffic ignores `robots.txt` by definition.
- Is the vendor transparent about what they do with the data? Vendors with documented purposes, IP range files, and crawler contact addresses are accountable. Vendors with no documentation are not. That transparency itself is policy-relevant.
- What does blocking cost you? Blocking OAI-SearchBot removes the website from ChatGPT Search. Blocking Claude-User prevents Claude from completing research tasks users asked it to perform on your behalf. Blocking CCBot removes the website from the upstream dataset that trains most open-source LLMs. Every block has a cost, and the cost is different per category.
A defensible default for most businesses:
- Allow Category 2 (search and retrieval crawlers).
- Make an explicit decision on Category 1 (training crawlers) based on content policy.
- Rely on Category 4 tokens (Google-Extended, Applebot-Extended) to cleanly opt out of Google and Apple training while keeping search inclusion.
- Accept that Category 3 fetchers that ignore `robots.txt` require server-side controls or nothing at all.
- Accept that Category 5 is a detection problem rather than a `robots.txt` problem.
A sample starter robots.txt:
# Allow AI search and retrieval crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Opt out of Google and Apple generative AI training
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Explicit decisions on training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
The above is a starting point for an operator who wants AI search visibility without contributing to foundation model training. Operators with different priorities should change it. The point is that the decision is per-category, not global.
What This Means for Your Website
The practical actions for an operator in April 2026 are:
- Start logging AI bot traffic now. Whatever policy you settle on, it should be informed by real data about which bots are visiting and how often. Every published IP range and documented UA string above is a starting point for instrumentation.
- Stop treating "AI bots" as one category. The policy decision for GPTBot is not the policy decision for Google-Agent is not the policy decision for Bytespider. One rule in `robots.txt` cannot make three different decisions.
- Decide about training separately from decisions about retrieval. Google-Extended and Applebot-Extended exist specifically so those decisions can be separated. Use them.
- Accept that `robots.txt` no longer carries the full load. For Google user-triggered fetchers, Category 5 traffic, and any operator who needs a hard block that cannot be bypassed, server-side controls are the answer. `robots.txt` is a signal, not a contract, and for roughly half the AI traffic in 2026, it is a signal that will be ignored.
- Watch the identity layer. Web Bot Auth is where verification is heading. Within the next twelve months, cryptographic agent identity will move from Google and Cloudflare experiments to a standards-track protocol with multi-vendor support. Operators who start tracking `Signature-Agent` headers today will be ready for the shift.
The web now has a machine-readable identity layer in addition to its human-readable one. The identities are still being issued, the standards are still being written, and the policies are still being decided. The only wrong move is pretending it's still 2023 and one Disallow: / covers everything.

