Blocking AI bots used to be one line in robots.txt. In 2026, one line is how you either disappear from AI search or feed every training pipeline you wanted to opt out of.
Most websites still treat "AI bots" as one category. There are at least five functionally distinct categories of AI user agents hitting websites in 2026, and the access rules, identity mechanisms, and consequences differ sharply across them. Treating them as one bucket either over-blocks (and disappears from AI answers) or under-blocks (and feeds training pipelines the operator never consented to). Understanding which bots are visiting and what each one does is the infrastructure question that sits underneath every pillar of machine-first architecture: you cannot optimize your website for machines if you do not know which machines are showing up.
The scale is already meaningful. In March 2025, Cloudflare reported that AI crawlers were generating more than 50 billion requests per day across its network, or just under 1% of all web requests Cloudflare observed. By January 2026, Cloudflare's two-month analysis of Googlebot against other AI crawlers showed Googlebot reaching 1.70× more unique URLs than ClaudeBot, 1.76× more than GPTBot, 2.99× more than Meta-ExternalAgent, 3.26× more than Bingbot, 167× more than PerplexityBot, and 714× more than CCBot. The volume distribution is heavily skewed toward a few players, and the skew is shifting fast.
This article is the reference version of that taxonomy. Every user agent string, robots.txt stance, published IP range URL, and Web Bot Auth status below was verified against the primary vendor documentation cited in the entry. Where primary documentation does not exist, the entry says so explicitly. Two of the most commonly mentioned AI crawlers (Bytespider and xAI's Grok crawler) have no official vendor documentation page at all, which is itself useful information for operators making policy decisions.
This article complements two existing references on this website: The Agentic Browser Landscape in 2026 (the tools human users ride inside to reach the web through AI) and Google-Agent: The Web's New Visitor Just Got an Identity (a deep dive on one specific user-triggered fetcher and the Web Bot Auth concept it introduced).
Initial publication: April 13, 2026. Updated monthly as vendor documentation changes, new crawlers are introduced, and the IETF Web Bot Auth working group publishes new drafts.
Contents
- Why "AI bots" is not one category
- The five categories at a glance
- Category 1: Training Crawlers
- Category 2: Search and Retrieval Crawlers
- Category 3: User-Triggered Fetchers
- Category 4: Opt-Out Tokens (Not Crawlers)
- Category 5: Undeclared and Masquerading Traffic
- The Identity Layer: Beyond User-Agent Strings
- A Note on llms.txt
- How to Identify AI Bots in Your Server Logs
- The Access Policy Matrix
- What This Means for Your Website
Why "AI bots" is not one category
A training crawler fetching content to improve GPT-6 has nothing in common with a user-triggered fetcher completing a one-off research task for a specific human who just typed a query into Gemini. They use different user agent strings, follow different rules, are operated by different teams, and have different consequences for the website owner. An operator who blocks both with the same directive is either losing AI search visibility they wanted (by blocking retrieval) or feeding training pipelines they wanted to opt out of (by allowing training crawlers that ignore the generic directive they relied on).
The root problem is that robots.txt was designed in 1994 for a world with one kind of crawler. AI traffic in 2026 has at least five, each with its own rules of engagement.
The five categories at a glance
| Category | Purpose | Example bots | Respects robots.txt | Visible in logs |
|---|---|---|---|---|
| 1. Training Crawlers | Fetch content to train LLMs | GPTBot, ClaudeBot, Amazonbot, Meta-ExternalAgent, CCBot | Vendor-dependent | Yes |
| 2. Search and Retrieval Crawlers | Fetch to build AI retrieval indexes | OAI-SearchBot, Claude-SearchBot, PerplexityBot, Bingbot | Yes (vendor-documented) | Yes |
| 3. User-Triggered Fetchers | Fetch when a specific human asks | Google-Agent, ChatGPT-User, Claude-User, Perplexity-User | Vendor-dependent | Yes |
| 4. Opt-Out Tokens | Robots.txt control directives | Google-Extended, Applebot-Extended | N/A (not crawlers) | No |
| 5. Undeclared and Masquerading | Scrape without identifying | Bytespider, xAI Grok, Copilot Actions | No (or unverifiable) | Only with active detection |
Two notes to read the table with:
- Respecting `robots.txt` is a vendor-level property, not a category-level one. In the User-Triggered Fetchers row, Google's fetchers (including Google-Agent) ignore `robots.txt`, while OpenAI's and Anthropic's fetchers respect it. The detail matters because the policy decision follows the vendor, not the category.
- Opt-out tokens never appear in access logs. `Google-Extended` and `Applebot-Extended` are directives for existing crawlers, not new crawlers themselves. They make no HTTP requests. The distinction between opt-out tokens and actual crawlers is widely misreported by SEO listicles.
Category 1: Training Crawlers
Training crawlers fetch content to train large language models. Blocking them is how a website opts out of being part of future model capabilities. Allowing them contributes content to the pretraining corpora that power LLMs. Either decision has consequences, and the decision is separate from the question of whether the website should appear in AI search answers today.
The opt-out has become a mass movement. In August 2025, Cloudflare reported that over 2.5 million websites had chosen to completely disallow AI training through its managed robots.txt feature or its managed rule blocking AI crawlers, and the number has grown since.
GPTBot (OpenAI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot)`
- Purpose: Training data collection for future OpenAI models
- Respects `robots.txt`: Yes
- Published IP ranges: `https://openai.com/gptbot.json`
- Primary source: `https://platform.openai.com/docs/bots`
OpenAI runs three distinct bots (GPTBot, OAI-SearchBot, ChatGPT-User), each with its own IP JSON file and its own robots.txt control token. An operator who wants to allow search but block training can Disallow GPTBot only, and still appear in ChatGPT Search results served by OAI-SearchBot.
ClaudeBot (Anthropic)
- User-agent: `Mozilla/5.0 (compatible; ClaudeBot/1.0; [email protected])`
- Purpose: Training data collection for Claude
- Respects `robots.txt`: Yes. Anthropic's documentation: "Anthropic's Bots respect 'do not crawl' signals by honoring industry standard directives in robots.txt." Also supports the non-standard `Crawl-delay`.
- Published IP ranges: None published. Anthropic recommends `robots.txt` control rather than IP allow-listing.
- Primary source: `https://support.claude.com/en/articles/8896518`
Anthropic consolidated its crawler fleet in 2024 to three bots: ClaudeBot (training), Claude-User (user-triggered), and Claude-SearchBot (retrieval indexing). The older anthropic-ai and claude-web user agent tokens are deprecated. robots.txt rules targeting those legacy strings are ineffective against current Claude traffic. Many SEO listicles still recommend blocking them.
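As a concrete illustration, a robots.txt that opts out of Claude training while leaving retrieval and user-triggered access intact targets the current token, not the deprecated ones:

```
# Opt out of Claude training only; Claude-User and Claude-SearchBot
# remain unaffected unless targeted separately
User-agent: ClaudeBot
Disallow: /

# Deprecated tokens -- rules against these no longer match Anthropic traffic:
# User-agent: anthropic-ai
# User-agent: claude-web
```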
Amazonbot
- User-agent: `Mozilla/5.0 (compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36`
- Purpose: Indexing for Amazon services (Alexa, product intelligence) and LLM training for Amazon Nova
- Respects `robots.txt`: Yes
- Published IP ranges: Not published as JSON. Amazon documents reverse DNS verification as the recommended method.
- Primary source: `https://developer.amazon.com/amazonbot`
Amazonbot uses a mechanism no other major AI crawler uses: a page-level noarchive meta tag as the training opt-out signal. An operator who allows Amazonbot but does not want content used for Nova training can add <meta name="robots" content="noarchive"> to individual pages. This approach has the advantage of working per page rather than per domain, and the disadvantage of being unique to Amazon, which means it cannot be part of a generic opt-out policy.
Meta-ExternalAgent
- User-agent: `meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)`
- Purpose: Training and AI indexing for Meta's AI products across Facebook, Instagram, WhatsApp, and Messenger
- Respects `robots.txt`: Yes
- Published IP ranges: None published
- Primary source: `https://developers.facebook.com/docs/sharing/webmasters/web-crawlers/`
Do not confuse Meta-ExternalAgent with facebookexternalhit. The latter is Meta's long-running link-preview fetcher for Open Graph metadata when URLs are shared in Facebook posts or WhatsApp messages. It is not an AI bot, does not feed training, and blocking it breaks link previews across Meta's products. Many AI bot blocklists conflate them and break link sharing as a side effect.
CCBot (Common Crawl)
- User-agent: `CCBot/2.0 (https://commoncrawl.org/faq/)`
- Purpose: Open web corpus collection. Downstream consumers include most major LLM training datasets.
- Respects `robots.txt`: Yes
- Published IP ranges: `https://index.commoncrawl.org/ccbot.json`
- Primary source: `https://commoncrawl.org/ccbot`
CCBot is worth understanding because blocking it has outsized consequences. Common Crawl is the upstream source for nearly every open LLM training dataset, including (historically) OpenAI's GPT-3 training corpus. A website that blocks CCBot opts out of being in the training data for most open-source and open-weight model releases, not just one vendor's. That is a larger opt-out than most operators realize.
Applebot
- User-agent: `Mozilla/5.0 ... (Applebot/Applebot_version; +http://www.apple.com/go/applebot)`
- Purpose: Indexing for Siri, Spotlight, Safari Suggestions, and Apple Intelligence generative features (the latter controlled via Applebot-Extended)
- Respects `robots.txt`: Yes. Falls back to Googlebot directives when no Applebot-specific rules exist.
- Published IP ranges: Published via a CIDR file linked from Apple's documentation. Reverse DNS of `*.applebot.apple.com` is also valid for verification.
- Primary source: `https://support.apple.com/en-us/119829`
Applebot is a search/retrieval crawler by function, but its AI training role is controlled by the separate Applebot-Extended opt-out token (documented in Category 4). Blocking Applebot removes the website from Siri, Spotlight, and Safari Suggestions. Blocking Applebot-Extended only opts out of Apple Intelligence training while keeping search inclusion.
Other training crawlers worth knowing
- PerplexityBot (Perplexity) is documented by Perplexity as an indexing crawler, not a training crawler. Perplexity does not build foundation models. It is listed in Category 2.
- DuckAssistBot (DuckDuckGo) is documented as retrieval for DuckDuckGo's AI-assisted answers, explicitly not training. See Category 2.
- Diffbot is a commercial structured-data extraction service that respects `robots.txt` by default but can be configured per customer. Used for Knowledge Graph construction and some training pipelines.
- MistralAI-User (Mistral AI) is documented as a user-triggered retrieval fetcher, not a training crawler. See Category 3.
- ImagesiftBot (Hive) is an image-focused crawler whose documented purpose is image discovery for ImageSift reverse-image search. Its relationship to Hive's visual AI model training is not documented by the vendor.
- Bytespider (ByteDance) is widely reported as a training crawler but publishes no vendor documentation page. ByteDance has not confirmed its purpose, `robots.txt` stance, or IP ranges. Treat as undocumented (Category 5).
- Cohere's training crawler similarly has no vendor documentation page. Behavioral reports describe standard `robots.txt` compliance, but the vendor has not confirmed it.
Category 2: Search and Retrieval Crawlers
Search and retrieval crawlers fetch content to build an AI system's real-time answer index. Blocking them removes the website from the search results and cited answers those systems produce. Allowing them is how a website appears in ChatGPT Search, Claude retrieval, Perplexity answers, and other AI search surfaces.
Unlike training crawlers, retrieval crawlers are visibility infrastructure. Blocking them is a direct trade-off with AI search presence today.
OAI-SearchBot (OpenAI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)`
- Purpose: Building the search index used in ChatGPT's search features. Not used for training.
- Respects `robots.txt`: Yes
- Published IP ranges: `https://openai.com/searchbot.json`
- Primary source: `https://platform.openai.com/docs/bots`
OpenAI's documentation explicitly states that operators can allow OAI-SearchBot while disallowing GPTBot. That is the canonical pattern for an operator who wants ChatGPT Search presence without contributing content to foundation model training.
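Expressed as a robots.txt, the canonical search-without-training pattern for OpenAI looks like:

```
# Stay visible in ChatGPT Search, opt out of model training
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```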
Claude-SearchBot (Anthropic)
- User-agent: `Claude-SearchBot`
- Purpose: Systematic indexing of publicly available content to improve Claude search quality
- Respects `robots.txt`: Yes
- Published IP ranges: None published
- Primary source: `https://support.claude.com/en/articles/8896518`
Claude-SearchBot is independently controllable from ClaudeBot. An operator can allow Claude-SearchBot to appear in Claude's retrieval while disallowing ClaudeBot to keep content out of training data.
PerplexityBot (Perplexity)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`
- Purpose: Indexing for Perplexity answers and citations. Perplexity does not build foundation models.
- Respects `robots.txt`: Per Perplexity's documentation, yes. Caveat: Cloudflare published independent evidence in August 2025 documenting that Perplexity uses undeclared crawlers that circumvent `robots.txt` directives. Cloudflare subsequently de-listed Perplexity from its Verified Bots program over the findings.
- Published IP ranges: `https://www.perplexity.com/perplexitybot.json`
- Primary source: `https://docs.perplexity.ai/guides/bots`
Perplexity is the clearest case of published vendor documentation and observed behavior diverging. Operators who care about a hard block should not rely on robots.txt alone for Perplexity traffic.
Google-CloudVertexBot
- User-agent substring: `Google-CloudVertexBot`
- Purpose: Fetches websites on site owners' request when building Vertex AI Agents for enterprise retrieval and grounding
- Respects `robots.txt`: Yes
- Published IP ranges: Covered by `https://developers.google.com/static/search/apis/ipranges/googlebot.json`
- Primary source: `https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers`
Google-CloudVertexBot is an enterprise crawler, separate from Googlebot and separate from Google-Agent. It fetches content for Vertex AI Agents a Google Cloud customer is building, typically when the site owner has opted in. Blocking it does not affect Google Search inclusion.
Bingbot
- User-agent (desktop): `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)`
- Purpose: Indexing for Bing Search and Microsoft Copilot retrieval
- Respects `robots.txt`: Yes
- Published IP ranges: `https://www.bing.com/toolbox/bingbot.json`
- Primary source: `https://www.bing.com/webmasters/help/which-crawlers-does-bing-use-8c184ec0`
Microsoft has not published a distinct Copilot-specific user agent. Bingbot feeds both Bing Search and Copilot's retrieval layer. A separate category is Microsoft's Copilot Actions (agentic automation), which uses standard Edge/Chromium browser user agents and no bot signal, making it invisible at the user-agent layer. That traffic belongs in Category 5.
DuckAssistBot (DuckDuckGo)
- User-agent: `Mozilla/5.0 (compatible; DuckAssistBot/1.0; +https://duckduckgo.com/duckassistbot)`
- Purpose: Retrieval for DuckDuckGo's AI-assisted answers. Explicitly not training.
- Respects `robots.txt`: Yes. DuckDuckGo documentation states data "is not used in any way to train AI models."
- Published IP ranges: Documented alongside DuckDuckBot
- Primary source: `https://duckduckgo.com/duckduckgo-help-pages/results/duckassistbot`
Search and retrieval comparison
| Bot | Operator | Respects robots.txt | IP ranges published |
|---|---|---|---|
| OAI-SearchBot | OpenAI | Yes | Yes (/searchbot.json) |
| Claude-SearchBot | Anthropic | Yes | No |
| PerplexityBot | Perplexity | Documented yes, observed no | Yes (/perplexitybot.json) |
| Google-CloudVertexBot | Google | Yes | Yes (under googlebot.json) |
| Bingbot | Microsoft | Yes | Yes (bingbot.json) |
| DuckAssistBot | DuckDuckGo | Yes | Yes |
Category 3: User-Triggered Fetchers
A user-triggered fetcher is a bot that visits a specific URL because a specific human asked an AI system to visit it. The fetch is a proxy request on behalf of the person who typed a query or clicked a link inside an AI product, not an autonomous crawl.
This is the category where vendor policies diverge the most. The robots.txt position is vendor-level, not category-level. Read the subsections carefully before assuming that blocking one fetcher blocks them all.
Google-Agent and Google's full fetcher list
- Google-Agent user-agent: Standard Chrome desktop and mobile strings, with fetches originating from `user-triggered-agents.json` IP ranges
- Purpose: Agents on Google infrastructure navigate the web and perform actions on the user's behalf, including Project Mariner
- Respects `robots.txt`: No. Google's documentation states verbatim: "Because the fetch was requested by a user, these fetchers generally ignore robots.txt rules."
- Web Bot Auth: Experimental. Vendor quote: "Google is also experimenting with the `web-bot-auth` protocol, using the `https://agent.bot.goog` identity."
- Published IP ranges: `https://developers.google.com/static/crawling/ipranges/user-triggered-agents.json` (a dedicated file, separate from other Google user-triggered fetchers)
- Primary source: `https://developers.google.com/crawling/docs/crawlers-fetchers/google-user-triggered-fetchers`
Google-Agent is the headline case in this category. It is the only Google fetcher with its own dedicated IP range file, the only Google bot with documented Web Bot Auth experimentation, and the archetypal example of a vendor that treats user-triggered fetches as browser-like requests that robots.txt does not apply to. I wrote about Google-Agent in depth in Google-Agent: The Web's New Visitor Just Got an Identity.
Google's user-triggered fetchers list is not limited to Google-Agent. The same robots.txt-ignoring policy covers a dozen other bots at the time of writing, all documented on the same Google page. The ones most commonly seen in server logs include:
| Bot | Purpose |
|---|---|
| Google-NotebookLM | Fetches URLs provided as NotebookLM project sources |
| Google-Read-Aloud | Fetches pages for text-to-speech playback |
| FeedFetcher-Google | Fetches RSS and Atom feeds for Google News and WebSub |
| Google-CWS | Fetches URLs from Chrome Web Store listings |
| GoogleMessages | Generates link previews in Google Messages |
| Google-Pinpoint | Fetches URLs for Pinpoint document collections |
| GoogleProducer | Processes publisher feeds for Google News landing pages |
| Google-Site-Verification | Fetches Search Console verification tokens |
All of them share the blanket rule: user-initiated, therefore robots.txt does not apply.
ChatGPT-User (OpenAI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/bot)`
- Purpose: Visits URLs when users ask ChatGPT or a CustomGPT to browse a specific page. Not used for training.
- Respects `robots.txt`: Yes. OpenAI treats ChatGPT-User as an independently controllable bot. Disallowing it in `robots.txt` prevents ChatGPT from browsing the website when a user asks.
- Published IP ranges: `https://openai.com/chatgpt-user.json`
- Primary source: `https://platform.openai.com/docs/bots`
Claude-User (Anthropic)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; [email protected])`
- Purpose: Fetches web pages when a Claude user asks a question that requires current information. Not used for training.
- Respects `robots.txt`: Yes
- Published IP ranges: None published
- Primary source: `https://support.claude.com/en/articles/8896518`
Anthropic's position stands in direct contrast to Google's. Claude-User and Google-Agent perform functionally identical jobs, but Anthropic honors robots.txt and Google does not. An operator who wants to uniformly block user-triggered fetchers through robots.txt will succeed against Claude-User and fail against Google-Agent.
Perplexity-User
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)`
- Purpose: Real-time retrieval during Perplexity user queries
- Respects `robots.txt`: Per Perplexity's documentation, yes. Caveat: the same August 2025 Cloudflare investigation that flagged PerplexityBot also named Perplexity-User behavior as inconsistent with the stated policy.
- Published IP ranges: `https://www.perplexity.com/perplexity-user.json`
- Primary source: `https://docs.perplexity.ai/guides/bots`
MistralAI-User (Mistral AI)
- User-agent: `Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots)`
- Purpose: User-triggered retrieval when a Le Chat user issues a query requiring real-time data
- Respects `robots.txt`: Yes, per Mistral documentation
- Published IP ranges: None published
- Primary source: `https://docs.mistral.ai/robots`
The user-triggered fetcher takeaway
The key finding for this category, worth stating as a standalone sentence for future reference:
Respecting robots.txt in user-triggered fetchers is a vendor-level policy, not a category-level one. Google ignores robots.txt across its user-triggered fetchers as a blanket rule. Anthropic and OpenAI respect it. Perplexity claims to and has been documented as inconsistent. Mistral respects it. The practical consequence for an operator is that a uniform "block user-triggered fetchers" policy through robots.txt only partially works, and the parts it does not cover require server-side controls.
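For the fetchers that do publish range files, the server-side control reduces to a range-membership check. The sketch below assumes the Google-style JSON shape with a `prefixes` array of `ipv4Prefix`/`ipv6Prefix` entries; verify the shape of each vendor's file before relying on it. The range document here is invented for illustration:

```python
import ipaddress

def ip_in_published_ranges(ip: str, ranges_doc: dict) -> bool:
    """Check a visitor IP against a vendor's published IP range file.

    Assumes the Google-style shape:
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    Adjust the keys for vendors whose files differ.
    """
    addr = ipaddress.ip_address(ip)
    for entry in ranges_doc.get("prefixes", []):
        prefix = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # Membership across mismatched IP versions is simply False,
        # so mixed v4/v6 entries are safe to test in one pass.
        if prefix and addr in ipaddress.ip_network(prefix):
            return True
    return False

# Made-up range document; in production, fetch and cache the vendor JSON.
doc = {"prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
                    {"ipv6Prefix": "2001:db8::/32"}]}
print(ip_in_published_ranges("192.0.2.17", doc))    # True
print(ip_in_published_ranges("198.51.100.1", doc))  # False
```

Paired with a user-agent match, this is the pattern that blocks (or allows) Google-Agent traffic that robots.txt cannot reach.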
Category 4: Opt-Out Tokens (Not Crawlers)
Two commonly mentioned "AI bots" are not crawlers at all. They are robots.txt control tokens, which means they never issue HTTP requests, never appear in access logs, and cannot be blocked at the firewall or CDN level. They exist only as User-agent strings inside robots.txt files.
The distinction between opt-out tokens and actual crawlers matters because most SEO listicles treat Google-Extended and Applebot-Extended as if they were crawlers, and operators who go looking for those names in their logs will never find them.
Google-Extended
- What it is: A `robots.txt` control token, not a crawler.
- Vendor quote: "Google-Extended doesn't have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity."
- What it controls: Whether content crawled by Google's existing crawlers may be used for training Gemini models and for grounding in Gemini Apps and Vertex AI generative features.
- Does not affect: Google Search inclusion or ranking. Google states explicitly that `Google-Extended` has no search impact.
- Launch: September 28, 2023
- Primary source: `https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers`
Example robots.txt entry:
```
User-agent: Google-Extended
Disallow: /
```
Applebot-Extended
- What it is: A `robots.txt` control token, not a crawler.
- Vendor quote: "Applebot-Extended does not crawl webpages. Instead, Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent."
- What it controls: Whether content crawled by Applebot may be used to train Apple's foundation models powering Apple Intelligence.
- Does not affect: Siri, Spotlight, or Safari Suggestions inclusion. Apple states that webpages disallowing `Applebot-Extended` still appear in search results.
- Launch: 2024, alongside the Apple Intelligence announcement
- Primary source: `https://support.apple.com/en-us/119829`
Example robots.txt entry:
```
User-agent: Applebot-Extended
Disallow: /
```
The critical distinction
Google-Extended and Applebot-Extended are the AI training opt-out layer for the two largest search-and-AI platforms. They are the mechanism for an operator who wants to stay visible in search (Google Search, Apple Spotlight) while opting out of training the generative AI features those same vendors ship. They are architecturally different from every other entry in this article because they do not describe traffic, they describe intent.
A common operator mistake is adding a Disallow: / for these tokens and then assuming the website is protected from all Google and Apple AI crawling. It is not. The tokens only control the training use of content crawled by the underlying search crawlers. Traffic from Google-Agent, Google-CloudVertexBot, or any Google fetcher category is governed by different rules in different places.
Category 5: Undeclared and Masquerading Traffic
The first four categories describe bots that identify themselves. Category 5 describes traffic that does not, or that claims to be something it is not. This is the hardest category to manage because the operator cannot make a policy decision about a bot they cannot reliably identify.
Bytespider (ByteDance)
ByteDance operates a crawler widely reported to train Doubao and other ByteDance LLMs. The commonly cited user agent is Mozilla/5.0 (compatible; Bytespider; [email protected]), but ByteDance has not published a vendor documentation page describing Bytespider's purpose, robots.txt position, or IP ranges. Behavioral reports from CDN operators describe robots.txt non-compliance and user agent spoofing.
The operational reality: an operator who wants to block Bytespider cannot rely on vendor cooperation. Blocks typically happen at the WAF or CDN layer using behavioral signals rather than robots.txt directives.
xAI's Grok crawler
xAI publishes no crawler documentation page. Multiple behavioral reports describe Grok's retrieval traffic as using residential IP rotation and spoofed Safari or Chrome user agents, making it functionally indistinguishable from a human visitor at the UA layer. No JSON IP range file exists. No robots.txt contract exists. From the website operator's perspective, Grok fetches look like unidentified browser traffic, with no AI bot signal to act on.
xAI's Grok operation is a high-profile example of why Category 5 needs to exist as a category at all. A comprehensive AI bot policy that relies on user-agent classification will have a blind spot the size of xAI's entire crawling operation.
Microsoft Copilot Actions
Microsoft's Copilot Actions (the agentic automation feature in Microsoft 365 Copilot) uses standard Edge or Chromium browser user agents with no bot signal. Microsoft has not published a distinct Copilot-specific user agent for agentic traffic. From the operator's log perspective, Copilot Actions traffic looks like a regular Edge user.
Microsoft is not alone in this. Every agentic browser covered in The Agentic Browser Landscape in 2026 that operates by driving a real browser session (Claude for Chrome, Perplexity Comet, ChatGPT Atlas, Opera Neon, and others) produces traffic that looks like the underlying browser's user agent. The AI agent is present, but the HTTP layer is silent about it.
Firecrawl and multi-tenant scraping services
Firecrawl is a scraper-as-a-service platform. Its default FirecrawlAgent user agent respects robots.txt, but Firecrawl is multi-tenant: many different customers' agents and applications drive the same crawler. Blocking FirecrawlAgent blocks every downstream customer simultaneously, which may include legitimate agentic tools an operator actually wants to reach. The category is a policy trap more than a technical one.
Category 5 is about detection, not blocking
An operator dealing with Category 5 traffic is no longer in the land of robots.txt directives. The tools are IP reputation, behavioral detection, rate limiting, CAPTCHAs, server-side authentication, and (when the operator wants the traffic) cryptographic identity verification via Web Bot Auth. That last one is where the identity layer is heading.
The Identity Layer: Beyond User-Agent Strings
User agent strings can be spoofed by anyone. For the Categories 1 through 4 that publish documentation, operators can combine the declared UA string with reverse DNS verification or IP allow-listing against published JSON files. For Category 5 traffic, that method fails.
Before a website can optimize for machines (which is the whole premise of machine-first architecture), it needs to know which machines are visiting. Three mechanisms are converging to solve that identification problem in a way that survives spoofing.
Web Bot Auth
Web Bot Auth is an IETF draft protocol (draft-meunier-web-bot-auth-architecture, currently at version -05 published March 2, 2026, authored by Thibault Meunier of Cloudflare and Sandor Major of Google) that applies HTTP Message Signatures (RFC 9421) to automated traffic. Each bot operator generates an Ed25519 keypair, publishes the public key as a JWKS at /.well-known/http-message-signatures-directory on a domain the operator controls, and signs every outbound HTTP request. Websites verify the signature and know, with cryptographic certainty, that the visitor came from the claimed operator.
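To make the mechanism concrete, here is a minimal sketch of the sign-and-verify round trip, using the third-party `cryptography` package. The signature base construction is simplified for illustration (RFC 9421 specifies the exact component serialization rules), and a real origin would fetch the public key from the operator's published JWKS rather than holding both keys in one process:

```python
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

def signature_base(components: list[tuple[str, str]], params: str) -> bytes:
    # Simplified RFC 9421 shape: each covered component on its own line,
    # followed by the @signature-params line.
    lines = [f'"{name}": {value}' for name, value in components]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines).encode()

# Bot side: sign the outbound request with the operator's private key.
key = Ed25519PrivateKey.generate()
params = '("@authority" "signature-agent");created=1735689600;keyid="demo-key"'
base = signature_base(
    [("@authority", "example.com"),
     ("signature-agent", '"https://agent.bot.goog"')],
    params,
)
signature = key.sign(base)

# Origin side: verify with the operator's public key (in deployment, fetched
# from /.well-known/http-message-signatures-directory on the claimed domain).
public_key = key.public_key()
try:
    public_key.verify(signature, base)
    verified = True
except InvalidSignature:
    verified = False
print(verified)  # True
```

The point of the design is that the verification step depends only on the published public key, so a spoofer who copies the user-agent string still fails the signature check.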
An IETF Working Group was chartered in early 2026 specifically for this work. The webbotauth WG has published milestones targeting standards-track specifications for authentication techniques and bot information mechanisms by April 2026, and a Best Current Practice operational document by August 2026. The protocol is moving from individual draft to standards-track faster than most IETF work.
Vendor support verified from primary sources:
- Google: Experimenting. Vendor quote: "Google is also experimenting with the `web-bot-auth` protocol, using the `https://agent.bot.goog` identity." Applied specifically to Google-Agent at the time of writing.
- Cloudflare: Fully implemented. Web Bot Auth is integrated into Cloudflare's Verified Bots program, with signature verification happening at the Cloudflare edge. Cloudflare publishes the reference implementation at `github.com/cloudflare/web-bot-auth` in Rust and TypeScript.
- Akamai: Supports Web Bot Auth signature verification on the Akamai edge.
- Amazon Bedrock AgentCore Browser: Preview support announced October 2025. AgentCore Browser signs outbound HTTP requests with a private key when Web Bot Auth is enabled.
- AWS + Cloudflare joint key registry: Announced February 2026. A collaborative open format for publishing agent public keys at scale.
OpenAI has been named as an ecosystem partner in Akamai and AWS documentation describing Web Bot Auth, but has not published its own vendor documentation confirming adoption at the time of writing. Anthropic, Perplexity, and Mistral have not documented Web Bot Auth support.
Verified bot directories
For operators who do not run their own signature verification, CDN-level bot directories are the practical fallback.
- Cloudflare Verified Bots is a curated list of bots Cloudflare approves and attaches identity to at the edge. Available as the `cf.verified_bot` boolean and `cf.verified_bot_category` string for use in WAF and rate limiting rules. Historically IP-based; as of 2025, Web Bot Auth signatures are a first-class verification method.
- Akamai Bot Manager classifies bots in the "agentic era" with intent and identity tags that go beyond good/bad, available to Akamai customers.
- Known Agents (the service formerly known as Dark Visitors, at knownagents.com) is a community-curated catalog of AI bot user agents with auto-generated `robots.txt` blocks. Not an authoritative source in the way Cloudflare and Akamai are, but the most comprehensive public catalog for operators who want a starting list.
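The two Cloudflare fields named above combine directly in a WAF custom rule expression. A sketch only; the exact category string value is an assumption, so check the values Cloudflare documents for your zone:

```
(cf.verified_bot and cf.verified_bot_category eq "AI Crawler")
```

An operator could attach this expression to a skip rule that exempts verified AI crawlers from a rate limit, or invert the `cf.verified_bot` check to challenge traffic that claims a bot UA but fails verification.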
Reverse DNS and IP verification (the old way, still useful)
For bots that publish IP ranges, reverse DNS with forward confirmation remains the baseline verification method. The pattern is: take the visitor's IP, look up the PTR record, confirm it resolves to a vendor-controlled domain (e.g., *.googlebot.com or *.applebot.apple.com), then do a forward A or AAAA lookup on that hostname and confirm it matches the original IP. This protects against UA spoofing and has been the standard verification technique for over a decade.
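The forward-confirmed pattern described above can be sketched in a few lines. The IP and domain suffixes in the usage comment are illustrative; take the real suffixes from each vendor's documentation:

```python
import socket

def verify_bot_ip(ip: str, allowed_suffixes: tuple) -> bool:
    """Forward-confirmed reverse DNS: PTR lookup, vendor-domain
    suffix check, then a forward lookup that must round-trip
    back to the original IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # PTR record
    except OSError:
        return False
    if not hostname.endswith(allowed_suffixes):    # vendor-controlled domain?
        return False
    try:
        _, _, forward_ips = socket.gethostbyname_ex(hostname)  # forward A lookup
    except OSError:
        return False
    return ip in forward_ips                       # must match the original IP

# Usage (requires DNS; suffixes per the vendor's docs):
# verify_bot_ip("66.249.66.1", (".googlebot.com", ".google.com"))
```

Because the final forward lookup must return the original IP, a spoofer who controls only the PTR record of their own address space cannot pass the check.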
A Note on llms.txt
Discussion of AI user agents eventually reaches llms.txt, the proposed convention for Markdown files at /llms.txt that describe a website's content to AI systems. This section is deliberately honest about where llms.txt stands in April 2026.
What llms.txt is
llms.txt is a proposal by Jeremy Howard of Answer.AI, published September 3, 2024. The specification lives at llmstxt.org and in the AnswerDotAI/llms-txt GitHub repository. The format is a Markdown file at the root of a domain, with an H1 site name (required), a blockquote summary, and H2 sections containing link lists that describe the website's key resources. A companion llms-full.txt convention is a single consolidated Markdown dump of an entire documentation set intended for direct ingestion into an LLM context window.
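The format is small enough to show whole. A minimal illustrative file following the structure at llmstxt.org (all names and URLs here are placeholders):

```
# Example Docs

> Short summary of what this site covers and who it is for.

## Documentation

- [Getting started](https://example.com/docs/start.md): first steps for new users
- [API reference](https://example.com/docs/api.md): full endpoint list

## Optional

- [Changelog](https://example.com/changelog.md)
```

Per the proposal, the H1 is the only required element, and links under a section named "Optional" may be skipped by consumers working within a tight context window.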
What llms.txt is not
llms.txt is not an IETF draft. It is not a W3C recommendation. It is not endorsed by any standards body. It is an informal community proposal that has remained in the same status for nineteen months.
Who publishes llms.txt
Adoption on the publishing side is real. Anthropic publishes llms.txt and llms-full.txt at docs.claude.com. OpenAI publishes at platform.openai.com/docs/llms.txt. Perplexity publishes at docs.perplexity.ai/llms-full.txt. Cloudflare publishes a large product-segmented llms.txt. Mintlify, the documentation platform, generates llms.txt automatically for all hosted documentation sites, which is the single largest driver of publish-side adoption. Google added an llms.txt to its own developer documentation in December 2025.
Who reads llms.txt
As of April 2026, no major LLM vendor has documented that their crawlers or retrieval systems consume llms.txt files from external websites. Publishing llms.txt is currently a symbolic gesture, not an operational control.
Google has gone further and said so explicitly. Both Gary Illyes and John Mueller, Googlers who speak publicly about crawler behavior, stated during 2025 that Google does not support llms.txt and that no AI system currently uses it as an operational signal. Those statements are the most-cited official Google position on the format and have not been walked back.
OpenAI, Anthropic, Perplexity, Meta, and Mistral have all published llms.txt files for their own documentation but have not committed in vendor documentation to reading the format from other websites. Observational data from CDN providers shows minimal crawler traffic to llms.txt files on third-party websites.
Should you publish llms.txt anyway?
The honest answer is that publishing llms.txt costs almost nothing and protects against the possibility that the format does eventually get picked up by one or more vendors. Treat it the same way you treated early sitemap.xml: a cheap signal that may or may not matter, worth deploying as a hedge rather than as a load-bearing SEO tactic.
Do not rely on llms.txt as an AI access control mechanism. It is not one, and the vendors that operate the crawlers have either said nothing or said no.
This website publishes an llms.txt and an llms-full.txt. Neither is believed to have material effect on AI retrieval in 2026.
How to Identify AI Bots in Your Server Logs
Three practical techniques for operators who want to start measuring AI agent traffic today.
1. Grep the user agent string
Most declared AI bots include a recognizable substring. Searching access logs for any of the following catches most traffic in Categories 1 through 3:
GPTBot|OAI-SearchBot|ChatGPT-User
ClaudeBot|Claude-User|Claude-SearchBot
PerplexityBot|Perplexity-User
Google-Agent|Google-NotebookLM|Google-Read-Aloud|Google-CloudVertexBot
Amazonbot|CCBot|Applebot
meta-externalagent
DuckAssistBot
MistralAI-User
bingbot
The pattern captures the declared AI user agents. It does not catch Category 5 traffic.
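The same alternation can run as a regex scan over access-log lines, which makes it easy to tally hits per bot rather than just find them. A sketch (the sample log line is invented):

```python
import re

# Alternation assembled from the substrings listed above
AI_BOT_RE = re.compile(
    r"GPTBot|OAI-SearchBot|ChatGPT-User"
    r"|ClaudeBot|Claude-User|Claude-SearchBot"
    r"|PerplexityBot|Perplexity-User"
    r"|Google-Agent|Google-NotebookLM|Google-Read-Aloud|Google-CloudVertexBot"
    r"|Amazonbot|CCBot|Applebot"
    r"|meta-externalagent|DuckAssistBot|MistralAI-User|bingbot"
)

def ai_bot_hits(log_lines):
    """Yield (bot_name, line) for every log line whose UA matches."""
    for line in log_lines:
        m = AI_BOT_RE.search(line)
        if m:
            yield m.group(0), line

# Illustrative log line, not a real capture:
sample = ['203.0.113.7 - - "GET / HTTP/1.1" 200 "Mozilla/5.0 ... GPTBot/1.2"']
for bot, _ in ai_bot_hits(sample):
    print(bot)  # GPTBot
```

Feeding the generator into `collections.Counter` gives a per-bot request count, which is the measurement baseline the policy section below assumes you have.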
2. Cross-reference IP ranges
For higher-confidence identification, compare the visitor's IP against the vendor's published JSON IP range file. OpenAI, Google, Common Crawl, Perplexity, Bing, and others publish machine-readable ranges. This is the canonical verification technique for operators who want to be certain a request claiming to be GPTBot actually came from OpenAI.
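Membership testing against a published range file is a few lines with the standard library. The JSON layout below (`prefixes` entries with `ipv4Prefix`/`ipv6Prefix` keys) is an assumption modelled on common vendor formats, and the URL in the usage comment is illustrative; check each vendor's documentation for the real file location and shape:

```python
import ipaddress
import json
from urllib.request import urlopen

def load_ranges(url):
    """Fetch a vendor's published JSON range file and parse it into
    network objects. Assumes a {"prefixes": [{"ipv4Prefix": ...}]}
    layout; adjust per vendor."""
    with urlopen(url) as resp:
        data = json.load(resp)
    nets = []
    for p in data.get("prefixes", []):
        cidr = p.get("ipv4Prefix") or p.get("ipv6Prefix")
        if cidr:
            nets.append(ipaddress.ip_network(cidr))
    return nets

def ip_in_ranges(ip, nets):
    """True if the visitor IP falls inside any published network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in n for n in nets)

# Usage sketch (URL illustrative -- see the vendor's crawler docs):
# nets = load_ranges("https://example.com/gptbot-ranges.json")
# ip_in_ranges("203.0.113.9", nets)
```

Caching the parsed ranges and refreshing them daily is usually enough; the vendor files change infrequently.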
3. Reverse DNS with forward confirmation
For vendors that do not publish IP range files (Anthropic is the notable case), reverse DNS is the fallback. dig -x <ip> returns a PTR record. If the hostname belongs to the claimed vendor (e.g., *.googlebot.com or *.applebot.apple.com), a forward lookup on the hostname should return the original IP. If it does not, the request is a spoof.
The Access Policy Matrix
This section does not tell you what to allow or block. Operators have different threat models, business models, and competitive positions. It gives the framework for making an informed decision.
Four questions to ask for every category:
- Does this traffic bring business value? Retrieval crawlers bring AI search visibility, which may be directly measurable through referral traffic from ChatGPT, Perplexity, and others. Training crawlers bring no immediate value but contribute to long-term model capability. User-triggered fetchers bring one-off task completion value from specific users who sent an AI to the website on purpose. Undeclared traffic brings neither value nor accountability.
- Does the vendor respect your stated wishes? Most declared training and retrieval crawlers respect `robots.txt`. User-triggered fetchers uniformly do not. Category 5 traffic ignores `robots.txt` by definition.
- Is the vendor transparent about what they do with the data? Vendors with documented purposes, IP range files, and crawler contact addresses are accountable. Vendors with no documentation are not. That transparency itself is policy-relevant.
- What does blocking cost you? Blocking OAI-SearchBot removes the website from ChatGPT Search. Blocking Claude-User prevents Claude from completing research tasks users asked it to perform on your behalf. Blocking CCBot removes the website from the upstream dataset that trains most open-source LLMs. Every block has a cost, and the cost is different per category.
A defensible default for most businesses:
- Allow Category 2 (search and retrieval crawlers).
- Make an explicit decision on Category 1 (training crawlers) based on content policy.
- Rely on Category 4 tokens (Google-Extended, Applebot-Extended) to cleanly opt out of Google and Apple training while keeping search inclusion.
- Accept that Category 3 fetchers that ignore `robots.txt` require server-side controls or nothing at all.
- Accept that Category 5 is a detection problem rather than a `robots.txt` problem.
A sample starter robots.txt:
# Allow AI search and retrieval crawlers
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
# Opt out of Google and Apple generative AI training
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
# Explicit decisions on training crawlers
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
The above is a starting point for an operator who wants AI search visibility without contributing to foundation model training. Operators with different priorities should change it. The point is that the decision is per-category, not global.
What This Means for Your Website
The practical actions for an operator in April 2026 are:
- Start logging AI bot traffic now. Whatever policy you settle on, it should be informed by real data about which bots are visiting and how often. Every published IP range and documented UA string above is a starting point for instrumentation.
- Stop treating "AI bots" as one category. The policy decision for GPTBot is not the policy decision for Google-Agent is not the policy decision for Bytespider. One rule in `robots.txt` cannot make three different decisions.
- Decide about training separately from decisions about retrieval. Google-Extended and Applebot-Extended exist specifically so those decisions can be separated. Use them.
- Accept that `robots.txt` no longer carries the full load. For Google user-triggered fetchers, Category 5 traffic, and any operator who needs a hard block that cannot be bypassed, server-side controls are the answer. `robots.txt` is a signal, not a contract, and for roughly half the AI traffic in 2026, it is a signal that will be ignored.
- Watch the identity layer. Web Bot Auth is where verification is heading. Within the next twelve months, cryptographic agent identity will move from Google and Cloudflare experiments to a standards-track protocol with multi-vendor support. Operators who start tracking `Signature-Agent` headers today will be ready for the shift.
The web now has a machine-readable identity layer in addition to its human-readable one. The identities are still being issued, the standards are still being written, and the policies are still being decided. The only wrong move is pretending it's still 2023 and one Disallow: / covers everything.

