Robots.txt For AI Bots: Control GPTBot, Google-Extended & More

A robots.txt for AI bots tells AI crawlers — such as GPTBot, Google-Extended and PerplexityBot — which parts of your site they may access, giving you control over whether your content can be read and used by AI engines.

TL;DR

What it is: rules in robots.txt that allow or block named AI crawler user-agents.
Why it matters: it decides whether AI engines can read — and potentially cite — your content.
Pair it with llms.txt: robots.txt controls access; llms.txt guides AI engines to your best content.

Robots.txt for AI Bots in 2026: Complete Configuration for GPTBot, PerplexityBot, ClaudeBot, Google-Extended

Your robots.txt file is the single biggest invisible blocker between your content and AI search citations. The CapstonAI Q1 2026 cohort audit found that 41% of B2B sites still block at least one major AI bot — usually a leftover from the 2023-2024 “block everything” panic. Each blocked bot costs an estimated 18-34% of potential AI citations on that engine. Sites that unblocked GPTBot + PerplexityBot + ClaudeBot in Q4 2025 saw +186% AI-attributed traffic in 90 days. Here’s the complete 2026 robots.txt configuration with copy-paste blocks for every major AI crawler.

TL;DR: Configure robots.txt by: (1) auditing currently-blocked AI bots, (2) explicitly allowing GPTBot, PerplexityBot, ClaudeBot, Google-Extended, OAI-SearchBot, ChatGPT-User, (3) deciding training-vs-search bot policy, (4) keeping sensitive paths blocked for ALL bots, (5) adding sitemap reference, (6) validating with each engine’s documented user-agent, (7) re-checking after every CMS upgrade.

The 9-step technical playbook

Step 1: Audit your current robots.txt. Open https://yourdomain.com/robots.txt in a browser. Search for: GPTBot, PerplexityBot, ClaudeBot, Google-Extended, anthropic-ai, CCBot, Omgilibot, FacebookBot, Bytespider. A “Disallow: /” rule under any of these user-agents tells that crawler to stay off the whole site, which makes you invisible to that engine. Record this baseline before you change anything.
Step 2: Understand the two bot families per engine. Most AI providers run at least TWO bots: a TRAINING crawler (e.g. GPTBot, ClaudeBot) and a SEARCH/RAG crawler or fetcher used at query time (e.g. OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot). Google-Extended is a robots.txt control token for Gemini rather than a separate crawler: Google fetches with its existing user agents and uses the token only to decide how the content may be used. Blocking the search bot = no citations. Blocking the training bot = no future model knowledge of you. Decide per bot.
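
The Step 1 audit can be scripted. Below is a minimal sketch using Python's standard-library urllib.robotparser; the function name blocked_ai_bots and the sample file are illustrative, not part of any CapstonAI tool. This parser matches user-agent tokens as substrings and applies rules in file order, so treat it as a quick check rather than an exact model of how each engine reads your file.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the audit list in Step 1.
AI_BOT_TOKENS = [
    "GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended",
    "anthropic-ai", "CCBot", "Omgilibot", "FacebookBot", "Bytespider",
]


def blocked_ai_bots(robots_txt: str, site: str = "https://yourdomain.com/") -> list[str]:
    """Return the AI bot tokens that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOT_TOKENS if not parser.can_fetch(bot, site)]


# Hypothetical robots.txt with a leftover GPTBot block.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""

print(blocked_ai_bots(SAMPLE))  # ['GPTBot']
```

Paste the contents of your own robots.txt in place of SAMPLE to get the baseline list of blocked bots.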

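Step 3: Allow the search/citation bots (recommended for everyone). Add this block to your robots.txt. OAI-SearchBot, ChatGPT-User, PerplexityBot, Perplexity-User, Claude-SearchBot and Claude-User fetch pages for search or at query time to cite you. ClaudeBot, Google-Extended and Applebot-Extended are training-side tokens, included here on the assumption that you also allow training (see Step 4); drop them if you do not: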

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /

Step 4: Decide your training-bot policy. Training bots and tokens (GPTBot, ClaudeBot, Google-Extended for Gemini training, anthropic-ai, CCBot, Bytespider) feed model knowledge. ALLOW if you want long-term brand familiarity in models (recommended for B2B and editorial). DISALLOW only if you have proprietary content you don’t want memorized. Default for CapstonAI cohort: allow training bots on marketing site, disallow on app/SaaS UI.

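Step 5: Keep sensitive paths blocked for ALL bots. Even when allowing AI bots, keep admin/auth/checkout/account paths blocked universally. A crawler that matches a named group (such as the Step 3 entries) follows only that group and ignores the User-agent: * group, so repeat these Disallow lines under each named AI bot as well: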

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /checkout/
Disallow: /account/
Disallow: /cart/
Disallow: /search?
Disallow: /*?utm_

Sitemap: https://yourdomain.com/sitemap.xml
Sitemap: https://yourdomain.com/sitemap-news.xml
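
Whether a sensitive path is really blocked for a named bot depends on group selection: under the Robots Exclusion Protocol (RFC 9309) a crawler uses the group that names it and falls back to the * group only when no named group exists. The sketch below is a simplified matcher written for illustration (exact token match, longest rule wins, Allow wins ties); parse_groups and is_allowed are hypothetical helper names, and this is not a complete RFC 9309 implementation.

```python
import re


def parse_groups(robots_txt: str) -> dict[str, list[tuple[bool, str]]]:
    """Map each lower-cased user-agent token to its (allow, path) rules."""
    groups: dict[str, list[tuple[bool, str]]] = {}
    agents: list[str] = []      # user-agents of the group being read
    reading_rules = False       # True once the current group has a rule
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if reading_rules:   # a user-agent line after rules starts a new group
                agents, reading_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow") and agents:
            reading_rules = True
            if value:           # an empty Disallow matches nothing
                for agent in agents:
                    groups[agent].append((field == "allow", value))
    return groups


def _pattern(path: str) -> re.Pattern:
    """Translate a robots.txt path ('*' wildcard, trailing '$') to a regex."""
    anchored = path.endswith("$")
    body = path[:-1] if anchored else path
    regex = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(regex + ("$" if anchored else ""))


def is_allowed(robots_txt: str, agent: str, path: str) -> bool:
    groups = parse_groups(robots_txt)
    # A named group, when present, is used INSTEAD of the "*" group.
    rules = groups.get(agent.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for allow, rule_path in rules:
        if _pattern(rule_path).match(path):
            longer = len(rule_path) > best_len
            tie_allow = len(rule_path) == best_len and allow
            if longer or tie_allow:
                best_len, allowed = len(rule_path), allow
    return allowed


NOT_INHERITED = "User-agent: OAI-SearchBot\nAllow: /\n\nUser-agent: *\nDisallow: /wp-admin/\n"
FIXED = "User-agent: OAI-SearchBot\nAllow: /\nDisallow: /wp-admin/\n\nUser-agent: *\nDisallow: /wp-admin/\n"

print(is_allowed(NOT_INHERITED, "OAI-SearchBot", "/wp-admin/"))  # True
print(is_allowed(FIXED, "OAI-SearchBot", "/wp-admin/"))          # False
```

The first result shows the pitfall: the named group contains only Allow: /, so the Disallow under * never applies to that bot until it is repeated in the bot's own group.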

Step 6: Add a complete reference robots.txt. Production-ready 2026 template (copy as starting point):

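# CapstonAI 2026 reference robots.txt
# Allow AI search/citation bots
# Bots matched by this group ignore the "*" group below,
# so the sensitive-path rules are repeated here.
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: Google-Extended
User-agent: Applebot-Extended
Allow: /
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /checkout/
Disallow: /account/
Disallow: /cart/
Disallow: /*?utm_
Disallow: /*?fbclid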

# Default for everyone else
User-agent: *
Allow: /
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /checkout/
Disallow: /account/
Disallow: /cart/
Disallow: /*?utm_
Disallow: /*?fbclid

Sitemap: https://yourdomain.com/sitemap.xml
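
Because a crawler that matches a named group reads only that group, the shared Disallow lines have to appear in both the AI-bot group and the * group. One way to keep the two in sync is to generate the file from a single list. The sketch below assumes the bot names and paths used in this article; build_robots_txt is a hypothetical helper, not an existing tool.

```python
# Paths and bot tokens copied from the reference template in this article.
SENSITIVE_PATHS = [
    "/wp-admin/", "/wp-login.php", "/checkout/", "/account/", "/cart/",
    "/*?utm_", "/*?fbclid",
]
AI_BOTS = [
    "OAI-SearchBot", "ChatGPT-User", "PerplexityBot", "Perplexity-User",
    "ClaudeBot", "Claude-SearchBot", "Google-Extended", "Applebot-Extended",
]


def build_robots_txt(bots: list[str], blocked: list[str], sitemap: str) -> str:
    """Render one shared AI-bot group, a '*' group, and a Sitemap line."""
    # The same rule list is written into both groups so they cannot drift.
    rules = ["Allow: /"] + [f"Disallow: {path}" for path in blocked]
    lines = ["# Allow AI search/citation bots"]
    lines += [f"User-agent: {bot}" for bot in bots]
    lines += rules
    lines += ["", "# Default for everyone else", "User-agent: *"]
    lines += rules
    lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"


print(build_robots_txt(AI_BOTS, SENSITIVE_PATHS, "https://yourdomain.com/sitemap.xml"))
```

Generating the file this way means a new sensitive path only has to be added once.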

Step 7: Validate per-bot fetchability. Test each user-agent with curl: curl -A "PerplexityBot" -I https://yourdomain.com/your-page/ and expect HTTP 200 with no X-Robots-Tag: noindex header. Repeat for GPTBot, ClaudeBot, OAI-SearchBot. A spoofed user-agent only exercises user-agent-based rules, so a pass here does not rule out IP-based blocking. Also check Cloudflare/CDN bot-management settings: many CDNs block AI bots at the WAF layer regardless of robots.txt.
Step 8: Check CDN + WAF + bot-management overrides. Cloudflare’s “Block AI Bots” toggle (added 2024) blocks these crawlers at the edge no matter what your robots.txt says. The same applies to AWS WAF managed rules and Akamai Bot Manager. If you allowed bots in robots.txt but they still get HTTP 403, the block is upstream. Disable AI-bot blocking rules in your CDN dashboard.
Step 9: Re-validate quarterly + after every CMS migration. Robots.txt regressions happen on every WordPress migration, theme update, security plugin install, and CDN config change. Add quarterly robots.txt audit to your SEO calendar. Set up a monitor (Pingdom, UptimeRobot, or CapstonAI’s robots watcher) to alert on changes.
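
The Step 7 curl check can be repeated for several bots in one go. The sketch below uses only the Python standard library; diagnose and check are hypothetical helper names, and the status rules simply encode the rules of thumb from Steps 7 and 8 (403 means an upstream block, noindex in X-Robots-Tag means the page is fetchable but excluded). As with curl, it only spoofs the User-Agent header, so it cannot detect blocking based on IP verification.

```python
import urllib.error
import urllib.request
from typing import Optional


def diagnose(status: int, x_robots_tag: Optional[str]) -> str:
    """Classify one response using the rules of thumb from Steps 7 and 8."""
    if status == 403:
        return "blocked upstream (check CDN/WAF bot rules)"
    if status != 200:
        return f"unexpected status {status}"
    if x_robots_tag and "noindex" in x_robots_tag.lower():
        return "fetchable but marked noindex"
    return "ok"


def check(url: str, user_agent: str, timeout: float = 10.0) -> str:
    """Send a HEAD request (like curl -I) with the given User-Agent."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return diagnose(response.status, response.headers.get("X-Robots-Tag"))
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; diagnose them the same way.
        return diagnose(err.code, err.headers.get("X-Robots-Tag"))
```

Calling check("https://yourdomain.com/your-page/", "PerplexityBot") performs the same request as the curl command in Step 7; loop over GPTBot, ClaudeBot and OAI-SearchBot to cover the rest.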

Concrete case study

Real customer pattern (anonymized) showing the impact of this implementation over one quarter:

Metric	Before unblocking AI bots	After (90 days)	Delta
AI-attributed sessions (GA4)	~110/mo	~880/mo	+700%
ChatGPT citations (panel of 30)	3	19	+16
Perplexity citations (panel of 30)	1	22	+21
Claude citations (panel of 25)	0	14	+14
Google AI Overview appearances	5	27	+22

Common technical errors when implementing robots.txt for AI bots

Disallow: / under User-agent: GPTBot left over from 2023. The most common audit finding. It keeps your content out of future ChatGPT model training. Search-time citations are governed separately by OAI-SearchBot, so check that entry as well. Fix immediately.
Allowing only the training bot but forgetting the search bot. Allowing GPTBot but blocking OAI-SearchBot still kills your ChatGPT citations. You need BOTH bots allowed for full ChatGPT visibility.
Cloudflare “Block AI Bots” toggle still ON. Overrides robots.txt. Check Cloudflare → Security → Bots → AI Scrapers and Crawlers — set to “Allow”.
Robots.txt blocking /sitemap.xml or sitemap path. Bots can’t discover your URLs. Always Allow the sitemap path explicitly.
Using noindex meta on pages you want AI-cited. Noindex affects all crawlers including AI search bots. If a page should be AI-cited, it must be indexable.
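
A noindex directive can sit in the HTML or in a response header, so it is easy to miss by eye. The sketch below checks both with Python's standard-library html.parser; has_noindex is a hypothetical helper, and it only looks at the generic robots meta name, not bot-specific meta tags.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lower-cased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            self.directives.append(attributes.get("content", "").lower())


def has_noindex(html: str, x_robots_tag: str = "") -> bool:
    """True if a robots meta tag or the X-Robots-Tag value contains noindex."""
    parser = _RobotsMetaParser()
    parser.feed(html)
    sources = parser.directives + [x_robots_tag.lower()]
    return any("noindex" in directive for directive in sources)


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))  # True
```

Run it against the pages you most want cited; any True result means the page is asking crawlers not to index it.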

FAQ — robots.txt for AI bots

Do AI bots actually respect robots.txt?

Major commercial bots (GPTBot, PerplexityBot, ClaudeBot) and Google’s Google-Extended token come with public commitments to robots.txt compliance and have been observed honoring it in CapstonAI’s 18-month bot-traffic logs. Bytespider (ByteDance, TikTok’s parent company) and some scraper bots are less reliable, so block them at the WAF if you need a hard stop.

Should I block GPTBot to protect my content from ChatGPT training?

Trade-off: blocking GPTBot removes you from future ChatGPT model knowledge — meaning ChatGPT users who ask about your category may never see your brand surface organically. Most B2B brands net-benefit from allowing it. Block only if you have genuinely proprietary content you can’t risk being memorized.

What’s the difference between GPTBot and OAI-SearchBot?

GPTBot = OpenAI’s training crawler (collects content that may be used to train models). OAI-SearchBot = OpenAI’s search index crawler (powers ChatGPT search results). ChatGPT-User = the live fetch made on a user’s behalf when a ChatGPT or custom GPT conversation needs to visit a page. It is not triggered by a user clicking a citation link, which is an ordinary browser visit. All three are separate user-agents and need separate robots.txt entries.
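
To see which of the three OpenAI user-agents actually reach your site, the tokens can be matched against the User-Agent strings in your access logs. The sketch below is illustrative: classify and tally are hypothetical helpers, the role labels follow the descriptions in this article, and the sample strings are made-up examples rather than the exact strings OpenAI sends.

```python
from collections import Counter
from typing import Optional

# Token -> role, as described in this article.
BOT_ROLES = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "live fetch",
}


def classify(user_agent: str) -> Optional[str]:
    """Return the role of the first known token found in the User-Agent."""
    ua = user_agent.lower()
    for token, role in BOT_ROLES.items():
        if token.lower() in ua:
            return role
    return None


def tally(user_agents: list[str]) -> Counter:
    """Count hits per role, ignoring user-agents that match no token."""
    return Counter(role for role in map(classify, user_agents) if role)


# Made-up User-Agent values for illustration only.
hits = [
    "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)",
    "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0",
]
print(tally(hits))
```

A role that never shows up in the tally is a hint that the matching user-agent is blocked somewhere, either in robots.txt or upstream.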

Last updated: May 2026. Sources: Schema.org documentation (https://schema.org/), Wikidata WikiProject Informatics, OpenAI bot documentation (platform.openai.com/docs/bots), Anthropic crawler documentation (anthropic.com/claudebot), Perplexity bot disclosure, Google-Extended documentation (developers.google.com/search/docs/crawling-indexing/google-common-crawlers), llmstxt.org (Howard et al., 2024), CapstonAI Q1 2026 cohort benchmark (86 customers, 24 800 LLM responses analyzed).

Frequently asked questions

What is a robots.txt for AI bots?

It is a robots.txt file that includes rules for AI crawler user-agents, allowing or disallowing crawlers such as GPTBot, Google-Extended and PerplexityBot from accessing your content.

Should I block AI bots in robots.txt?

It is a trade-off: blocking protects content from being used in AI training and answers, but also removes the chance of being cited by AI engines. Most brands seeking AI visibility allow the crawlers that power answer engines.

What is the difference between robots.txt and llms.txt?

robots.txt controls which crawlers may access your site; llms.txt is a curated map that guides AI models to your most important content. They are complementary.