Robots.txt for AI Bots in 2026: Complete Configuration for GPTBot, PerplexityBot, ClaudeBot, GoogleBot AI

Your robots.txt file is the single biggest invisible blocker between your content and AI search citations. The CapstonAI Q1 2026 cohort audit found that 41% of B2B sites still block at least one major AI bot, usually a leftover from the 2023-2024 “block everything” panic. Each blocked bot costs an estimated 18-34% of potential AI citations on that engine. Sites that unblocked GPTBot, PerplexityBot, and ClaudeBot in Q4 2025 saw AI-attributed traffic rise 186% within 90 days. Here’s the complete 2026 robots.txt configuration with copy-paste blocks for every major AI crawler.

TL;DR: Configure robots.txt by: (1) auditing currently-blocked AI bots, (2) explicitly allowing GPTBot, PerplexityBot, ClaudeBot, Google-Extended, OAI-SearchBot, ChatGPT-User, (3) deciding training-vs-search bot policy, (4) keeping sensitive paths blocked for ALL bots, (5) adding sitemap reference, (6) validating with each engine’s documented user-agent, (7) re-checking after every CMS upgrade.

The 9-step technical playbook

  1. Step 1: Audit your current robots.txt. Open https://yourdomain.com/robots.txt in a browser. Search for: GPTBot, PerplexityBot, ClaudeBot, Google-Extended, anthropic-ai, CCBot, Omgilibot, FacebookBot, Bytespider. Any “Disallow: /” entry under these = invisible to that engine. Document baseline before changes.
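  2. Step 2: Understand the two bot families per engine. Most AI providers run at least TWO bots: a TRAINING crawler (e.g. GPTBot, ClaudeBot) and a SEARCH/RAG crawler or user-triggered fetcher used at query time (e.g. OAI-SearchBot, ChatGPT-User, PerplexityBot, Claude-SearchBot). Google-Extended and Applebot-Extended are control tokens rather than separate crawlers: they tell Google and Apple whether content that is already crawled may be used for their AI models. Blocking the search bot = no citations. Blocking the training bot = no model knowledge of you. Decide per bot.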
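  3. Step 3: Allow the search/citation bots (recommended for everyone). Copy-paste this block into your robots.txt (group order does not affect matching; placing it at the TOP just makes it easy to audit). The search and user-fetch agents retrieve pages at query time to cite you. ClaudeBot, Google-Extended, and Applebot-Extended are training-side tokens covered in Step 4 and appear here on the assumption that you allow them too: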
    User-agent: OAI-SearchBot
    Allow: /
    
    User-agent: ChatGPT-User
    Allow: /
    
    User-agent: PerplexityBot
    Allow: /
    
    User-agent: Perplexity-User
    Allow: /
    
    User-agent: ClaudeBot
    Allow: /
    
    User-agent: Claude-SearchBot
    Allow: /
    
    User-agent: Claude-User
    Allow: /
    
    User-agent: Google-Extended
    Allow: /
    
    User-agent: Applebot-Extended
    Allow: /
  4. Step 4: Decide your training-bot policy. Training bots (GPTBot, ClaudeBot, Google-Extended for Gemini training, anthropic-ai, CCBot, Bytespider) feed model knowledge. ALLOW if you want long-term brand familiarity in models (recommended for B2B and editorial). DISALLOW only if you have proprietary content you don’t want memorized. Default for CapstonAI cohort: allow training bots on marketing site, disallow on app/SaaS UI.
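  5. Step 5: Keep sensitive paths blocked for ALL bots. Even when allowing AI bots, keep admin/auth/checkout/account paths blocked universally. A crawler obeys only the most specific group that matches its user-agent, so a bot that has its own “Allow: /” group (as in Step 3) does not read the rules under “User-agent: *”. Repeat these Disallow lines in every bot-specific group as well as in the default group: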
    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-login.php
    Disallow: /checkout/
    Disallow: /account/
    Disallow: /cart/
    Disallow: /search?
    Disallow: /*?utm_
    
    Sitemap: https://yourdomain.com/sitemap.xml
    Sitemap: https://yourdomain.com/sitemap-news.xml
  6. Step 6: Add a complete reference robots.txt. Production-ready 2026 template (copy as starting point):
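    # CapstonAI 2026 reference robots.txt
    # Allow AI search/citation bots, but keep sensitive paths blocked.
    # A bot that matches this group ignores the "User-agent: *" group,
    # so the Disallow lines are repeated here.
    User-agent: OAI-SearchBot
    User-agent: ChatGPT-User
    User-agent: PerplexityBot
    User-agent: Perplexity-User
    User-agent: ClaudeBot
    User-agent: Claude-SearchBot
    User-agent: Claude-User
    User-agent: Google-Extended
    User-agent: Applebot-Extended
    Allow: /
    Disallow: /wp-admin/
    Disallow: /wp-login.php
    Disallow: /checkout/
    Disallow: /account/
    Disallow: /cart/
    Disallow: /*?utm_
    Disallow: /*?fbclid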
    
    # Default for everyone else
    User-agent: *
    Allow: /
    Disallow: /wp-admin/
    Disallow: /wp-login.php
    Disallow: /checkout/
    Disallow: /account/
    Disallow: /cart/
    Disallow: /*?utm_
    Disallow: /*?fbclid
    
    Sitemap: https://yourdomain.com/sitemap.xml
  7. Step 7: Validate per-bot fetchability. Test each user-agent with curl: curl -A "PerplexityBot" -I https://yourdomain.com/your-page/ and expect HTTP 200 with no X-Robots-Tag: noindex header. Repeat for GPTBot, ClaudeBot, OAI-SearchBot. This test checks server-side blocking only, because curl does not read robots.txt. Also check Cloudflare/CDN bot-management settings: many CDNs block AI bots at the WAF layer regardless of robots.txt.
  8. Step 8: Check CDN + WAF + bot-management overrides. Cloudflare “Block AI Bots” toggle (added 2024) overrides your robots.txt. Same for AWS WAF managed rules and Akamai Bot Manager. If you allowed bots in robots.txt but they still 403, the block is upstream. Disable AI-bot blocking rules in your CDN dashboard.
  9. Step 9: Re-validate quarterly + after every CMS migration. Robots.txt regressions happen on every WordPress migration, theme update, security plugin install, and CDN config change. Add quarterly robots.txt audit to your SEO calendar. Set up a monitor (Pingdom, UptimeRobot, or CapstonAI’s robots watcher) to alert on changes.
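
The audit in Steps 1, 5, and 6 can be sketched with Python's standard-library parser. This is a minimal sketch, not a definitive checker: the function name `audit_robots`, the bot list, and the sample file are illustrative assumptions. Note also that `urllib.robotparser` applies the first matching rule in a group and does not understand wildcards such as `/*?utm_`, so its answers can differ from a crawler that follows RFC 9309 longest-match rules. The sample avoids those cases.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the playbook above; extend the list as needed.
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
           "ClaudeBot", "Google-Extended"]


def audit_robots(robots_txt, bots=AI_BOTS, paths=("/", "/checkout/"),
                 origin="https://yourdomain.com"):
    """Return {bot: {path: allowed}} for a robots.txt supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: {path: parser.can_fetch(bot, origin + path) for path in paths}
        for bot in bots
    }


# Hypothetical file: a leftover GPTBot block, an explicit PerplexityBot
# group, and sensitive paths listed only under the default group.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Disallow: /checkout/
"""

if __name__ == "__main__":
    for bot, result in audit_robots(SAMPLE).items():
        print(f"{bot:16} {result}")
```

With this sample, GPTBot is blocked everywhere, and PerplexityBot is allowed on /checkout/ because its own group replaces the default one. That is the gap the repeated Disallow lines are meant to close.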

Concrete case study

Real customer pattern (anonymized) showing the impact of this implementation over one quarter:

| Metric | Before unblocking AI bots | After (90 days) | Delta |
|---|---|---|---|
| AI-attributed sessions (GA4) | ~110/mo | ~880/mo | +700% |
| ChatGPT citations (panel of 30) | 3 | 19 | +16 |
| Perplexity citations (panel of 30) | 1 | 22 | +21 |
| Claude citations (panel of 25) | 0 | 14 | +14 |
| Google AI Overview appearances | 5 | 27 | +22 |

Common technical errors when implementing robots.txt for AI bots

  • Disallow: / under User-agent: GPTBot left over from 2023. The most common audit finding. Removes you from future ChatGPT model knowledge. Search-time citations are governed separately by OAI-SearchBot, so check that it is not disallowed alongside GPTBot. Fix immediately.
  • Allowing the training bot but forgetting the search bot. Allowing GPTBot while blocking OAI-SearchBot still kills your ChatGPT citations. You need BOTH bots allowed for full ChatGPT visibility.
  • Cloudflare “Block AI Bots” toggle still ON. Overrides robots.txt. Check Cloudflare → Security → Bots → AI Scrapers and Crawlers — set to “Allow”.
  • Robots.txt blocking /sitemap.xml or sitemap path. Bots can’t discover your URLs. Always Allow the sitemap path explicitly.
  • Using noindex meta on pages you want AI-cited. Noindex affects all crawlers including AI search bots. If a page should be AI-cited, it must be indexable.
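
The errors above can be told apart by combining the robots.txt verdict with one HTTP probe per user-agent. The sketch below is a rough triage helper under stated assumptions: `probe_as_bot` and `diagnose` are hypothetical names, and only the 403 case described in the playbook is treated as an upstream block. A probe sent from your own machine with a spoofed User-Agent may not reproduce what a CDN does to the real bot, since some bot-management products also check the source IP.

```python
import urllib.error
import urllib.request


def probe_as_bot(url, user_agent, timeout=10.0):
    """HEAD-request `url` with the given User-Agent.

    Returns (status_code, X-Robots-Tag header or None).
    """
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("X-Robots-Tag")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry status and headers.
        return err.code, err.headers.get("X-Robots-Tag")


def diagnose(robots_allows, status, x_robots_tag):
    """Map one probe result to the likely cause from the list above."""
    if not robots_allows:
        return "blocked in robots.txt"
    if status == 403:
        return "allowed in robots.txt but refused upstream (check CDN/WAF rules)"
    if x_robots_tag and "noindex" in x_robots_tag.lower():
        return "fetchable but marked noindex"
    if status == 200:
        return "ok"
    return f"unexpected status {status}"
```

A typical call would be `diagnose(True, *probe_as_bot("https://yourdomain.com/your-page/", "GPTBot"))`, where the first argument comes from whatever robots.txt check you already run.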

FAQ — robots.txt for AI bots

Do AI bots actually respect robots.txt?

Major commercial bots (GPTBot, PerplexityBot, ClaudeBot, Google-Extended) publicly commit to robots.txt compliance and have been observed honoring it in CapstonAI’s 18-month bot-traffic logs. Bytespider (operated by ByteDance, TikTok’s parent company) and some scraper bots are less reliable. Block those at the WAF if you need a hard stop.
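
You can run the same kind of compliance check on your own access logs. The following is a minimal sketch that assumes logs in the common "combined" format. The regular expression, the `count_violations` name, and the sample lines (including the user-agent strings and IP addresses) are illustrative and not taken from any vendor documentation. User-Agent headers can be spoofed, so treat the counts as a signal rather than proof.

```python
import re
from collections import Counter

# Combined Log Format tail: "METHOD path HTTP/x" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"')

BOT_TOKENS = ("GPTBot", "PerplexityBot", "ClaudeBot", "Bytespider")


def count_violations(log_lines, disallowed_prefixes, bot_tokens=BOT_TOKENS):
    """Count requests per bot token that hit a path prefix you disallowed."""
    prefixes = tuple(disallowed_prefixes)
    violations = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a request line in the assumed format
        user_agent = match["ua"].lower()
        if not match["path"].startswith(prefixes):
            continue  # request was to an allowed path
        for token in bot_tokens:
            if token.lower() in user_agent:
                violations[token] += 1
    return violations


# Invented sample lines for illustration only.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/May/2026:10:00:00 +0000] "GET /account/settings HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Bytespider)"',
    '203.0.113.8 - - [10/May/2026:10:00:01 +0000] "GET /blog/post HTTP/1.1" '
    '200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
]

if __name__ == "__main__":
    print(count_violations(SAMPLE_LOG, ["/account/", "/checkout/"]))
```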

Should I block GPTBot to protect my content from ChatGPT training?

Trade-off: blocking GPTBot removes you from future ChatGPT model knowledge — meaning ChatGPT users who ask about your category may never see your brand surface organically. Most B2B brands net-benefit from allowing it. Block only if you have genuinely proprietary content you can’t risk being memorized.

What’s the difference between GPTBot and OAI-SearchBot?

GPTBot = OpenAI’s training crawler (feeds model weights, runs continuously). OAI-SearchBot = OpenAI’s search index crawler (powers ChatGPT search results). ChatGPT-User = the live fetch made on a user’s behalf when a ChatGPT conversation or Custom GPT needs to visit a page. All three are separate user-agents and need separate robots.txt entries.
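
Because the three agents are matched independently, a per-role policy has to be expanded into one robots.txt group per agent. The sketch below shows one way to do that. The role labels, the `build_groups` helper, and the list of sensitive paths are assumptions made for illustration, and any role missing from the policy is treated as disallowed.

```python
# Role of each OpenAI user-agent token, as described in the answer above.
OPENAI_AGENTS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-fetch",
}

# Example sensitive paths; replace with your own.
SENSITIVE_PATHS = ("/checkout/", "/account/", "/cart/")


def build_groups(agents, policy, sensitive_paths=SENSITIVE_PATHS):
    """Render one robots.txt group per agent from a {role: allowed} policy."""
    groups = []
    for agent, role in agents.items():
        lines = [f"User-agent: {agent}"]
        if policy.get(role, False):
            # Allowed bots still get the sensitive-path rules, because a bot
            # that matches its own group does not read the "*" group.
            lines += [f"Disallow: {path}" for path in sensitive_paths]
            lines.append("Allow: /")
        else:
            lines.append("Disallow: /")
        groups.append("\n".join(lines))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Example policy: opt out of training, stay visible in search and user fetches.
    print(build_groups(
        OPENAI_AGENTS,
        {"training": False, "search": True, "user-fetch": True}))
```

Listing the Disallow lines before `Allow: /` keeps the output correct both for crawlers that use longest-match precedence and for simpler parsers that stop at the first matching rule.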

Last updated: May 2026. Sources: Schema.org documentation (https://schema.org/), Wikidata WikiProject Informatics, OpenAI bot documentation (platform.openai.com/docs/bots), Anthropic crawler documentation (anthropic.com/claudebot), Perplexity bot disclosure, Google-Extended documentation (developers.google.com/search/docs/crawling-indexing/google-common-crawlers), llmstxt.org (Howard et al., 2024), CapstonAI Q1 2026 cohort benchmark (86 customers, 24 800 LLM responses analyzed).