Rankrize

Free Tool · Generator

Robots.txt Generator

Build a clean robots.txt — including AI bots like GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and CCBot.

Welcome every bot — search, AI training, AI search, and social previews. The default and recommended posture for content sites that want maximum reach.

Rule 1

  • User-agent: use * for the catch-all rule, or pick from the AI bot list below.
  • Crawl-delay: optional. Most modern crawlers ignore Crawl-delay; Yandex still uses it.
  • Sitemap: absolute URLs only — e.g., https://example.com/sitemap.xml.
  • Non-standard directives: only Yandex respects these; Google ignores them.

1 hint

  • No Sitemap declared. Adding one helps Google + Bing discover all your pages — it should be the absolute URL to your sitemap.xml.

robots.txt

Place at the root of your domain:

User-agent: *
Allow: /

Validate after deploying: upload your robots.txt to your domain root, then confirm Google can read it via the robots.txt report in Search Console.
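Acting on the hint above, a finished file with the optional fields filled in might look like this (the Crawl-delay value and the sitemap URL are illustrative placeholders):

```txt
# Catch-all rule for every crawler
User-agent: *
Allow: /
# Optional: most modern crawlers ignore Crawl-delay; Yandex still honors it
Crawl-delay: 10

# Absolute URL required; helps Google + Bing discover every page
Sitemap: https://example.com/sitemap.xml
```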

AI + search crawler reference (65 bots)

Owner, purpose, and category for every AI / search bot we track.

| User-agent | Owner | Category | Purpose |
| --- | --- | --- | --- |
| GPTBot | OpenAI | AI training | Trains OpenAI's foundation models (GPT-4o, GPT-5). |
| OAI-SearchBot | OpenAI | AI search / retrieval | Indexes content for OpenAI's SearchGPT product. |
| ChatGPT-User | OpenAI | AI search / retrieval | Fetches a page on demand when a ChatGPT user asks the model to browse a URL. |
| ClaudeBot | Anthropic | AI search / retrieval | Fetches content for citation inside Claude. |
| anthropic-ai | Anthropic | AI training | Legacy training crawler — still respected by current versions. |
| Claude-Web | Anthropic | AI search / retrieval | Inline browsing inside Claude conversations. |
| Claude-User | Anthropic | AI search / retrieval | User-triggered fetch from Claude. |
| Claude-SearchBot | Anthropic | AI search / retrieval | Crawler for Claude's search-augmented retrieval. |
| Googlebot | Google | Search engines | Web search index crawler. |
| Googlebot-Image | Google | Search engines | Crawls images for Google Images. |
| Googlebot-News | Google | Search engines | Selects content for Google News inclusion. |
| Googlebot-Video | Google | Search engines | Crawls video content for video-rich results. |
| Google-InspectionTool | Google | Search engines | Search Console live URL inspection + rendering. |
| Google-Extended | Google | AI training | Opt-in token for Gemini / Vertex training. Allowing it = opting in to AI training; blocking it = opt-out without affecting Search. |
| Google-CloudVertexBot | Google | AI training | Vertex AI fine-tuning corpus crawler. |
| GoogleOther | Google | Research / corpora | Generic Google research / experimental crawls. |
| GoogleOther-Image | Google | Research / corpora | Generic Google image research crawler. |
| GoogleOther-Video | Google | Research / corpora | Generic Google video research crawler. |
| Storebot-Google | Google | Search engines | Shopping crawler — product pricing, availability. |
| AdsBot-Google | Google | Search engines | Validates Google Ads landing-page quality. |
| AdsBot-Google-Mobile | Google | Search engines | Validates mobile Google Ads landing-page quality. |
| Mediapartners-Google | Google | Search engines | AdSense — checks page content for ad relevance. |
| Bingbot | Microsoft | Search engines | Bing web search index crawler. |
| BingPreview | Microsoft | Search engines | Bing's preview-page renderer. |
| MicrosoftPreview | Microsoft | AI search / retrieval | Bing Chat / Copilot preview crawler. |
| msnbot | Microsoft | Search engines | Legacy MSN crawler — still active. |
| PerplexityBot | Perplexity | AI search / retrieval | Builds Perplexity's search index. |
| Perplexity-User | Perplexity | AI search / retrieval | Real-time fetch when a Perplexity user requests a citation. |
| Applebot | Apple | Search engines | Spotlight, Siri, Safari Suggestions index. |
| Applebot-Extended | Apple | AI training | Apple Intelligence training opt-in. Allowing it = opting in to Apple's AI training. |
| Meta-ExternalAgent | Meta | AI training | Llama-product crawler. |
| Meta-ExternalFetcher | Meta | AI search / retrieval | Real-time browse on behalf of a Meta AI user. |
| FacebookBot | Meta | AI training | Meta AI / Llama training fetcher. |
| facebookexternalhit | Meta | Social previews | Open Graph / link-preview fetcher for Facebook + Instagram. |
| Bytespider | ByteDance | AI training | Trains ByteDance / Doubao models. |
| CCBot | Common Crawl | Research / corpora | Builds Common Crawl, the corpus most foundation models train on. |
| Diffbot | Diffbot | Research / corpora | Builds the Diffbot Knowledge Graph. |
| FriendlyCrawler | Webis (research) | Research / corpora | Common-Crawl-funded research crawler. |
| Cohere-ai | Cohere | AI search / retrieval | Cohere AI search + training fetcher. |
| cohere-training-data-crawler | Cohere | AI training | Cohere model training data crawler. |
| Mistral | Mistral AI | AI training | Mistral / Le Chat training corpus. |
| MistralAI-User | Mistral AI | AI search / retrieval | Le Chat browse-on-behalf-of-user. |
| Grok | xAI | AI search / retrieval | Grok training + retrieval fetcher. |
| xAI-Bot | xAI | AI training | xAI / Grok crawler. |
| PhindBot | Phind | AI search / retrieval | Phind dev-search index. |
| YouBot | You.com | AI search / retrieval | You.com AI search crawler. |
| Amazonbot | Amazon | AI training | Alexa + Amazon AI training crawler. |
| DuckDuckBot | DuckDuckGo | Search engines | DuckDuckGo search index. |
| YandexBot | Yandex | Search engines | Yandex search index (Russian + global). |
| Baiduspider | Baidu | Search engines | Baidu search index (China). |
| Naverbot | Naver | Search engines | Naver search index (Korea). |
| Kagibot | Kagi | Search engines | Kagi search index. |
| Ai2Bot | Allen Institute (Ai2) | AI training | Trains open research models (Olmo, Tülu). |
| Ai2Bot-Dolma | Allen Institute (Ai2) | AI training | Builds the Dolma open training corpus. |
| Twitterbot | X / Twitter | Social previews | Generates X / Twitter card previews. |
| LinkedInBot | LinkedIn | Social previews | Generates LinkedIn share previews. |
| Slackbot | Slack | Social previews | Unfurls links shared in Slack. |
| Discordbot | Discord | Social previews | Unfurls links shared in Discord. |
| WhatsApp | Meta (WhatsApp) | Social previews | Unfurls links shared in WhatsApp. |
| TelegramBot | Telegram | Social previews | Unfurls links shared in Telegram. |
| SemrushBot | Semrush | SEO tools | Semrush competitor + backlink analysis. |
| AhrefsBot | Ahrefs | SEO tools | Ahrefs SEO + backlink crawler. |
| MJ12bot | Majestic | SEO tools | Majestic backlink index. |
| DotBot | Moz | SEO tools | Moz / Open Site Explorer crawler. |
| rogerbot | Moz | SEO tools | Moz Pro audit crawler. |
Automate it with Rankrize

Allowing AI bots ≠ getting cited by AI.

Letting GPTBot in is step one. Step two is structuring your content so it actually gets cited. Rankrize generates llms.txt, ai.txt, FAQ schema, and answer-capsule content built for AI citation. Run a free GEO audit.

Run free site analysis
One free per account · No credit card


How to use the Robots.txt Generator

  • Decide your AI posture before you decide your rules

    Three options: (1) Allow all — best for content sites that want maximum reach including AI citations. (2) Allow search, block AI training — keep ranking on Google but opt out of AI model training. (3) Block all AI — keep ranking but never appear in ChatGPT, Claude, or Perplexity answers. Start with the matching template, then customize.

  • Allowing GPTBot ≠ getting cited by ChatGPT

    Robots.txt is the door — content quality is the conversation. Allowing GPTBot lets OpenAI train on your pages; actually getting cited requires your content to be discoverable, well-structured (FAQ schema, HowTo schema, clean answer capsules), and authoritative. The Rankrize platform is built around this — schema generation + answer-capsule writing on every article.

  • Always include your sitemap URL — even if you also link it from <head>

    Search engines discover sitemaps from three places: the robots.txt Sitemap directive, Search Console submissions, and HTML <link rel="sitemap"> tags. Robots.txt is the most universal — Google, Bing, Yandex, DuckDuckGo, and AI search engines all read it. Use absolute URLs (https://example.com/sitemap.xml), not relative paths.

  • Most-specific match wins — order doesn't matter

    Per RFC 9309, when multiple rules apply to the same crawler, the most specific path match wins (the longest matching pattern); order in the file is irrelevant. So `Disallow: /admin/` followed later by `Allow: /admin/help` will let Googlebot read /admin/help even though /admin/ is broadly disallowed (see the rendered example after this list).

  • Test before deploying — and re-test after every change

    Once your robots.txt is live, check it with Google Search Console's robots.txt report (link in the validation panel) — the old standalone robots.txt Tester was retired in 2023 — and spot-check a few sample URLs with the URL Inspection tool to confirm the expected allow / disallow. A typo can knock your entire site out of the index overnight; verifying takes 30 seconds.

  • Don't use robots.txt to hide private pages

    Disallow tells crawlers not to fetch a URL — it doesn't tell them not to index it. If other sites link to your /private-page, Google can still list it (with a thin snippet). For real privacy, use a noindex meta tag on the page or put it behind authentication. Robots.txt is for crawl-budget management, not access control.
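To make the most-specific-match rule concrete, here is the /admin/ example from the list above rendered as a file; the comments trace which pattern wins (the sample URLs are hypothetical):

```txt
User-agent: *
Disallow: /admin/
Allow: /admin/help

# /admin/settings -> only "Disallow: /admin/" matches          -> blocked
# /admin/help/faq -> "Allow: /admin/help" is the longest match  -> allowed
# Per RFC 9309 the two directive lines can appear in either order
```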

Frequently asked questions

About the Robots.txt Generator

What's the difference between Disallow and Allow?

Disallow tells a crawler not to fetch a URL pattern. Allow re-permits a sub-path inside an otherwise-disallowed parent. Example: `Disallow: /admin/` + `Allow: /admin/admin-ajax.php` blocks the WP admin area but lets crawlers reach the AJAX endpoint your front-end depends on. With no Disallow rule, everything is implicitly allowed.

Should I allow or block AI crawlers?

Depends on your business model. If you monetize visits to your site (ads, lead gen, content sales), allowing AI crawlers is usually a net positive — they cite you with a link, driving referral traffic and brand awareness. If your business depends on scarcity (paid research reports, premium courses), you may want to block training crawlers (GPTBot, CCBot, Google-Extended) but still allow real-time AI search (ChatGPT-User, Perplexity-User, ClaudeBot).
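Sketched as a robots.txt, that selective posture might look like the following (trim or extend the bot list to match your own policy; the full registry is in the table above):

```txt
# Opt out of AI model training
User-agent: GPTBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /

# Everyone else, including real-time AI search fetchers, stays allowed
User-agent: *
Allow: /
```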

If I block GPTBot, will ChatGPT stop citing my content?

Eventually. GPTBot is OpenAI's training crawler — blocking it tells OpenAI to exclude your content from future model training updates. Content already in the training set can keep being cited until that model version is retired (typically 6–18 months). To stop real-time citations in browsing-enabled ChatGPT, also block ChatGPT-User and OAI-SearchBot.
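A minimal sketch of the full OpenAI opt-out described above:

```txt
# Block future OpenAI model training
User-agent: GPTBot
Disallow: /

# Also block on-demand browsing and SearchGPT indexing
User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /
```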

What does Google-Extended control?

Google-Extended is the opt-in token for Google's Gemini and Vertex AI training. Allowing it means your content can be used to train future Gemini models; blocking it opts you out without any effect on your Google Search rankings (Googlebot is a separate crawler). Allowing Google to use your content for AI training also increases your chances of being cited by Gemini and Google AI Overviews.
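The opt-out itself is two lines; because Googlebot is a separate user-agent, no other rule is needed to keep Search crawling intact:

```txt
# Opt out of Gemini / Vertex AI training; Googlebot is unaffected
User-agent: Google-Extended
Disallow: /
```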

Where does the robots.txt file go?

At the root of your domain — e.g., https://example.com/robots.txt. It must be a single text file at the exact path /robots.txt, not in a subdirectory. If you have multiple subdomains (blog.example.com, shop.example.com), each one needs its own robots.txt at its own root. WordPress, Shopify, Webflow, Framer, and Next.js all support uploading or generating it.

How often should I update my robots.txt?

Whenever a significant new AI crawler launches (we update the AI bot reference table here as soon as new ones are documented), or whenever your site structure changes meaningfully (new admin paths, new private sections, new sitemaps). Otherwise, set-and-forget is fine. Most sites change theirs once or twice a year.

Is the bot list kept up to date?

Yes. We track the registry against each provider's official documentation (OpenAI, Anthropic, Google, Apple, Perplexity, Meta) plus the darkvisitors.com index. Each entry links to the provider's official docs page where available. The list is updated whenever a major provider announces a new crawler or renames an existing one.

Can I declare more than one sitemap?

Absolutely. Add a `Sitemap:` line for each one. A common pattern is a separate sitemap per content type (sitemap-pages.xml, sitemap-articles.xml, sitemap-products.xml), all referenced from a single robots.txt. Or use one sitemap-index.xml that points to all your individual sitemaps — also fully supported.
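Either pattern is just repeated `Sitemap:` lines; the filenames below mirror the examples above:

```txt
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-articles.xml
Sitemap: https://example.com/sitemap-products.xml

# Or a single index file that references the rest:
# Sitemap: https://example.com/sitemap-index.xml
```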