Robots.txt Generator
Easily create a robots.txt file to control search engine and AI crawler access to your website.
What is Robots.txt?
robots.txt is a text file placed at the root of your website that tells search engine crawlers which pages they can or cannot access. It must be located at the root of the host (e.g. https://example.com/robots.txt) and encoded in UTF-8.
Key Directives
User-agent: Specifies the crawler the rule applies to. * means all crawlers.
Allow: Permits crawling of the specified path. An Allow rule that is more specific than a conflicting Disallow takes precedence, which lets you carve exceptions out of blocked directories.
Disallow: Blocks crawling of the specified path.
Sitemap: Tells crawlers where your sitemap is located.
Crawl-delay: Sets seconds between requests (ignored by Google; supported by Bing, Yandex, etc.).
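A minimal sketch combining these directives (the paths and sitemap URL are placeholders; adjust them to your site):

User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml

Here every crawler is kept out of /private/ except the one explicitly allowed file, crawlers that honor Crawl-delay wait 10 seconds between requests, and the sitemap location is advertised.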
Real-world patterns
These are configurations commonly used on production sites. Before copying, adjust the paths to match your own site structure.
Personal blog or basic site
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
This is sufficient for most personal blogs and small sites. Simply declaring a Sitemap URL improves how quickly your pages are discovered.
WordPress standard setup
User-agent: *
Allow: /
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /feed/
Disallow: /trackback/
Disallow: /?s=
Sitemap: https://example.com/sitemap_index.xml
admin-ajax.php needs to be allowed because it is the public endpoint called by the front end. /?s= produces internal search result pages that can trigger duplicate-content issues, so blocking it is recommended.
Block AI training crawlers, allow search engines
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
Regular search crawlers (Googlebot, Bingbot, Naver's Yeti) retain full access while generative-AI training crawlers are blocked. This became practical in 2023-2024, as major AI companies split their User-Agent strings between training crawlers and real-time browsing.
Staging or test server - full block
User-agent: *
Disallow: /
These two lines are all it takes to prevent a pre-production server from being accidentally indexed and causing duplicate-content problems with the live site. Just remember to swap them out for the allow-all version before going live.
What robots.txt can and cannot do
- Can - Request (politely) that crawlers skip certain paths, conserving crawl budget and reducing discovery of low-value pages.
- Can - Advertise your sitemap location so crawlers discover your pages more quickly. Multiple Sitemap: lines are allowed.
- Can - Selectively block AI training crawlers (GPTBot, ClaudeBot, Google-Extended, etc.). Major AI companies have publicly committed to honoring robots.txt.
- Cannot - remove indexed URLs - Blocking a URL that is already indexed may actually keep it indexed longer, because Google can no longer re-crawl the page to read a noindex directive. Use <meta name="robots" content="noindex"> or the URL Removal tool in Search Console instead (see the snippet after this list).
- Cannot - secure a page - robots.txt is a public file. Listing a restricted path in it just advertises that something interesting is there. Access control belongs in authentication, IP allowlists, or a firewall - not robots.txt.
- Cannot - enforce anything - Malicious bots and scrapers ignore robots.txt entirely. Only well-behaved crawlers (search engines and reputable AI companies) honor it voluntarily.
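As a reference for the deindexing point above, either of the following tells search engines to drop a URL from the index (the page must remain crawlable so the directive can be read):

<meta name="robots" content="noindex">
X-Robots-Tag: noindex

The first belongs in the page's <head>; the second is an HTTP response header, handy for PDFs and other non-HTML files.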
Pre-deployment checklist
- Reachability - Open https://yourdomain/robots.txt in a browser and confirm it returns HTTP 200 with the correct content. Each subdomain needs its own file.
- Google Search Console robots.txt report - Search Console shows the parsed state, any syntax errors, the cached version, and the date it was last fetched.
- Bing Webmaster Tools robots.txt Tester - Instantly checks whether a specific URL is allowed or blocked for Bingbot.
- Case sensitivity and trailing slashes - URLs are case-sensitive. /Admin/ and /admin/ are different paths. The presence or absence of a trailing slash also matters.
- Wildcards - * matches any sequence of characters; $ anchors the end of a string. For example, Disallow: /*.pdf$ blocks all PDF files (see the example after this list).
- Comments - Lines starting with # are ignored by crawlers. Annotate your rules so future you can debug them without guessing.
- Propagation time - Crawlers typically cache robots.txt for up to 24 hours. For urgent changes, request a manual re-crawl via Search Console.
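A short sketch of the wildcard and comment rules above; the PDF pattern is only an illustration:

# Keep PDF files out of crawling (* matches anything, $ anchors the end)
User-agent: *
Disallow: /*.pdf$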
Common mistakes
- Accidentally deploying Disallow: / - This can wipe your entire site from search results. Always verify immediately after deploying.
- Blocking CSS and JavaScript - Disallowing /assets/ or /static/ prevents Google from rendering your pages correctly, harming your Core Web Vitals and mobile-friendliness scores.
- Writing Noindex: in robots.txt - Google stopped supporting this directive in 2019. Use the noindex meta tag or an X-Robots-Tag HTTP header instead.
- Inline comments on rule lines - Disallow: /admin/ # admin area can be misinterpreted by some parsers. Put comments on their own separate line (see the corrected snippet after this list).
- Not using a sitemap index for large sites - If you have many URLs spread across multiple sitemaps, create a sitemap index file and point the Sitemap: directive at that single index URL.
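For reference, a corrected version of the comment and sitemap-index points above (the paths and sitemap URL are placeholders):

# Admin area is blocked for all crawlers - comment kept on its own line
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap_index.xml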
To strengthen your SEO setup alongside robots.txt, add structured data with the JSON-LD Generator and use the Schema.org Types Reference to choose markup that matches the page. You can also make your site recognizable with the Favicon Generator.
FAQ
Is robots.txt required?
No. Without one, crawlers assume they can access all pages. However, it helps keep crawlers away from admin pages, internal search results, and other low-value content.
Can robots.txt completely prevent indexing?
No. robots.txt is a recommendation, not enforcement, and a blocked URL can still be indexed via external links. To keep a page out of the index entirely, leave it crawlable and use <meta name="robots" content="noindex"> instead.
Can I block AI crawlers like GPTBot or ClaudeBot?
Yes. Major AI crawlers (GPTBot, ChatGPT-User, ClaudeBot, Google-Extended, CCBot) respect robots.txt. Set their User-agent with Disallow: / to block training data collection.
Do all crawlers support Crawl-delay?
Google ignores Crawl-delay. Bing, Yandex, and Naver (Yeti) support it. For Google, adjust crawl rate via Search Console.
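For example, a Crawl-delay group can be scoped to a crawler that actually honors it; the 10-second value here is an arbitrary placeholder:

User-agent: Bingbot
Crawl-delay: 10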
Where should I place the robots.txt file?
It must be at the domain root, e.g. https://example.com/robots.txt. Placing it in a subdirectory (e.g. /blog/robots.txt) will not be recognized by crawlers.