Robots.txt Generator
Easily create a robots.txt file to control search engine and AI crawler access to your website.
What is Robots.txt?
robots.txt is a text file placed at the root of your website that tells search engine crawlers which pages they can or cannot access. It must be located at the root of the host (e.g. https://example.com/robots.txt) and encoded in UTF-8.
Key Directives
User-agent: Specifies the crawler the rule applies to. * means all crawlers.
Allow: Permits crawling of the specified path. An Allow rule that is more specific than a conflicting Disallow takes precedence, which lets you carve exceptions out of blocked directories.
Disallow: Blocks crawling of the specified path.
Sitemap: Tells crawlers where your sitemap is located.
Crawl-delay: Sets seconds between requests (ignored by Google; supported by Bing, Yandex, etc.).
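A minimal sketch combining these directives (the paths and sitemap URL are placeholders; adjust them to your site):

User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml

Here every crawler is kept out of /private/ except the one explicitly allowed file, crawlers that honor Crawl-delay wait 10 seconds between requests, and the sitemap location is advertised.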
Real-world patterns
These are configurations commonly used on production sites. Before copying, adjust the paths to match your own site structure.
Personal blog or basic site
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
This is sufficient for most personal blogs and small sites. Simply declaring a Sitemap URL improves how quickly your pages are discovered.
WordPress standard setup
User-agent: *
Allow: /
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /feed/
Disallow: /trackback/
Disallow: /?s=
Sitemap: https://example.com/sitemap_index.xml
admin-ajax.php needs to be allowed because it is the public endpoint called by the front end. /?s= produces internal search result pages that can trigger duplicate-content issues, so blocking it is recommended.
Block AI training crawlers, allow search engines
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Sitemap: https://example.com/sitemap.xml
Regular search crawlers (Googlebot, Bingbot, Naver's Yeti) retain full access while generative-AI training crawlers are blocked. This became practical in 2023-2024, as major AI companies split their User-Agent strings between training crawlers and real-time browsing.
Staging or test server - full block
User-agent: *
Disallow: /
These two lines are all it takes to prevent a pre-production server from being accidentally indexed and causing duplicate-content problems with the live site. Just remember to swap them out for the allow-all version before going live.
What robots.txt can and cannot do
- Can - Request (politely) that crawlers skip certain paths, conserving crawl budget and reducing discovery of low-value pages.
- Can - Advertise your sitemap location so crawlers discover your pages more quickly. Multiple Sitemap: lines are allowed.
- Can - Selectively block AI training crawlers (GPTBot, ClaudeBot, Google-Extended, etc.). Major AI companies have publicly committed to honoring robots.txt.
- Cannot - remove indexed URLs - Blocking a URL that is already indexed may actually keep it indexed longer, because Google can no longer re-crawl the page to read a noindex directive. Use <meta name="robots" content="noindex"> or the URL Removal tool in Search Console instead (see the snippet after this list).
- Cannot - secure a page - robots.txt is a public file. Listing a restricted path in it just advertises that something interesting is there. Access control belongs in authentication, IP allowlists, or a firewall - not robots.txt.
- Cannot - enforce anything - Malicious bots and scrapers ignore robots.txt entirely. Only well-behaved crawlers (search engines and reputable AI companies) honor it voluntarily.
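As a reference for the deindexing point above, either of the following tells search engines to drop a URL from the index (the page must remain crawlable so the directive can be read):

<meta name="robots" content="noindex">
X-Robots-Tag: noindex

The first belongs in the page's <head>; the second is an HTTP response header, handy for PDFs and other non-HTML files.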
Pre-deployment checklist
- Reachability - Open https://yourdomain/robots.txt in a browser and confirm it returns HTTP 200 with the correct content. Each subdomain needs its own file.
- Google Search Console robots.txt report - Search Console shows the parsed state, any syntax errors, the cached version, and the date it was last fetched.
- Bing Webmaster Tools robots.txt Tester - Instantly checks whether a specific URL is allowed or blocked for Bingbot.
- Case sensitivity and trailing slashes - URLs are case-sensitive. /Admin/ and /admin/ are different paths. The presence or absence of a trailing slash also matters.
- Wildcards - * matches any sequence of characters; $ anchors the end of a string. For example, Disallow: /*.pdf$ blocks all PDF files (see the example after this list).
- Comments - Lines starting with # are ignored by crawlers. Annotate your rules so future you can debug them without guessing.
- Propagation time - Crawlers typically cache robots.txt for up to 24 hours. For urgent changes, request a manual re-crawl via Search Console.
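A short sketch of the wildcard and comment rules above; the PDF pattern is only an illustration:

# Keep PDF files out of crawling (* matches anything, $ anchors the end)
User-agent: *
Disallow: /*.pdf$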
Common mistakes
- Accidentally deploying Disallow: / - This can wipe your entire site from search results. Always verify immediately after deploying.
- Blocking CSS and JavaScript - Disallowing /assets/ or /static/ prevents Google from rendering your pages correctly, harming your Core Web Vitals and mobile-friendliness scores.
- Writing Noindex: in robots.txt - Google stopped supporting this directive in 2019. Use the noindex meta tag or an X-Robots-Tag HTTP header instead.
- Inline comments on rule lines - Disallow: /admin/ # admin area can be misinterpreted by some parsers. Put comments on their own separate line (see the corrected snippet after this list).
- Not using a sitemap index for large sites - If you have many URLs spread across multiple sitemaps, create a sitemap index file and point the Sitemap: directive at that single index URL.
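For reference, a corrected version of the comment and sitemap-index points above (the paths and sitemap URL are placeholders):

# Admin area is blocked for all crawlers - comment kept on its own line
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap_index.xml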
To strengthen your SEO setup alongside robots.txt, add structured data with the JSON-LD Generator and use the Schema.org Types Reference to choose markup that matches the page. You can also make your site recognizable with the Favicon Generator.
FAQ
Is robots.txt required?
No. Without one, crawlers assume they can access all pages. However, it helps keep crawlers away from admin pages, internal search results, and other low-value content.
Can robots.txt completely prevent indexing?
No. robots.txt is a recommendation, not enforcement, and a blocked URL can still be indexed via external links. To keep a page out of the index entirely, leave it crawlable and use <meta name="robots" content="noindex"> instead.
Can I block AI crawlers like GPTBot or ClaudeBot?
Yes. Major AI crawlers (GPTBot, ChatGPT-User, ClaudeBot, Google-Extended, CCBot) respect robots.txt. Set their User-agent with Disallow: / to block training data collection.
Do all crawlers support Crawl-delay?
Google ignores Crawl-delay. Bing, Yandex, and Naver (Yeti) support it. For Google, adjust crawl rate via Search Console.
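For example, a Crawl-delay group can be scoped to a crawler that actually honors it; the 10-second value here is an arbitrary placeholder:

User-agent: Bingbot
Crawl-delay: 10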
Where should I place the robots.txt file?
It must be at the domain root, e.g. https://example.com/robots.txt. Placing it in a subdirectory (e.g. /blog/robots.txt) will not be recognized by crawlers.