If you've been online since the late 1990s, you know about robots.txt. It's a tiny plain-text file at the root of a website that tells search engine crawlers what they can and can't access. For most of the modern web's history, that meant one thing: tell Googlebot what to look at.
That world changed in 2023. Major AI platforms started shipping their own crawlers — separate from Google, with their own user-agent strings and their own rules. Today there are at least ten distinct AI bots crawling the web, and most small business owners have no idea any of them exist.
This matters because blocking one of them — even accidentally — makes your business invisible to that platform. A ChatGPT user asking "who's the best contractor in Knoxville?" will never see you if GPTBot can't crawl your site. Same for Claude, Perplexity, and the rest.
The major AI crawlers, at a glance
Here are the bots most worth knowing about — who runs them, what they do, and exactly which user-agent string to recognize in your robots.txt or server logs.
| Bot | Operator | Purpose | User-agent |
|---|---|---|---|
| GPTBot | OpenAI | Training future models and powering ChatGPT browsing | GPTBot |
| OAI-SearchBot | OpenAI | Indexes pages for ChatGPT Search (the real-time answer feature) | OAI-SearchBot |
| ChatGPT-User | OpenAI | Fetches pages on demand when a user asks ChatGPT to read a specific URL | ChatGPT-User |
| ClaudeBot | Anthropic | Crawls content for training and Claude's web tools | ClaudeBot |
| Claude-Web | Anthropic | On-demand fetch when a Claude user references a URL | Claude-Web |
| PerplexityBot | Perplexity | Powers Perplexity's real-time search-style answers | PerplexityBot |
| Google-Extended | Google | Controls whether your content is used for AI Overviews and Gemini training (separate from regular Googlebot) | Google-Extended |
| Applebot-Extended | Apple | Controls use of your content in Apple Intelligence (separate from regular Applebot) | Applebot-Extended |
| Bytespider | ByteDance | Crawls content for Doubao and other ByteDance AI | Bytespider |
| CCBot | Common Crawl | Open dataset used by many AI projects to train and evaluate | CCBot |
Two distinctions worth understanding right away:
- Training crawlers vs. live-search crawlers. GPTBot and ClaudeBot mostly crawl for model training. OAI-SearchBot and PerplexityBot pull pages in real time when someone asks a question. Both kinds matter — training shapes what the model knows generally; live search shapes what it cites in a specific answer.
- "Extended" bots are opt-out for AI specifically. Google-Extended and Applebot-Extended exist precisely so site owners can block AI use without blocking regular search ranking. If you allow Googlebot but block Google-Extended, you'll still rank in Google Search but be excluded from AI Overviews.
How to check if you're blocking any of them
The fastest way is to visit your own robots.txt file in a browser. The URL is always the same pattern:
https://yourwebsite.com/robots.txt
Look for any blocks that mention the bots in the table above. A block looks like this:
```
User-agent: GPTBot
Disallow: /
```
That tells GPTBot it isn't allowed anywhere on your site. If you find this, your business is invisible to ChatGPT.
If you scan the file and don't see any of the AI bots mentioned, you're fine — the default for any unmentioned bot is allow. The Beacon audit checks for all of the major ones automatically and flags any that are blocked.
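If you'd rather script the check than eyeball the file, Python's standard-library robots.txt parser can do it. Here's a minimal sketch — the bot list mirrors the table above, and the `sample` file is a stand-in for your own robots.txt contents:

```python
# Sketch: scan a robots.txt body for blocked AI crawlers, using only
# Python's standard library. The bot list mirrors the table above.
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Google-Extended", "Applebot-Extended", "Bytespider", "CCBot",
]

def blocked_ai_bots(robots_txt: str) -> list[str]:
    """Return the AI bots that may not fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, "/")]

# Example robots.txt that blocks only GPTBot:
sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
print(blocked_ai_bots(sample))  # ['GPTBot']
```

Paste in the contents of your own file and anything that prints is a bot you're turning away.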
Why so many sites accidentally block AI crawlers
Three common reasons we see, in order of frequency:
1. A WordPress plugin or theme set it up "to save bandwidth"
Some performance-focused or "AI privacy" plugins now ship with default blocks for AI crawlers. Site owners install the plugin without realizing it added five new Disallow rules. We see this constantly.
2. A 2023-era "block AI training" decision that's overdue for review
When AI training became a hot topic in 2023, lots of site owners added blocks to GPTBot specifically. Some did it for licensing reasons (publishers protecting original content); many did it reflexively. If you're a small business, the original argument never really applied to you — and now that AI is also a discovery channel, you're losing customers to a decision you don't remember making.
3. Server-side blocking at the hosting or CDN layer
Some hosting providers and CDNs (Cloudflare in particular) have offered "Block AI Bots" toggles. If your developer or hosting account ever flipped that switch, the block happens at the network layer — and your robots.txt looks fine even though you're still blocking the bot in practice. Worth checking inside your hosting dashboard, not just your file.
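One way to catch a network-layer block is to request your homepage twice — once with a normal browser user-agent and once identifying as a bot — and compare status codes. A sketch, assuming the CDN matches on the `GPTBot/1.1` token (the URL is a placeholder; use your own):

```python
# Sketch: detect a CDN/WAF-level block that robots.txt won't show.
# A 403 or 429 for the bot user-agent while a browser gets 200
# suggests a network-layer rule, not a robots.txt rule.
from urllib.request import Request, urlopen
from urllib.error import HTTPError

def fetch_status(url: str, user_agent: str) -> int:
    """Return the HTTP status code for a GET with the given User-Agent."""
    req = Request(url, headers={"User-Agent": user_agent})
    try:
        with urlopen(req, timeout=10) as resp:
            return resp.status
    except HTTPError as err:
        return err.code

def likely_network_block(browser_status: int, bot_status: int) -> bool:
    """True when the bot is rejected while a normal browser succeeds."""
    return browser_status < 400 and bot_status in (401, 403, 429)

# Example run against your own site (uncomment and swap in your URL):
# browser = fetch_status("https://yourwebsite.com/", "Mozilla/5.0")
# bot = fetch_status("https://yourwebsite.com/", "GPTBot/1.1")
# print(likely_network_block(browser, bot))
```

It's a heuristic, not proof — some firewalls challenge rather than reject — but a clean robots.txt plus a 403 for the bot is a strong hint to go look at the hosting dashboard.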
How to let AI crawlers in
The fix is usually one of two things, depending on where the block lives.
In your robots.txt
The simplest option is to leave your robots.txt file silent on AI bots — they default to allowed. If your file currently contains lines like User-agent: GPTBot followed by Disallow: /, you can either delete those blocks or replace them with explicit allows:
```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /
```
These are explicit, machine-readable, and remove any ambiguity.
In your hosting or CDN settings
If you use Cloudflare, log into your dashboard and look for any "AI Scraper and Crawler" or "Block AI Bots" toggles under Security → Bots. Make sure they're off (or set to allow). Other hosts have similar settings buried in security or firewall sections — check your hosting provider's docs.
Should you ever block them?
For most small and mid-sized businesses, no. Blocking AI crawlers makes your business unreachable through one of the fastest-growing discovery channels of the next five years. The "we don't want our data used for training" instinct is reasonable for publishers and creators with original commercial content — but for a contractor's About page or a coffee shop's hours, it's hard to articulate a real cost.
The asymmetry matters: the upside of allowing AI crawlers is being discoverable in AI search. The downside, for most businesses, is approximately nothing.
There are exceptions:
- Publishers with original journalism or paywalled content may want to block training crawlers while still allowing live-search crawlers.
- Membership-only or private-platform sites shouldn't expose member data to AI training.
- Businesses with licensed content (stock photos, premium courses, paywalled databases) may have contractual reasons to block.
If you fall into one of those categories, block selectively. Otherwise, let them in.
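For a publisher in that first category, a selective robots.txt splits along the training/live-search line from earlier. A sketch — adjust the bot list to your own licensing situation:

```
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Allow live-search crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
```

This keeps your content out of training datasets while letting real-time answer engines still find and cite you.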
llms.txt: the other file AI crawlers look for
Once you've cleared the path with robots.txt, the next move is making your site easier to summarize. That's what llms.txt is for — a small, curated file at the root of your site that tells AI crawlers which pages best explain your business. It's not a security file. It's a hospitality file.
Adding one is a 30-minute job and one of the highest-leverage things any small business website can do this year.
Common questions
Should I allow AI crawlers on my business website?
For most small and mid-sized businesses, yes. Blocking them means your business can't be discovered, summarized, or recommended through AI search. Unless you have a specific reason to opt out — paywalled publication, original content you don't want trained on, or licensing concerns — allowing them is the default.
How do I tell if I'm blocking GPTBot or ClaudeBot right now?
Visit https://yourwebsite.com/robots.txt in a browser and look for any User-agent: GPTBot or User-agent: ClaudeBot lines followed by Disallow: /. Those block the bot completely. If you see no mention of those user agents, you're not blocking them — the default is allow. You can also run a free Beacon audit, which checks all the major AI bots automatically.
What's the difference between Googlebot and Google-Extended?
Googlebot crawls your site for regular Google Search results. Google-Extended is a separate, opt-out controller for whether Google can use your content for AI Overviews and Gemini training. You can allow Googlebot for ranking while blocking Google-Extended for AI — but most businesses want both allowed.
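The split is easy to verify with Python's standard-library parser: blocking Google-Extended leaves Googlebot untouched, because they're matched as separate user agents. A quick sketch:

```python
# Sketch: Googlebot vs. Google-Extended are independent robots.txt groups.
from urllib.robotparser import RobotFileParser

robots = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots.splitlines())

print(parser.can_fetch("Googlebot", "/"))        # True: still ranks in Search
print(parser.can_fetch("Google-Extended", "/"))  # False: excluded from AI use
```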
Will allowing GPTBot slow my site down?
In practice, no. The major AI crawlers generally respect robots.txt rules and rate limits the way Googlebot does, and their crawl footprint is small compared to regular search traffic. We've never seen a site experience meaningful load from AI crawlers.
If I unblock them today, when will I show up in ChatGPT?
It depends. Live-search crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User) can find you within hours of unblocking. Training-only crawlers (GPTBot, ClaudeBot's training arm) influence future model updates, which roll out on the vendor's schedule — weeks to months. The Google-Extended path into AI Overviews typically updates within days to a few weeks.