If AI-related crawlers are aggressively crawling your web server or site and causing high CPU load, blocking or throttling them is prudent, especially if:
- You don't want your content used for AI training or enhanced AI features.
- They are not delivering measurable value (e.g., traffic, SEO, brand exposure).
- You're seeing negative performance impacts.
Unless you're running a news outlet, knowledge base, or SEO-sensitive site relying on AI-enhanced visibility, the benefit of these crawlers is likely minimal.
To control how your content is used or scraped by LLMs and generative AI platforms, use the solution below, which covers both robots.txt and server-level blocks.
Recommended robots.txt to Block AI Crawlers
Place this at the root of your site (/robots.txt):
# Default rule for all bots
User-agent: *
Allow: /
# Block AI and LLM-related crawlers
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Googlebot-Extended
Disallow: /
User-agent: Bard-Extended
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Anthropic-AI
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: Bytespider
Disallow: /
# Sitemap declarations
Sitemap: https://www.example_domain.com/sitemap.xml
Sitemap: https://www.example_domain.com/sitemap_index.xml
Sitemap: https://www.example_domain.com/sitemap.txt
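Once the file is live, a quick fetch confirms the AI-bot stanzas are actually being served (a simple check; substitute your real domain for example_domain.com):
$ curl -s https://www.example_domain.com/robots.txt | grep -iA1 -E "gptbot|claudebot|ccbot|bytespider"
Each matched User-agent line should be followed by its Disallow: / rule. Keep in mind robots.txt is advisory; only well-behaved crawlers honor it, which is why the server-level blocks below are also worth applying.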
Why do these matter?
| Bot/User-Agent | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Crawls content for training ChatGPT |
| ChatGPT-User | OpenAI | ChatGPT browsing and plugin-related requests |
| ClaudeBot, Anthropic-AI | Anthropic | Crawls content for Claude models |
| Google-Extended | Google | AI model training (Gemini, Bard) |
| Googlebot-Extended | Google | Enhanced search or AI features |
| Bard-Extended | Google (alias) | Possible alias/test bot for Bard |
| CCBot | Common Crawl | Open dataset used to train most LLMs |
| Bytespider | ByteDance/TikTok | Scrapes content for training models |
| facebookexternalhit | Meta | Link previews / link scraping |
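Before (or after) you block anything, it helps to measure how often these bots actually hit your server. A rough example against a combined-format Apache access log; the log path is an assumption, so adjust it to your environment:
# Count requests per AI crawler in the current access log (the user agent is the 6th quote-delimited field)
awk -F'"' '{print $6}' /etc/apache2/logs/access_log \
  | grep -ioE "gptbot|chatgpt-user|claudebot|anthropic-ai|google-extended|ccbot|bytespider|facebookexternalhit" \
  | sort | uniq -c | sort -rn
If these counts are negligible, robots.txt alone may be enough; if they dominate your traffic, the server-level blocks below are justified.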
Optional Server-Level Blocking (Recommended for Resource Protection)
Apache / LiteSpeed (.htaccess)
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
SetEnvIfNoCase User-Agent "CCBot" bad_bot
SetEnvIfNoCase User-Agent "Bytespider" bad_bot
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
NGINX
if ($http_user_agent ~* (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit)) {
return 403;
}
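For context, this condition normally lives inside the relevant server block, before your location rules. A minimal sketch, assuming a standard server block for example_domain.com:
server {
    listen 80;
    server_name example_domain.com;

    # Return 403 to known AI/LLM crawlers before any other processing
    if ($http_user_agent ~* (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit)) {
        return 403;
    }

    # ... rest of the site configuration ...
}
Reload NGINX after editing ($ nginx -t && systemctl reload nginx) so the change takes effect.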
Adding AI Bot Block to Apache via WHM Includes (cPanel)
If you are a root administrator and want to integrate this AI/LLM bot-blocking logic into cPanel/WHM's Apache configuration, you can do so with a Pre VirtualHost Include. Here's exactly how to do it, step by step:
1. Log into WHM
- Navigate to:
WHM » Service Configuration » Apache Configuration » Include Editor
2. Edit Pre VirtualHost Include
- Under "Pre VirtualHost Include", select:
- All Versions
- Then click "Edit" to open the editor.
3. Paste the Block Code Below
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
SetEnvIfNoCase User-Agent "CCBot" bad_bot
SetEnvIfNoCase User-Agent "Bytespider" bad_bot
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
RewriteEngine On
RewriteCond %{ENV:bad_bot} ^1$
RewriteRule ^.* - [F,L]
4. Save and Rebuild Apache
- Click "Update", then "Restart Apache" when prompted.
Verify Deployment
$ curl -A "ClaudeBot" https://example_domain.com
You should see a 403 Forbidden response.
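To spot-check every blocked user agent in one pass, a small shell loop like this can help (a quick sketch; substitute your real domain for example_domain.com):
# Print the HTTP status code returned for each blocked user agent
for ua in GPTBot ChatGPT-User ClaudeBot Anthropic-AI Google-Extended \
          Googlebot-Extended Bard-Extended CCBot Bytespider facebookexternalhit; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -A "$ua" https://example_domain.com/)
  echo "$ua -> $code"   # expect 403 once the block is active
done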
Tip for Admins
- This global include ensures newly created vhosts also get protection by default.
- Use /etc/apache2/logs/error_log or domain-specific logs to spot access denials (see the example below).
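A rough way to count denials, assuming cPanel's default log location and the combined log format (adjust the path and pattern to your server):
# 403s served to flagged crawlers (the HTTP status is the 9th whitespace-separated field)
grep -iE "gptbot|claudebot|anthropic-ai|ccbot|bytespider" /etc/apache2/logs/access_log | awk '$9 == 403' | wc -l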
Strategic Note
If you're using Plesk, you may want to:
- Add .htaccess rules via Plesk File Manager or CLI in each site's root
- Monitor web server logs (/var/www/vhosts/system/DOMAIN/logs/) for bot hits
- Consider Fail2Ban or ModSecurity if abuse is ongoing
Apache 2.4+ Bot Block for AI & LLM Crawlers
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
SetEnvIfNoCase User-Agent "CCBot" bad_bot
SetEnvIfNoCase User-Agent "Bytespider" bad_bot
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
Location
- If this is for a specific domain, place it in /var/www/vhosts/example_domain.com/httpdocs/.htaccess
- If this is for all vhosts, add it to /etc/apache2/conf-enabled/security.conf or another appropriate global config (see the sketch below)
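A minimal sketch of how the rules could be laid out in such a global file; outside .htaccess, the Require directives need a container such as <Location>. Note that <Location> sections are merged after <Directory> sections and .htaccess files, so verify this does not override stricter rules elsewhere:
# /etc/apache2/conf-enabled/security.conf
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "CCBot" bad_bot
# ... repeat for the remaining user agents listed above ...

<Location "/">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>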
Testing Suggestions
- After applying, run:
$ curl -A "GPTBot" https://example_domain.com
You should see a 403 Forbidden response.
- Confirm Apache is reading your .htaccess: make sure AllowOverride All (or at least AllowOverride FileInfo) is enabled in your <Directory> block (see the example below).
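For reference, a minimal sketch of what that <Directory> block might look like in the vhost or global configuration; the document root path is only an example:
<Directory "/var/www/vhosts/example_domain.com/httpdocs">
    # FileInfo is the minimum override needed for SetEnvIf/mod_rewrite rules in .htaccess
    AllowOverride All
    Require all granted
</Directory>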
Target Only Specific Paths
Blocking the entire website from AI-related crawlers isn't always the best approach. Let's target only /knowledgebase/ and /knowledgebase.php:
This allows us to:
- Keep high CPU bots from crawling resource-heavy help docs.
- Still allow them to index your homepage or marketing pages (if needed).
- Avoid over-blocking legitimate crawlers or users.
Then adjust both Apache and LiteSpeed rules to limit the block only to those paths.
For Apache 2.4+ (WHM Include)
Use this instead in the Pre VirtualHost Include:
SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path
SetEnvIf kb_path 1 is_kb
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
SetEnvIfNoCase User-Agent "CCBot" bad_bot
SetEnvIfNoCase User-Agent "Bytespider" bad_bot
SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
RewriteEngine On
RewriteCond %{ENV:is_kb} ^1$
RewriteCond %{ENV:bad_bot} ^1$
RewriteRule ^.* - [F,L]
This block:
- Combines URI targeting with bot fingerprinting.
- Denies access only when the bot is flagged and the path is /knowledgebase or /knowledgebase.php.
❗WARNING:
Note that mod_authz_core directives like Require are not allowed in the pre_virtualhost_global.conf include because it sits outside a <Directory> (or similar) container. So we replicate the same logic using mod_rewrite, which works fine in pre_virtualhost_global.conf.
Let's apply our modification:
- Go to WHM → Service Configuration → Apache Configuration → Include Editor
- Choose Pre VirtualHost Include → All Versions
- Paste the corrected block above
- Save and Rebuild Configuration
- Restart Apache from WHM or via sudo /scripts/restartsrv_httpd
For LiteSpeed or .htaccess
If you're using LiteSpeed or working at the per-account .htaccess level:
RewriteEngine On
# Block AI bots only for specific paths
RewriteCond %{REQUEST_URI} ^/knowledgebase($|/) [NC,OR]
RewriteCond %{REQUEST_URI} ^/knowledgebase\.php$ [NC]
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit) [NC]
RewriteRule ^.* - [F,L]
Plesk Apache Configuration Guidance
Let's set this up for a Debian/Ubuntu-based Plesk stack (the same directives apply on AlmaLinux/Rocky/CentOS-based installs).
Option A: Use Additional Apache Directives (per domain)
- Plesk Panel → Domains → [example_domain] → Apache & nginx Settings
- Scroll to Additional Apache directives
- Paste this under "Additional directives for HTTP and HTTPS":
SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path
SetEnvIf kb_path 1 is_kb
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
RewriteEngine On
RewriteCond %{ENV:is_kb} ^1$
RewriteCond %{ENV:bad_bot} ^1$
RewriteRule ^.* - [F,L]
- Save & apply, and Plesk will reload Apache for you.
Option B: .htaccess Override (Individual Site Level)
This is ideal if:
- You own the server and don't want to modify global Apache configs
- You're managing a single domain without root or Plesk admin panel access
Drop this into .htaccess (in the site root or /knowledgebase/):
SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path
SetEnvIf kb_path 1 is_kb
SetEnvIfNoCase User-Agent "GPTBot" bad_bot
SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
RewriteEngine On
RewriteCond %{ENV:is_kb} ^1$
RewriteCond %{ENV:bad_bot} ^1$
RewriteRule ^.* - [F,L]
If .htaccess rules don't seem to take effect, confirm:
- AllowOverride All is enabled for your directory in the Apache config.
- The .htaccess file is in the right directory and has no syntax issues.
TL;DR — When to Use What?
| Use Case | Method |
|---|---|
| WHM/cPanel server | WHM Include Editor (Pre VirtualHost) |
| Plesk panel admin (per domain) | Apache & nginx Settings → Additional Directives |
| Individual site (any Apache) | .htaccess in site root or /knowledgebase/ |
| All servers, consistent policy | Use all three for defense in depth |
Test Your Configuration
$ curl -A "GPTBot" https://example_domain.com/knowledgebase/ # ➤ Should return 403
$ curl -A "GPTBot" https://example_domain.com/index.html # ➤ Should return 200 OK
$ curl -A "Mozilla" https://example_domain.com/knowledgebase/ # ➤ Should return 200 OK