AI-related crawlers abusing web servers

 

If AI-related crawlers are aggressively hitting your web server or site and causing high CPU load, blocking or throttling them is prudent, especially if:

  1. You don't want your content used for AI training or enhanced AI features.
  2. They are not delivering measurable value (e.g., traffic, SEO, brand exposure).
  3. You're seeing negative performance impacts.

Unless you're running a news outlet, knowledge base, or SEO-sensitive site relying on AI-enhanced visibility, the benefit of these crawlers is likely minimal.

To control how your content is used or scraped by LLMs and generative AI platforms, use the solution below, which covers both robots.txt and server-level blocks.

Recommended robots.txt to Block AI Crawlers

Place this at the root of your site (/robots.txt):

robots.txt

      # Default rule for all bots
      User-agent: *
      Allow: /

      # Block AI and LLM-related crawlers
      User-agent: GPTBot
      Disallow: /

      User-agent: ChatGPT-User
      Disallow: /

      User-agent: Google-Extended
      Disallow: /

      User-agent: Googlebot-Extended
      Disallow: /

      User-agent: Bard-Extended
      Disallow: /

      User-agent: ClaudeBot
      Disallow: /

      User-agent: Anthropic-AI
      Disallow: /

      User-agent: CCBot
      Disallow: /

      User-agent: facebookexternalhit
      Disallow: /

      User-agent: Bytespider
      Disallow: /

      # Sitemap declarations
      Sitemap: https://www.example_domain.com/sitemap.xml
      Sitemap: https://www.example_domain.com/sitemap_index.xml
      Sitemap: https://www.example_domain.com/sitemap.txt
  
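To confirm the file is actually being served from the web root, a quick check using this article's placeholder domain:

      $ curl -s https://www.example_domain.com/robots.txt | head -5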

Why do these matter?

Bot/User-Agent           Company            Purpose
GPTBot                   OpenAI             Crawls pages for ChatGPT model training
ChatGPT-User             OpenAI             User-initiated browsing/plugin requests from ChatGPT
ClaudeBot, Anthropic-AI  Anthropic          Crawls pages for Claude models
Google-Extended          Google             AI model training (Gemini, formerly Bard)
Googlebot-Extended       Google             Enhanced search or AI features
Bard-Extended            Google (alias)     Possible alias/test bot for Bard
CCBot                    Common Crawl       Open dataset used to train many LLMs
Bytespider               ByteDance/TikTok   Scrapes content for model training
facebookexternalhit      Facebook           Link previews / link scraping
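
Before blocking anything, it's worth measuring how much traffic these crawlers actually generate. A rough one-liner against a combined-format access log (the log path here is an assumption; substitute your server's actual log):

      $ grep -ohiE 'GPTBot|ChatGPT-User|ClaudeBot|Anthropic-AI|Google-Extended|CCBot|Bytespider|facebookexternalhit' \
            /var/log/apache2/access.log | sort | uniq -c | sort -rn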

 

Optional Server-Level Blocking (Recommended for Resource Protection)

Apache / LiteSpeed (.htaccess)

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <RequireAll>
        Require all granted
        Require not env bad_bot
      </RequireAll>

NGINX

NGINX Configuration

      if ($http_user_agent ~* (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit)) {
          return 403;
      }

Place this inside the relevant server {} block (or a specific location {} block).
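
If you'd rather throttle these crawlers than block them outright (both options are mentioned in the introduction), a rate-limiting sketch using nginx's standard map and limit_req modules could look like this; the zone name, size, and rate are illustrative assumptions:

      # http {} context: key AI-agent requests by client IP; everyone else gets
      # an empty key, which limit_req skips entirely
      map $http_user_agent $ai_bot {
          default                                  "";
          ~*(GPTBot|ClaudeBot|CCBot|Bytespider)    $binary_remote_addr;
      }
      limit_req_zone $ai_bot zone=aibots:10m rate=1r/s;

      # server {} or location {} context: allow small bursts; excess requests
      # receive 503 by default
      limit_req zone=aibots burst=5 nodelay;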

Adding AI Bot Block to Apache via WHM Includes (cPanel)

If you are a root admin and want to integrate this AI/LLM bot-blocking logic into cPanel/WHM's Apache configuration, you can do so using a Pre VirtualHost Include. Here's exactly how, step by step:

1. Log into WHM

  • Navigate to:
    WHM » Service Configuration » Apache Configuration » Include Editor

2. Edit Pre VirtualHost Include

  • Under "Pre VirtualHost Include", select:
    • All Versions
  • Then click "Edit" to open the editor.

3. Paste the Block Code Below

WHM Pre VirtualHost Include Code

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>

4. Save and Rebuild Apache

  • Click "Update", then "Restart Apache" when prompted (or use the shell commands below).
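
From the shell, you can sanity-check the configuration and restart without the UI (commands assume a cPanel EA4 box; /scripts/restartsrv_apache is the same script referenced later in this article):

      $ apachectl -t                  # should print "Syntax OK"
      $ /scripts/restartsrv_apache    # cPanel-safe Apache restart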

Verify Deployment


      $ curl -I -A "ClaudeBot" https://example_domain.com
    

You should see a 403 Forbidden.

Tip for Admins

  • This global include ensures newly created vhosts also get protection by default.
  • Use /etc/apache2/logs/error_log or domain-specific logs to spot access denials (example below).
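
For example, to confirm the denials are landing, filter a domain's access log for 403s served to the blocked agents (the domlog path is a cPanel-style assumption; adjust for your layout):

      $ awk '$9 == 403' /etc/apache2/logs/domlogs/example_domain.com | \
            grep -iE 'GPTBot|ClaudeBot|CCBot|Bytespider' | tail -5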

Strategic Note

If you're using Plesk, you may want to:

  • Add .htaccess rules via Plesk File Manager or CLI in each site's root
  • Monitor web server logs (/var/www/vhosts/system/DOMAIN/logs/) for bot hits
  • Consider Fail2Ban or ModSecurity if abuse is ongoing (a rough Fail2Ban sketch follows below)
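
For the Fail2Ban route, a sketch follows; the filter name, log path, regex, and thresholds are all assumptions to adapt, not a tested drop-in:

      # /etc/fail2ban/filter.d/ai-bots.conf (hypothetical filter name)
      [Definition]
      failregex = ^<HOST> .* "(?:GET|POST|HEAD)[^"]*" \d+ \S+ "[^"]*" "[^"]*(?:GPTBot|ClaudeBot|CCBot|Bytespider)[^"]*"$
      ignoreregex =

      # Append to /etc/fail2ban/jail.local
      [ai-bots]
      enabled  = true
      port     = http,https
      filter   = ai-bots
      logpath  = /var/www/vhosts/system/*/logs/access_ssl_log
      maxretry = 20
      findtime = 60
      bantime  = 3600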

Apache 2.4+ Bot Block for AI & LLM Crawlers

Apache 2.4+ Configuration

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <RequireAll>
          Require all granted
          Require not env bad_bot
      </RequireAll>

Location

  • If this is for a specific domain, place it in:
    /var/www/vhosts/example_domain.com/httpdocs/.htaccess
  • If this is for all vhosts, add to:
    /etc/apache2/conf-enabled/security.conf or another appropriate global config (see the Debian/Ubuntu example below)
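
For the Debian/Ubuntu global-config route, something like this wires it up (assuming you saved the snippet as /etc/apache2/conf-available/security.conf; a2enconf symlinks it into conf-enabled):

      $ a2enconf security
      $ apachectl -t && systemctl reload apache2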

Testing Suggestions

  1. After applying, run:
    
              $ curl -I -A "GPTBot" https://example_domain.com
            
    You should get a 403 Forbidden response.
  2. Confirm Apache is reading your .htaccess:
    Make sure AllowOverride All (or at least AllowOverride FileInfo) is enabled in the corresponding <Directory> block, as in the snippet below.
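
A minimal example, using the per-domain document root from the Location section above (the path is illustrative):

      <Directory "/var/www/vhosts/example_domain.com/httpdocs">
          AllowOverride All
      </Directory>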

Target Only Specific Paths

Blocking the entire website from AI-related crawlers isn't always the best approach. Let's target only /knowledgebase/ and /knowledgebase.php:

This allows us to:

  • Keep high CPU bots from crawling resource-heavy help docs.
  • Still allow them to index your homepage or marketing pages (if needed).
  • Avoid over-blocking legitimate crawlers or users.

Then adjust both Apache and LiteSpeed rules to limit the block only to those paths.

For Apache 2.4+ (WHM Include)

Use this instead in the Pre VirtualHost Include:

Target Specific Paths in Apache

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
        SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path

        SetEnvIf kb_path 1 is_kb

        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:is_kb} ^1$
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>

This configuration:

  • Combines URI targeting with bot fingerprinting.
  • Denies access only if the bot is on the block list and the path is /knowledgebase or /knowledgebase.php.

WARNING:

Note that mod_authz_core directives like Require are not allowed in the pre_virtualhost_global.conf include because it sits outside a <Directory> or <Location> block. So, let's replicate the same logic using mod_rewrite, which works fine in pre_virtualhost_global.conf.

Let's apply our modification:

  1. Go to WHM → Service Configuration → Apache Configuration → Include Editor
  2. Choose Pre VirtualHost Include → All Versions
  3. Paste the corrected block above
  4. Save and Rebuild Configuration
  5. Restart Apache from WHM or via sudo /scripts/restartsrv_apache

For LiteSpeed or .htaccess

If you're using LiteSpeed or working at the per-account .htaccess level:

LiteSpeed Configuration

      <IfModule mod_rewrite.c>
        RewriteEngine On

        # Block AI bots only for specific paths
        RewriteCond %{REQUEST_URI} ^/knowledgebase($|/) [NC,OR]
        RewriteCond %{REQUEST_URI} ^/knowledgebase\.php$ [NC]
        RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit) [NC]
        RewriteRule ^.* - [F,L]
      </IfModule>

Plesk Apache Configuration Guidance

Let's set this up for a Debian/Ubuntu-based Plesk stack (or an AlmaLinux/Rocky/CentOS-based one).

Option A: Use Additional Apache Directives (per domain)

  1. Plesk Panel → Domains → [example_domain] → Apache & nginx Settings
  2. Scroll to Additional Apache directives
  3. Paste this under "Additional directives for HTTP and HTTPS":
Plesk Apache Additional Directives

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
        SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path

        SetEnvIf kb_path 1 is_kb

        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:is_kb} ^1$
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>
  4. Save & apply, and Plesk will reload Apache for you.
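
If you prefer the command line, typical Plesk installs can regenerate a domain's web server configuration with the httpdmng utility (verify availability on your Plesk version before relying on it):

      $ plesk sbin httpdmng --reconfigure-domain example_domain.com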

.htaccess Override (Individual Site Level)

This is ideal if:

  • You own the server and don't want to modify global Apache configs
  • You're managing a single domain without root or Plesk admin panel access

Drop this into .htaccess (in the site root or /knowledgebase/):

.htaccess Configuration

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
        SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path

        SetEnvIf kb_path 1 is_kb

        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:is_kb} ^1$
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>

If .htaccess rules don't seem to take effect, confirm:

  • AllowOverride All is enabled for your directory in Apache config.
  • The .htaccess file is in the right directory and has no syntax issues.

TL;DR — When to Use What?

Use Case                         Method
WHM/cPanel server                WHM Include Editor (Pre VirtualHost)
Plesk admin (per domain)         Apache & nginx Settings → Additional Directives
Individual site (any Apache)     .htaccess in site root or /knowledgebase/
All servers, consistent policy   Use all three for defense in depth

Test Your Configuration


      $ curl -A "GPTBot" https://example_domain.com/knowledgebase/         # ➤ Should return 403
      $ curl -A "GPTBot" https://example_domain.com/index.html             # ➤ Should return 200 OK
      $ curl -A "Mozilla" https://example_domain.com/knowledgebase/        # ➤ Should return 200 OK
    
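To sweep several user agents at once, a small shell loop prints just the HTTP status per agent (assumes curl and the placeholder domain used throughout):

      $ for ua in GPTBot ClaudeBot CCBot Bytespider Mozilla; do \
            printf '%-12s ' "$ua"; \
            curl -s -o /dev/null -w '%{http_code}\n' -A "$ua" https://example_domain.com/knowledgebase/; \
        done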

 

By Trax Armstrong, Cloud Infrastructure Specialist with 10+ years of expertise in AWS & Google Cloud, Linux systems administration, cPanel/WHM, and server security. Technical lead for global infrastructure solutions specializing in web hosting optimization, cloud migrations, and high-performance server configurations.
