AI-related crawlers abusing web servers

 

If AI-related crawlers are aggressively hitting your web server or site and causing high CPU load, blocking or throttling them is prudent, especially if:

  1. You don't want your content used for AI training or enhanced AI features.
  2. They are not delivering measurable value (e.g., traffic, SEO, brand exposure).
  3. You're seeing negative performance impacts.

Unless you're running a news outlet, knowledge base, or SEO-sensitive site relying on AI-enhanced visibility, the benefit of these crawlers is likely minimal.

To control how your content is used or scraped by LLMs and generative AI platforms, use the solution below, which covers both robots.txt and server-level blocks.

Recommended robots.txt to Block AI Crawlers

Place this at the root of your site (/robots.txt):

robots.txt

      # Default rule for all bots
      User-agent: *
      Allow: /

      # Block AI and LLM-related crawlers
      User-agent: GPTBot
      Disallow: /

      User-agent: ChatGPT-User
      Disallow: /

      User-agent: Google-Extended
      Disallow: /

      User-agent: Googlebot-Extended
      Disallow: /

      User-agent: Bard-Extended
      Disallow: /

      User-agent: ClaudeBot
      Disallow: /

      User-agent: Anthropic-AI
      Disallow: /

      User-agent: CCBot
      Disallow: /

      User-agent: facebookexternalhit
      Disallow: /

      User-agent: Bytespider
      Disallow: /

      # Sitemap declarations
      Sitemap: https://www.example_domain.com/sitemap.xml
      Sitemap: https://www.example_domain.com/sitemap_index.xml
      Sitemap: https://www.example_domain.com/sitemap.txt
  
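To confirm the file is actually being served from the web root, a quick check using this article's placeholder domain:

      $ curl -s https://www.example_domain.com/robots.txt | head -5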

Why do these matter?

Bot/User-Agent           Company            Purpose
GPTBot                   OpenAI             Crawls pages for ChatGPT model training
ChatGPT-User             OpenAI             User-initiated browsing/plugin requests from ChatGPT
ClaudeBot, Anthropic-AI  Anthropic          Crawls pages for Claude models
Google-Extended          Google             AI model training (Gemini, formerly Bard)
Googlebot-Extended       Google             Enhanced search or AI features
Bard-Extended            Google (alias)     Possible alias/test bot for Bard
CCBot                    Common Crawl       Open dataset used to train many LLMs
Bytespider               ByteDance/TikTok   Scrapes content for model training
facebookexternalhit      Facebook           Link previews / link scraping
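
Before blocking anything, it's worth measuring how much traffic these crawlers actually generate. A rough one-liner against a combined-format access log (the log path here is an assumption; substitute your server's actual log):

      $ grep -ohiE 'GPTBot|ChatGPT-User|ClaudeBot|Anthropic-AI|Google-Extended|CCBot|Bytespider|facebookexternalhit' \
            /var/log/apache2/access.log | sort | uniq -c | sort -rn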

 

Optional Server-Level Blocking (Recommended for Resource Protection)

Apache / LiteSpeed (.htaccess)

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <RequireAll>
        Require all granted
        Require not env bad_bot
      </RequireAll>

NGINX

NGINX Configuration

      if ($http_user_agent ~* (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit)) {
          return 403;
      }

Place this inside the relevant server {} block (or a specific location {} block).
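
If you'd rather throttle these crawlers than block them outright (both options are mentioned in the introduction), a rate-limiting sketch using nginx's standard map and limit_req modules could look like this; the zone name, size, and rate are illustrative assumptions:

      # http {} context: key AI-agent requests by client IP; everyone else gets
      # an empty key, which limit_req skips entirely
      map $http_user_agent $ai_bot {
          default                                  "";
          ~*(GPTBot|ClaudeBot|CCBot|Bytespider)    $binary_remote_addr;
      }
      limit_req_zone $ai_bot zone=aibots:10m rate=1r/s;

      # server {} or location {} context: allow small bursts; excess requests
      # receive 503 by default
      limit_req zone=aibots burst=5 nodelay;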

Adding AI Bot Block to Apache via WHM Includes (cPanel)

If you are a root admin and want to integrate this AI/LLM bot-blocking logic into cPanel/WHM's Apache configuration, you can do so using a Pre VirtualHost Include. Here's exactly how, step by step:

1. Log into WHM

  • Navigate to:
    WHM » Service Configuration » Apache Configuration » Include Editor

2. Edit Pre VirtualHost Include

  • Under "Pre VirtualHost Include", select:
    • All Versions
  • Then click "Edit" to open the editor.

3. Paste the Block Code Below

WHM Pre VirtualHost Include Code

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>

4. Save and Rebuild Apache

  • Click "Update", then "Restart Apache" when prompted (or use the shell commands below).
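
From the shell, you can sanity-check the configuration and restart without the UI (commands assume a cPanel EA4 box; /scripts/restartsrv_apache is the same script referenced later in this article):

      $ apachectl -t                  # should print "Syntax OK"
      $ /scripts/restartsrv_apache    # cPanel-safe Apache restart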

Verify Deployment


      $ curl -I -A "ClaudeBot" https://example_domain.com
    

You should see a 403 Forbidden.

Tip for Admins

  • This global include ensures newly created vhosts also get protection by default.
  • Use /etc/apache2/logs/error_log or domain-specific logs to spot access denials (example below).
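
For example, to confirm the denials are landing, filter a domain's access log for 403s served to the blocked agents (the domlog path is a cPanel-style assumption; adjust for your layout):

      $ awk '$9 == 403' /etc/apache2/logs/domlogs/example_domain.com | \
            grep -iE 'GPTBot|ClaudeBot|CCBot|Bytespider' | tail -5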

Strategic Note

If you're using Plesk, you may want to:

  • Add .htaccess rules via Plesk File Manager or CLI in each site's root
  • Monitor web server logs (/var/www/vhosts/system/DOMAIN/logs/) for bot hits
  • Consider Fail2Ban or ModSecurity if abuse is ongoing (a rough Fail2Ban sketch follows below)
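
For the Fail2Ban route, a sketch follows; the filter name, log path, regex, and thresholds are all assumptions to adapt, not a tested drop-in:

      # /etc/fail2ban/filter.d/ai-bots.conf (hypothetical filter name)
      [Definition]
      failregex = ^<HOST> .* "(?:GET|POST|HEAD)[^"]*" \d+ \S+ "[^"]*" "[^"]*(?:GPTBot|ClaudeBot|CCBot|Bytespider)[^"]*"$
      ignoreregex =

      # Append to /etc/fail2ban/jail.local
      [ai-bots]
      enabled  = true
      port     = http,https
      filter   = ai-bots
      logpath  = /var/www/vhosts/system/*/logs/access_ssl_log
      maxretry = 20
      findtime = 60
      bantime  = 3600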

Apache 2.4+ Bot Block for AI & LLM Crawlers

Apache 2.4+ Configuration

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <RequireAll>
          Require all granted
          Require not env bad_bot
      </RequireAll>

Location

  • If this is for a specific domain, place it in:
    /var/www/vhosts/example_domain.com/httpdocs/.htaccess
  • If this is for all vhosts, add to:
    /etc/apache2/conf-enabled/security.conf or another appropriate global config (see the Debian/Ubuntu example below)
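
For the Debian/Ubuntu global-config route, something like this wires it up (assuming you saved the snippet as /etc/apache2/conf-available/security.conf; a2enconf symlinks it into conf-enabled):

      $ a2enconf security
      $ apachectl -t && systemctl reload apache2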

Testing Suggestions

  1. After applying, run:
    
              $ curl -I -A "GPTBot" https://example_domain.com
            
    You should get a 403 Forbidden response.
  2. Confirm Apache is reading your .htaccess:
    Make sure AllowOverride All (or at least AllowOverride FileInfo) is enabled in the corresponding <Directory> block, as in the snippet below.
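
A minimal example, using the per-domain document root from the Location section above (the path is illustrative):

      <Directory "/var/www/vhosts/example_domain.com/httpdocs">
          AllowOverride All
      </Directory>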

Target Only Specific Paths

Blocking the entire website from AI-related crawlers isn't always the best approach. Let's target only /knowledgebase/ and /knowledgebase.php:

This allows us to:

  • Keep high CPU bots from crawling resource-heavy help docs.
  • Still allow them to index your homepage or marketing pages (if needed).
  • Avoid over-blocking legitimate crawlers or users.

Then adjust both Apache and LiteSpeed rules to limit the block only to those paths.

For Apache 2.4+ (WHM Include)

Use this instead in the Pre VirtualHost Include:

Target Specific Paths in Apache

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
        SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path

        SetEnvIf kb_path 1 is_kb

        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:is_kb} ^1$
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>

This configuration:

  • Combines URI targeting with bot fingerprinting.
  • Denies access only if the bot is on the block list and the path is /knowledgebase or /knowledgebase.php.

WARNING:

Note that mod_authz_core directives like Require are not allowed in the pre_virtualhost_global.conf include because it sits outside a <Directory> or <Location> block. So, let's replicate the same logic using mod_rewrite, which works fine in pre_virtualhost_global.conf.

Let's apply our modification:

  1. Go to WHM → Service Configuration → Apache Configuration → Include Editor
  2. Choose Pre VirtualHost Include → All Versions
  3. Paste the corrected block above
  4. Save and Rebuild Configuration
  5. Restart Apache from WHM or via sudo /scripts/restartsrv_apache

For LiteSpeed or .htaccess

If you're using LiteSpeed or working at the per-account .htaccess level:

LiteSpeed Configuration

      <IfModule mod_rewrite.c>
        RewriteEngine On

        # Block AI bots only for specific paths
        RewriteCond %{REQUEST_URI} ^/knowledgebase($|/) [NC,OR]
        RewriteCond %{REQUEST_URI} ^/knowledgebase\.php$ [NC]
        RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT|ClaudeBot|Anthropic-AI|Google-Extended|Googlebot-Extended|Bard-Extended|CCBot|Bytespider|facebookexternalhit) [NC]
        RewriteRule ^.* - [F,L]
      </IfModule>

Plesk Apache Configuration Guidance

Let's set this up for a Debian/Ubuntu-based Plesk stack (or an AlmaLinux/Rocky/CentOS-based one).

Option A: Use Additional Apache Directives (per domain)

  1. Plesk Panel → Domains → [example_domain] → Apache & nginx Settings
  2. Scroll to Additional Apache directives
  3. Paste this under "Additional directives for HTTP and HTTPS":
Plesk Apache Additional Directives

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
        SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path

        SetEnvIf kb_path 1 is_kb

        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:is_kb} ^1$
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>
  4. Save & apply, and Plesk will reload Apache for you.
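
If you prefer the command line, typical Plesk installs can regenerate a domain's web server configuration with the httpdmng utility (verify availability on your Plesk version before relying on it):

      $ plesk sbin httpdmng --reconfigure-domain example_domain.com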

.htaccess Override (Individual Site Level)

This is ideal if:

  • You own the server and don't want to modify global Apache configs
  • You're managing a single domain without root or Plesk admin panel access

Drop this into .htaccess (in the site root or /knowledgebase/):

.htaccess Configuration

      <IfModule mod_setenvif.c>
        SetEnvIfNoCase Request_URI "^/knowledgebase($|/)" kb_path
        SetEnvIfNoCase Request_URI "^/knowledgebase\.php$" kb_path

        SetEnvIf kb_path 1 is_kb

        SetEnvIfNoCase User-Agent "GPTBot" bad_bot
        SetEnvIfNoCase User-Agent "ChatGPT" bad_bot
        SetEnvIfNoCase User-Agent "ClaudeBot" bad_bot
        SetEnvIfNoCase User-Agent "Anthropic-AI" bad_bot
        SetEnvIfNoCase User-Agent "Google-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Googlebot-Extended" bad_bot
        SetEnvIfNoCase User-Agent "Bard-Extended" bad_bot
        SetEnvIfNoCase User-Agent "CCBot" bad_bot
        SetEnvIfNoCase User-Agent "Bytespider" bad_bot
        SetEnvIfNoCase User-Agent "facebookexternalhit" bad_bot
      </IfModule>

      <IfModule mod_rewrite.c>
          RewriteEngine On
          RewriteCond %{ENV:is_kb} ^1$
          RewriteCond %{ENV:bad_bot} ^1$
          RewriteRule ^.* - [F,L]
      </IfModule>

If .htaccess rules don't seem to take effect, confirm:

  • AllowOverride All is enabled for your directory in Apache config.
  • The .htaccess file is in the right directory and has no syntax issues.

TL;DR — When to Use What?

Use Case                         Method
WHM/cPanel server                WHM Include Editor (Pre VirtualHost)
Plesk admin (per domain)         Apache & nginx Settings → Additional Directives
Individual site (any Apache)     .htaccess in site root or /knowledgebase/
All servers, consistent policy   Use all three for defense in depth

Test Your Configuration


      $ curl -A "GPTBot" https://example_domain.com/knowledgebase/         # ➤ Should return 403
      $ curl -A "GPTBot" https://example_domain.com/index.html             # ➤ Should return 200 OK
      $ curl -A "Mozilla" https://example_domain.com/knowledgebase/        # ➤ Should return 200 OK
    
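To sweep several user agents at once, a small shell loop prints just the HTTP status per agent (assumes curl and the placeholder domain used throughout):

      $ for ua in GPTBot ClaudeBot CCBot Bytespider Mozilla; do \
            printf '%-12s ' "$ua"; \
            curl -s -o /dev/null -w '%{http_code}\n' -A "$ua" https://example_domain.com/knowledgebase/; \
        done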

 

By Trax Armstrong, Cloud Infrastructure Specialist with 10+ years of expertise in AWS & Google Cloud, Linux systems administration, cPanel/WHM, and server security. Technical lead for global infrastructure solutions specializing in web hosting optimization, cloud migrations, and high-performance server configurations.
