Rate Limiting AI Crawler Bots with ModSecurity

Updated on March 18, 2026 by Sam Page

AI training bots from OpenAI, Anthropic, Amazon, and a dozen other companies are now hitting production web servers with the same aggression as a DDoS attack, and robots.txt isn't stopping them. This guide walks through how InMotion's systems team uses ModSecurity to enforce per-bot rate limiting at the server level, without cutting off your site's…

The Problem: AI Bots That Don't Follow the Rules

robots.txt has been the de facto agreement between websites and web crawlers for decades. A directive like Crawl-delay: 10 tells compliant bots to wait 10 seconds between requests. Google gives you a way to configure crawl rate through Google Search Console. Traditional search crawlers have operated within these boundaries long enough that most sysadmins never thought much about them.

LLM training crawlers are a different story. Starting in 2024, InMotion's systems administration teams began seeing a pattern of unusually heavy traffic across shared and dedicated infrastructure. The source wasn't a single bot running wild. It was several bots, each operated by a different AI company, simultaneously crawling the same servers with no delay between requests and no respect for Crawl-delay directives. None of them coordinated with each other. None of them needed to. The combined load of GPTBot, ClaudeBot, Amazonbot, and their peers hitting the same server concurrently produces resource exhaustion that looks functionally identical to an unintentional distributed denial-of-service attack.
That surprises a lot of website owners who assume robots.txt is binding. It isn't. It's a convention, and these bots aren't observing it.

Two Options, One Clear Tradeoff

The blunt instrument is a full block via .htaccess. You can deny access by User-Agent, and the bots stop hitting your server entirely. Problem solved, except it isn't: your site also disappears from AI-driven discovery systems. For businesses that want to appear in AI-generated answers or LLM-powered search features, blocking training crawlers entirely carries a real long-term cost.

Rate limiting is the better path. You slow the bots down to a pace your server can absorb. They still index your content. You still maintain visibility. And when a bot refuses to respect the rate limit you've set, you block that specific request rather than the bot permanently.

How ModSecurity Rate Limiting Works

ModSecurity is an open-source Web Application Firewall that operates inside Apache or Nginx, inspecting HTTP traffic in real time. It's the same tool that blocks SQL injection attempts and cross-site scripting attacks on properly hardened servers. What makes it useful here is its ability to track request frequency by User-Agent and deny requests that exceed a defined threshold.

The approach works in two steps:

1. Identify the incoming request by User-Agent string and increment a per-host counter.
2. If that counter exceeds the allowed limit before it expires, deny the request with a 429 Too Many Requests response and set a Retry-After header.

That Retry-After header matters. It explicitly tells the bot how long to wait before its next request. A well-behaved crawler will honor it. One that doesn't gets blocked on its next attempt.

The ModSecurity Rules

Below are the rate-limiting rules InMotion Hosting's systems team developed and currently deploys. Each rule set targets a specific bot by User-Agent and enforces a maximum of one request per 3 seconds per hostname.
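Before the per-bot rule sets, it may help to see the two steps as an annotated skeleton. This is an illustrative sketch, not a deployable rule: BOTNAME is a placeholder and the rule IDs 90001/90002 are arbitrary examples you would replace with unused IDs.

```apache
# Step 1: match the bot's User-Agent, key the USER collection to the
# requested hostname, and increment a counter that expires after 3 seconds.
SecRule REQUEST_HEADERS:User-Agent "@pm BOTNAME" \
    "id:90001,phase:2,nolog,pass,setuid:%{request_headers.host},\
    setvar:user.ratelimit_botname=+1,expirevar:user.ratelimit_botname=3"

# Step 2: if the counter exceeds 1 before it expires, deny with 429
# and flag the request via an environment variable. The chained rule
# re-checks the User-Agent so only the bot itself is denied.
SecRule USER:RATELIMIT_BOTNAME "@gt 1" \
    "chain,id:90002,phase:2,deny,status:429,setenv:RATELIMITED_BOTNAME,\
    log,msg:'RATELIMITED BOTNAME'"
    SecRule REQUEST_HEADERS:User-Agent "@pm BOTNAME"

# Plain Apache directives: advertise the wait time on flagged requests
# and set a body for the 429 response.
Header always set Retry-After "3" env=RATELIMITED_BOTNAME
ErrorDocument 429 "Too Many Requests"
```

The environment variable set by the deny rule is what lets mod_headers attach the Retry-After header only to rate-limited responses rather than to all traffic.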
GPTBot (OpenAI)

# Limit GPTBot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm GPTBot" \
    "id:13075,phase:2,nolog,pass,setuid:%{request_headers.host},\
    setvar:user.ratelimit_gptbot=+1,expirevar:user.ratelimit_gptbot=3"
SecRule USER:RATELIMIT_GPTBOT "@gt 1" \
    "chain,id:13076,phase:2,deny,status:429,setenv:RATELIMITED_GPTBOT,\
    log,msg:'RATELIMITED GPTBOT'"
    SecRule REQUEST_HEADERS:User-Agent "@pm GPTBot"
Header always set Retry-After "3" env=RATELIMITED_GPTBOT
ErrorDocument 429 "Too Many Requests"

ClaudeBot (Anthropic)

# Limit ClaudeBot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm ClaudeBot" \
    "id:13077,phase:2,nolog,pass,setuid:%{request_headers.host},\
    setvar:user.ratelimit_claudebot=+1,expirevar:user.ratelimit_claudebot=3"
SecRule USER:RATELIMIT_CLAUDEBOT "@gt 1" \
    "chain,id:13078,phase:2,deny,status:429,setenv:RATELIMITED_CLAUDEBOT,\
    log,msg:'RATELIMITED CLAUDEBOT'"
    SecRule REQUEST_HEADERS:User-Agent "@pm ClaudeBot"
Header always set Retry-After "3" env=RATELIMITED_CLAUDEBOT
ErrorDocument 429 "Too Many Requests"

Amazonbot

# Limit Amazonbot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm Amazonbot" \
    "id:13079,phase:2,nolog,pass,setuid:%{request_headers.host},\
    setvar:user.ratelimit_amazonbot=+1,expirevar:user.ratelimit_amazonbot=3"
SecRule USER:RATELIMIT_AMAZONBOT "@gt 1" \
    "chain,id:13080,phase:2,deny,status:429,setenv:RATELIMITED_AMAZONBOT,\
    log,msg:'RATELIMITED AMAZONBOT'"
    SecRule REQUEST_HEADERS:User-Agent "@pm Amazonbot"
Header always set Retry-After "3" env=RATELIMITED_AMAZONBOT
ErrorDocument 429 "Too Many Requests"

Adapting the Rules for Other Bots

The structure is the same for every bot. To add coverage for a new crawler, copy any rule set and make two changes:

1. Replace the User-Agent string (e.g., GPTBot) with the new bot's identifier.
2. Assign unique id values and unique env variable names to avoid conflicts with existing rules.

The id field must be unique across your entire ModSecurity configuration. If you're adding these to an existing ruleset, check what IDs are already in use before assigning new ones. Duplicate IDs will cause ModSecurity to reject the configuration at load time.

For reference, a growing list of known AI crawler User-Agent strings includes Bytespider, CCBot, Google-Extended, Meta-ExternalAgent, and PerplexityBot, among others.
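As a sketch of that adaptation, here is the same template applied to Bytespider. The rule IDs 13081 and 13082 are assumptions; confirm they are unused in your configuration before deploying.

```apache
# Limit Bytespider hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm Bytespider" \
    "id:13081,phase:2,nolog,pass,setuid:%{request_headers.host},\
    setvar:user.ratelimit_bytespider=+1,expirevar:user.ratelimit_bytespider=3"
SecRule USER:RATELIMIT_BYTESPIDER "@gt 1" \
    "chain,id:13082,phase:2,deny,status:429,setenv:RATELIMITED_BYTESPIDER,\
    log,msg:'RATELIMITED BYTESPIDER'"
    SecRule REQUEST_HEADERS:User-Agent "@pm Bytespider"
Header always set Retry-After "3" env=RATELIMITED_BYTESPIDER
ErrorDocument 429 "Too Many Requests"
```

Only the User-Agent string, the rule IDs, and the variable names changed; everything else carries over from the original template.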
The Dark Visitors project maintains a reasonably current catalogue of known AI agent identifiers.

What Happens After You Deploy

Once these rules are active, a bot that makes two requests to the same hostname within a 3-second window receives a 429 on the second request. The Retry-After: 3 header tells it to wait before trying again. From there, behavior splits into two categories:

Bots that respect the header slow down automatically. They continue indexing your content at a pace your server can handle. Resources are conserved, and your site stays accessible to the crawlers worth caring about.

Bots that ignore the header keep hitting the deny rule on every subsequent request until their internal retry logic kicks in or they move on. Either way, they're consuming a fraction of the resources they would have without rate limiting in place.

You won't fix the underlying problem of AI companies deploying aggressive crawlers without consent. But you stop absorbing the cost of their indexing operations on your hardware.

Prerequisites and Where to Apply These Rules

These rules require ModSecurity to be installed and enabled on your server. On InMotion Hosting Dedicated Servers and VPS plans, ModSecurity is available through cPanel's WHM interface under Security Center > ModSecurity. The rules can be added as custom rules through WHM or directly in your server's ModSecurity configuration directory.

If you're on a managed dedicated server, InMotion Hosting's Advanced Product Support team can assist with custom ModSecurity rule deployment. Customers with Premier Care have access to InMotion Solutions for exactly this kind of custom server configuration work.

Shared hosting environments don't support custom ModSecurity rules at the account level. If aggressive bot traffic is a problem on shared hosting, the options are limited to .htaccess blocks or upgrading to a VPS or dedicated server where you have full WAF configurability.
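One way to confirm the rules are actually firing after deployment is to search the Apache error log for the msg strings the deny rules emit. The sketch below assumes a cPanel-style log path; adjust LOG for your distribution.

```shell
#!/bin/sh
# Tally how often each bot has been rate limited, based on the
# "RATELIMITED <BOT>" messages logged by the deny rules.
# The default path is an assumption; override with the LOG variable.
LOG="${LOG:-/usr/local/apache/logs/error_log}"

grep -o 'RATELIMITED [A-Z]*' "$LOG" | sort | uniq -c | sort -rn
```

A steady stream of these messages means the rules are working; their complete absence, despite heavy bot traffic in the access log, suggests the rules are not loading or the User-Agent match is not hitting.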
A Note on robots.txt

None of this replaces a well-structured robots.txt file. Keeping Crawl-delay directives in place for compliant bots remains worthwhile, and explicitly listing AI crawlers you want to restrict adds a documented signal of intent, even if some bots ignore it. The ModSecurity rules handle enforcement for the ones that won't self-regulate. robots.txt for bots that respect conventions; ModSecurity rate limiting for the ones that don't. The two layers work together.

Summary

AI training crawlers don't observe robots.txt the way traditional search bots do, and the combined load from multiple simultaneous indexing operations can degrade server performance for legitimate traffic. ModSecurity's User-Agent-based rate limiting gives you server-side control over how frequently these bots can request resources, without requiring you to block them from indexing your site entirely. The rules are straightforward to deploy, extend to any bot by copying the template, and provide explicit signaling via Retry-After headers for crawlers that are capable of honoring them.

If you're seeing unexplained spikes in server load or HTTP request volume that don't correlate with real user traffic, check your access logs for AI crawler User-Agents before assuming you're dealing with something more complex.
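As a starting point for that access-log check, the sketch below counts hits per known AI crawler. The log path and the bot list are assumptions to adapt to your server.

```shell
#!/bin/sh
# Count access-log requests from common AI crawler User-Agents.
# The default path is an assumption; override with the ACCESS_LOG variable.
ACCESS_LOG="${ACCESS_LOG:-/usr/local/apache/logs/access_log}"

grep -oE 'GPTBot|ClaudeBot|Amazonbot|Bytespider|CCBot|PerplexityBot' "$ACCESS_LOG" \
    | sort | uniq -c | sort -rn
```

If a handful of these User-Agents account for a large share of total requests, rate limiting them is likely to recover meaningful server capacity.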