🔒 Security & Privacy

AI Crawlers Are Bankrupting Small Sites—Block Them Before Your Bill Arrives

Your site's humming along, serving real readers. Then bam—AI crawlers like Meta-ExternalAgent devour gigabytes of bandwidth, spiking your bills and slowing everything down.

*Figure: server logs spiking with Meta-ExternalAgent crawler traffic*

⚡ Key Takeaways

  • Audit logs now—AI crawlers like Meta's are silently spiking your bandwidth.
  • Umami + server blocks = cheap, effective defense without selling out to trackers.
  • This sparks a web tollbooth future; block today to shape tomorrow.
Umami's script is tiny—under 2KB—GDPR-ready, and gives you a dashboard that cuts through the fluff. Pair it with raw logs, though: Umami filters polite bots but misses the gorillas smashing your door.

Plausible? Even sleeker. Hosted from $9/month, or self-host for free. Their script's dead simple—a single tag (swap in your own domain):

```html
<script defer data-domain="yourdomain.com" src="https://plausible.io/js/script.js"></script>
```

Fathom's the paid pro: from $15/month, no self-hosting, but bulletproof. None of these stops crawlers on its own—they baseline human traffic so anomalies scream. Here's the table that crystallized it for me:

| Feature | Umami | Plausible | Fathom |
|---|---|---|---|
| Self-hosted | Yes | Yes | No |
| Open source | Yes | Yes | No |
| GDPR (no cookies) | Yes | Yes | Yes |
| Free tier | Self-host | Self-host | No |
| Hosted | N/A | $9/mo | $15/mo |
| API | Yes | Yes | Yes |
| Bot filter | Basic | Basic | Basic |

Spot the pattern? Self-hosting wins for control—and irony: you're dodging one data-hoover with tools that don't feed the beast.

## Is robots.txt Enough to Stop Meta and OpenAI Bots?

Want to ask politely? That's robots.txt. Slap this in:

```
User-agent: Meta-ExternalAgent
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# etc.

User-agent: Googlebot
Allow: /
```

Nice theory. Reality? It's voluntary. Meta claims to respect it, but only after the harm's done. Others? Laughable.

The real wall: server blocks. Nginx rules, baby:

```nginx
# http {} context: flag known AI crawler user-agents
map $http_user_agent $is_ai_crawler {
    default 0;
    ~*Meta-ExternalAgent 1;
    ~*GPTBot 1;
    # Add more patterns here
}

# server {} context: refuse flagged requests outright
if ($is_ai_crawler) {
    return 403;
}
```

Apache? Same trick in .htaccess with RewriteCond: pattern-match the user-agent, slam the door.

Why this combo? Layers. robots.txt for ethics, server blocks for teeth. And monitor—set alerts on traffic spikes. I scripted mine to Slack me at 2x baseline.

Look, companies spin this as 'innovation needs data.' Bull. It's piracy dressed as progress. Your site's not public domain—it's your rent. Prediction: by 2025 we'll see crawler micropayments or federated datasets, but until then, arm up.
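The Apache variant mentioned above can be sketched in .htaccess. A minimal version, assuming mod_rewrite is enabled; the bot names are the same examples used in the nginx block, so extend the pattern for your own blocklist:

```apache
RewriteEngine On
# Match known AI crawler user-agents, case-insensitively ([NC])
RewriteCond %{HTTP_USER_AGENT} (meta-externalagent|gptbot|claudebot) [NC]
# Forbid the request (403) and stop processing further rules
RewriteRule .* - [F,L]
```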
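The 2x-baseline Slack alert takes a few lines to script. A minimal sketch in Python, assuming you count requests per minute from your access log yourself; the baseline, current count, and webhook URL below are all placeholders:

```python
import json
import urllib.request

def should_alert(current_count, baseline_count, factor=2.0):
    """Alert when current traffic reaches `factor` times the baseline."""
    return current_count >= factor * baseline_count

def notify_slack(webhook_url, message):
    """Post a plain-text alert to a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example check against a hardcoded baseline (values are illustrative).
# In a real cron job, count the last minute's access-log lines instead.
baseline = 120   # req/min on a normal day
current = 250    # req/min right now
if should_alert(current, baseline):
    message = f"Traffic spike: {current} req/min vs baseline {baseline}"
    # notify_slack("https://hooks.slack.com/services/...", message)  # webhook URL is a placeholder
```

Run it from cron every minute and the alert fires only when traffic doubles, so a normal busy hour stays quiet.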
Deeper why: these bots hit at 10-100x human rates—no JS execution, just straight HTTP GETs. It's an architectural mismatch: the web was built for browsers, not bulk scrapers. The fix? Rate-limit unknowns and fingerprint anomalies. Tools like Fail2Ban can be tuned for this now.

One site I audited dropped 40% of its bandwidth post-block. Pages loaded 200ms faster. Real people felt it—fewer bounces.

Corporate hype alert: AI firms say 'opt out via robots.txt.' Too late—and it ignores the scraping that's already happened. Don't buy it.

## Why Letting Them In Could Cost You Thousands

Short answer: it will. Costs stack—bandwidth, CPU, opportunity. A 1GB site? Meta could slurp it hourly. Multiply by days.

But weigh the tradeoffs. Block too hard and lose legit indexing? Nah—Googlebot's a separate agent. Anthropic? They claim ClaudeBot honors robots.txt; test yours.

My stack now: Umami + the nginx map + Cloudflare WAF rules for extras like Bytespider (TikTok's sneak). Peace restored.

---

### 🧬 Related Insights

- **Read more:** [Rust 1.94.1 Drops: Swift Fixes for Crashes, Cert Woes, and Sneaky CVEs](https://opensourcebeat.com/article/announcing-rust-1941/)
- **Read more:** [AI Agents That Lie Less: A No-BS Framework for Self-Awareness](https://opensourcebeat.com/article/how-to-give-your-ai-agent-self-awareness-a-practical-framework/)

## Frequently Asked Questions

**What are the best tools to detect AI crawlers?**
Umami or Plausible for baselines, server logs for truth. Self-host to own it.

**How do I block AI crawlers on nginx?**
Use a user-agent map to 403 bad actors—full config above.

**Does robots.txt stop GPTBot and Meta bots?**
It's a request they might ignore; pair it with server blocks.
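One last piece: the "rate-limit unknowns" advice from earlier can be sketched in nginx with the stock limit_req module. A minimal example—the zone size, rate, and burst values are illustrative, so tune them to your real traffic:

```nginx
# http {} context: one shared zone keyed by client IP,
# allowing ~2 requests/second on average
limit_req_zone $binary_remote_addr zone=crawl_guard:10m rate=2r/s;

server {
    location / {
        # Tolerate short bursts; reject sustained hammering
        limit_req zone=crawl_guard burst=20 nodelay;
        limit_req_status 429;
    }
}
```

This catches the bulk scrapers that spoof a browser user-agent and slip past the map above.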
Published by

theAIcatchup

Community-driven. Code-first.


Originally reported by Dev.to
