Crazy number of guests

You need to install and use Cloudflare, since you don't have the option of my fail2ban solution, which runs at the server level (it won't work on managed hosting).
Discourage mode probably isn't computationally up to the task of handling the hundreds of requests per second these bots can dish out.

Every website has this problem with AI scrapers; hosting a website of any appreciable size on the internet is just like this today.
Thanks. Appreciate the reply. Well, then I guess I need to switch to Cloudflare.
 
At the moment I am being bombarded by IP addresses all starting with 43.173.*, all showing as "Viewing unknown page" because they have no access.

I tried putting the range 43.173.* into discourage mode, where they get redirected, but that slowed my site almost to a halt. I then removed it from discourage mode, after which everything worked normally again.

Around 1,000 guests at the moment, but the 43.173.* IPs are the only ones I'm seeing in the guest list right now, so they seem to be pushing everything else out. Not sure what to do. And why would the site almost shut down when I redirect them with discourage mode?

Same here. I IP-range banned them, but I think I might go into my cPanel and block them there as well.
 
Cloudflare makes enough of a difference that it should prevent your site from crashing, I think. But yes, XenForo's own site shows us that it's not as effective as we'd like.

You can also add ban lists in Cloudflare, but that shouldn't be your primary defense; AI scrapers change their IPs frequently. Your best defense catches the new IPs based on behavior. IP banning boosts the defense, but it takes some manual cultivation.

I prefer the fail2ban route because of its better results, but it's probably not available to you since you're on a managed service, and you need someone with Unix knowledge to tune it. So just use Cloudflare for now; it's your best shot.
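
To give a flavor of what "catching new IPs based on behavior" means, here's a rough, untested Python sketch of the kind of check fail2ban (or a cron job) can do: count requests per IP in the access log and print candidates to ban. It assumes a standard combined-format nginx/apache log, and the path and threshold are just examples you'd tune for your own site:

#!/usr/bin/env python3
# Rough sketch: flag IPs that hammer the site, based purely on request volume.
# Assumes the default combined log format (client IP is the first field).
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"   # adjust for your setup
THRESHOLD = 600                          # requests per log file; tune per site

def main() -> None:
    hits = Counter()
    with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
        for line in log:
            hits[line.split(" ", 1)[0]] += 1   # first field is the client IP

    for ip, count in hits.most_common():
        if count < THRESHOLD:
            break
        print(f"{ip}\t{count}")   # feed into fail2ban, an ipset, or a Cloudflare list

if __name__ == "__main__":
    main()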
 
Got hit with this last week also. Noticed HUGE spikes in random traffic coming from Brazil, where I usually get almost none. What I did was head into Cloudflare's security settings and set up a rule to present an "Interactive Challenge" to any traffic coming from Brazil. Stopped it dead in its tracks. Actual users get hit with the "I'm not a robot" thing or similar (I'm guessing, I'm not sure what it looks like), while the bots just get stonewalled. For now, anyway.
 
Also started getting hit a few weeks ago. My site is not huge, but I went from about 20-30 guests to 700-900 guests (or bots, I should say). I have all the bot protections from Cloudflare implemented. I would highly recommend some sort of guest page caching to alleviate the stress on your server, though.
 
Sad to hear that Cloudflare is barely working.

To me, making everyone click a captcha is a bad option because it gets in the way of the user experience. I want to avoid that whenever possible.

I also think that, since AI can already control a browser or computer fairly well, that defense technique is eventually going to get broken and become worthless. That may happen when the next one or two generations of AI hardware come out and lower the electricity cost of running AI to the point where it's economical to use it to break captchas, instead of paying cheap human labor to do the dirty work as happens now.

We may see this defense technique break within a few years, and that makes me nervous. It's why I built an uncommon type of defense system that runs on the server and makes its decisions based on hints from the application and webserver logs. Cloudflare can't see all of that information, so the accuracy with which you can dole out bot punishments is limited, and you may end up punishing the whole class.
 
Various websites that I host have been the target of these mass scraping/ingesting events since the summer of 2025. The problem is that this is worse than a cat-and-mouse, whack-a-mole game. It is almost a losing battle, and it requires significant time devotion just to shun or rate limit these abusive bot networks. That means significant modification of configurations at the server level, including nginx/apache, and even lower down the stack, such as ipset lists via nftables/iptables with full AS CIDR-range blocks. Unfortunately, this method has the potential to nab legitimate clients (those using a VPN hosted on a cloud/datacenter provider), but that is minor compared to the damage from these AI-scraper bots.

  1. Begin collecting the common IP addresses being used, at the nginx/apache level. Find which IPs are most prevalent and look up their AS number. Generate a full IPv4 and IPv6 list and begin blocking them at the iptables level. Blocking various cloud providers is generally a good start (a rough sketch of turning such a list into ipset/iptables rules follows right after this list).
  2. Utilize things like https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker and just slam the door shut on known bad actors before they even hit the website.
  3. Implement extreme rate limiting (error code 429) against page URLs that are not static content items (small static images, scripts, etc.). By extreme, I mean that no legitimate user is going to be requesting 50 random pages every minute which have absolutely no correlation to the previously accessed pages (there's a sketch of that logic further down, after the worst-case list).
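
For point 1, once you have a provider's prefixes in a text file (one CIDR per line, exported from bgp.tools, ipinfo, or similar), a small script can turn them into ipset/iptables commands. This is only a sketch of the idea; the set name and filename are examples, review the output before piping it into a shell, and note that IPv6 prefixes need a separate family inet6 set and ip6tables:

#!/usr/bin/env python3
# Sketch: turn a file of IPv4 CIDRs (one per line) into ipset/iptables commands.
import ipaddress
import sys

SET_NAME = "scraper_block"   # example set name

def main(prefix_file: str) -> None:
    print(f"ipset create {SET_NAME} hash:net family inet -exist")
    with open(prefix_file, encoding="utf-8") as fh:
        for line in fh:
            cidr = line.strip()
            if not cidr or cidr.startswith("#"):
                continue
            try:
                ipaddress.IPv4Network(cidr, strict=False)
            except ValueError:
                continue   # skip IPv6/malformed lines; handle those separately
            print(f"ipset add {SET_NAME} {cidr} -exist")
    # Drop anything in the set before it ever reaches nginx/apache.
    print(f"iptables -I INPUT -m set --match-set {SET_NAME} src -j DROP")

if __name__ == "__main__":
    main(sys.argv[1])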

In worst-case scenarios (all temporary, not long-term):
  1. Implement a sitewide, initial-visit captcha/are-you-human check.
  2. If using Cloudflare, you can use Cloudflare's "I'm Under Attack" mode.
  3. Country-code CIDR-range bans at the nftables/iptables level.
  4. Significantly restrict the content that can be viewed as a guest, requiring an account and being logged in with it to view. (This will harm your SEO rankings.)
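
On the rate-limiting point (no. 3 in the first list): in practice this lives in nginx (limit_req), apache, or Cloudflare rules, but the logic is simple enough to sketch. This is an illustrative, in-memory sliding-window limiter, not production code; the numbers and static paths are made up and you'd tune them to what a human can plausibly do on your site:

# Sketch of per-IP sliding-window rate limiting for non-static URLs.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_PAGE_HITS = 50                                  # "no human loads 50 pages a minute"
STATIC_PREFIXES = ("/js/", "/styles/", "/data/")    # example static paths, not counted

_hits = defaultdict(deque)

def allow_request(ip: str, path: str) -> bool:
    """Return False (i.e. answer with a 429) when an IP exceeds the page budget."""
    if path.startswith(STATIC_PREFIXES):
        return True
    now = time.monotonic()
    window = _hits[ip]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    return len(window) <= MAX_PAGE_HITS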


As for me, I've completely shut the door on the following cloud providers: Alibaba Cloud, Huawei Cloud, OVH Cloud, and TenCent Cloud. I'm not sure who was behind it, and I couldn't care less at this point. What transpired was an absolutely abusive AI scraper: mass spoofing of user agents, using a MASSIVE swath of IP ranges from all of these datacenters, and slamming a server that I host with thousands of page requests, effectively bringing the server to a halt. In recent weeks, I've applied fairly stringent rate limiting against the IP blocks from GoDaddy, Clouvider and Ionos. I have also observed on-and-off suspicious traffic from various Brazilian ISPs doing similar things to those recently mentioned, but it's nothing compared to the previous abusive behavior.


This AI-based mass site scraping is legitimately annoying and tiresome, and it seems to be a race over who can ingest the most data the fastest, with absolutely no respect for robots.txt, no attribution of the sources the content was ingested from, etc.
 
following cloud providers: Alibaba Cloud, Huawei Cloud, OVH Cloud, and TenCent Cloud
Other than TenCent, all the others have been pains for us over the last few years for hosting scrapers, OVH in particular earlier this year. We had a lot of trouble from Google's compute platform a few months back, and of course every now and then there is a bot on AWS EC2. At present it's just the Brazil traffic that got splatted; the rest is all below the level of attracting our attention, although I can see some of it is certainly scraping, just not worth stopping from a resource point of view.

Generally we've temporarily blocked the subnets the traffic is coming from, when it isn't clearly just, say, half a dozen IP addresses we can block individually. Our temporary blocks are normally six hours; I'd say in about 70% of cases that's seen them gone for a good while. The remaining 30% pick up as soon as the block expires and end up getting another six hours, which is generally enough to see the back of them.

Since lots of these "guests" tend to end up in the XF list as "viewing an error", I keep thinking I should at some point investigate where in the DB that data is stored, since it would be a useful datapoint for applying blocks, i.e. how many errors has this IP triggered and is that above whatever we think is normal. The other useful thing we do in our logs is add a key to distinguish member traffic from guest traffic, so we can more easily filter them when manually looking at a problem; that's crudely done based on the presence of the XF cookies. I dare say the scrapers will at some point start using accounts, but for now they seem crude enough that they just grab what they can without logging in.
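
Until I get round to digging into the DB, the webserver logs give a very similar signal. Something like this rough, untested sketch would do as a first pass; it assumes the standard combined log format, and the threshold is hypothetical:

#!/usr/bin/env python3
# Sketch: count 4xx responses per IP from a combined-format access log and
# print the IPs erroring far more often than a normal visitor would.
import re
from collections import Counter

LOG_FILE = "/var/log/nginx/access.log"   # adjust for your setup
ERROR_THRESHOLD = 100                    # hypothetical; tune to your traffic

# combined format: ip ident user [time] "request" status size "referer" "agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) ')

def main() -> None:
    errors = Counter()
    with open(LOG_FILE, encoding="utf-8", errors="replace") as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m:
                continue
            if 400 <= int(m.group(2)) < 500:
                errors[m.group(1)] += 1

    for ip, count in errors.most_common():
        if count < ERROR_THRESHOLD:
            break
        print(f"{ip}\t{count} client errors")   # candidates for a temporary block

if __name__ == "__main__":
    main()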
 
As for me, I've completely shut the door on the following cloud providers: Alibaba Cloud, Huawei Cloud, OVH Cloud, and TenCent Cloud. I'm not sure who was behind it, and I couldn't care less at this point. What transpired was an absolutely abusive AI scraper: mass spoofing of user agents, using a MASSIVE swath of IP ranges from all of these datacenters, and slamming a server that I host with thousands of page requests, effectively bringing the server to a halt. In recent weeks, I've applied fairly stringent rate limiting against the IP blocks from GoDaddy, Clouvider and Ionos. I have also observed on-and-off suspicious traffic from various Brazilian ISPs doing similar things to those recently mentioned, but it's nothing compared to the previous abusive behavior.

This AI-based mass site scraping is legitimately annoying and tiresome, and it seems to be a race over who can ingest the most data the fastest, with absolutely no respect for robots.txt, no attribution of the sources the content was ingested from, etc.
Can you post or DM me the IP ranges? I recently had my wiki slowed to a crawl, running at 100% RAM, by a similar scraper with spoofed user agents coming out of GoogleUserContent servers. I eventually blocked their entire range, and I might as well block those clouds before they come after my site too.
 
If you look up the AS numbers for the various providers, you can see which prefixes they are advertising and get a full list for each. For instance, Alibaba is advertising some ~440,000 addresses: https://bgp.tools/as/45102#prefixes or https://ipinfo.io/AS134963#block-summary (plenty of other companies offer such data). That's different from the ranges they might own themselves, of course. I'm sure there are lots of perfectly legitimate systems hosted on those clouds, but whether they should be talking to your XF is a separate question, I guess.
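
If you'd rather script that than copy it off those pages, RIPEstat has a public "announced-prefixes" endpoint which, as far as I know, returns the same data. A rough sketch, using Alibaba's AS45102 from the link above as the example:

#!/usr/bin/env python3
# Sketch: pull the prefixes an AS is currently announcing via RIPEstat's
# public data API, to feed into an ipset or a Cloudflare list.
import requests

RIPESTAT_URL = "https://stat.ripe.net/data/announced-prefixes/data.json"

def announced_prefixes(asn: str) -> list[str]:
    resp = requests.get(RIPESTAT_URL, params={"resource": asn}, timeout=30)
    resp.raise_for_status()
    return [entry["prefix"] for entry in resp.json()["data"]["prefixes"]]

if __name__ == "__main__":
    for prefix in announced_prefixes("AS45102"):   # Alibaba, per the bgp.tools link
        print(prefix)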
 
Sad to hear that Cloudflare is barely working.
Cloudflare helps a lot, but you have to put many hours of work into the config.

I downloaded the crawler IP JSON files from about 10 legitimate, good crawlers and made a Cloudflare list from those IPs. Then I made a security rule that says:

(
not ip.src in $known_good_crawler_ip_addresses
and (
http.user_agent contains "GPTBot"
or http.user_agent contains "ChatGPT-User"
or http.user_agent contains "PerplexityBot"
or http.user_agent contains "Perplexity-User"
or http.user_agent contains "OAI-SearchBot"
or http.user_agent contains "DuckDuckBot"
or http.user_agent contains "DuckAssistBot"
or http.user_agent contains "integralads"
or http.user_agent contains "Criteo"
or http.user_agent contains "AmazonAdBot"
or http.user_agent contains "AppleBot"
or http.user_agent contains "bingbot"
)
)

Block the request.

About 75% of all requests with user-agents that match the above are forged.

Also turn on (or set up):

  1. Definitely automated traffic: Managed Challenge
  2. Likely automated traffic: Managed Challenge
  3. Verified bots: Allow
  4. Challenge all requests NOT from the US. (Brazil and Singapore are the two worst offenders, but the cumulative number of worldwide bad-bot requests is astounding. Only 1.16% of these requests can answer the Cloudflare challenge.)
  5. Build your own list of user agents that should always be blocked. This will take a ton of time, but it's worth it.
  6. Rate shape the useful bots that make too many requests. Some will respond well to a 429; others are just rude and won't.

However, NEVER block or rate shape Googlebot, or you will pay the price. The "Verified bots: Allow" setting has a good list of Googlebots and will allow legit requests from Google while blocking forged Googlebot user agents. My point is: just let Cloudflare manage Googlebot and don't do anything else to it.

This has reduced our "guests" from 20,000 to 6,000, and 6,000 is only about 25% above the pre-bot-storm level.

If anyone wants my lists, I will be glad to share. I hate bad bots.
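
If you'd rather build the known-good-crawler list yourself: most of the big crawlers publish their IP ranges as JSON. The Googlebot file below is documented by Google; other vendors have similar files (check their docs for the exact URLs, the commented-out Bing one is from memory, so verify it). A rough sketch:

#!/usr/bin/env python3
# Sketch: collect verified-crawler CIDRs to paste into a Cloudflare IP list
# (the $known_good_crawler_ip_addresses list used in the rule above).
import ipaddress
import requests

SOURCES = {
    "googlebot": "https://developers.google.com/search/apis/ipranges/googlebot.json",
    # "bingbot": "https://www.bing.com/toolbox/bingbot.json",   # verify before use
}

def crawler_prefixes() -> list[str]:
    prefixes = []
    for url in SOURCES.values():
        data = requests.get(url, timeout=30).json()
        for entry in data.get("prefixes", []):
            cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
            if cidr:
                ipaddress.ip_network(cidr)   # sanity-check the CIDR
                prefixes.append(cidr)
    return prefixes

if __name__ == "__main__":
    print("\n".join(crawler_prefixes()))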
 
Can you post or DM me the IP ranges? I recently had my wiki slowed to a crawl, running at 100% RAM, by a similar scraper with spoofed user agents coming out of GoogleUserContent servers. I eventually blocked their entire range, and I might as well block those clouds before they come after my site too.
I've sent you a DM. :)

GoogleUserContent is generally going to be in the ranges of 35.208.0.0 - 35.247.255.255. ARIN reports it having the following CIDRs: 35.208.0.0/12, 35.224.0.0/12, and 35.240.0.0/13.
 