Crazy amount of guests

Funnily enough, I've just noticed an increase in "guests" - mainly Vietnam, with some Brazil and China. About 500 at the moment - not as high as some sites see, but that's high for my site. My average is usually 180, including the robots I know about as well as guests. This is 500-ish guests alone.

So what would it be from Vietnam?
 
Oh, the traffic has been bouncing about different Southeast Asian countries for a while now. The enormous network in Brazil was kind of a surprise.
 
Funnily enough, I've just noticed an increase in "guests" - mainly Vietnam, with some Brazil and China. About 500 at the moment - not as high as some sites see, but that's high for my site. My average is usually 180, including the robots I know about as well as guests. This is 500-ish guests alone.

So what would it be from Vietnam?
If you're getting persistent random-page-access traffic from parts of the world that traditionally would not visit your site, you may have something known as residential proxies, aka RESIPs, 'browsing' your site. Check out this paper if you're curious to learn more: https://ieeexplore.ieee.org/document/10814519 ( https://dl.ifip.org/db/conf/cnsm/cnsm2024/1571050912.pdf )

For example, at various points during 2025, I have seen an increase in traffic from such foreign IPs, effectively doing "AI scraper things" - that is, random page accesses with no correlation to the previously visited page(s). It becomes rather apparent with various South American ISPs and Chinese-originating clients when I see a slew of Accept-Language headers reporting zh-CN (or just zh) arriving via those South American IP addresses. Sure makes you go "hmmm, well that isn't right!" - especially when it's hundreds of them flooding the site with requests bearing no resemblance to a normal user's page-to-page navigation (Index -> Forum -> Thread).
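
Not from the logs above, but a rough sketch of how one could automate spotting exactly that mismatch. It assumes a custom nginx log_format that records $http_accept_language, and a local MaxMind GeoLite2 database - both assumptions on my part:

```python
# Sketch: flag requests whose Accept-Language disagrees with the GeoIP
# country of the client IP. Assumes nginx logs lines like
#   <ip> "<accept-language>" "<path>"
# via a custom log_format, and a GeoLite2-Country.mmdb file on disk.
import re
from collections import Counter

import geoip2.database  # pip install geoip2
import geoip2.errors

LOG_LINE = re.compile(r'^(?P<ip>\S+) "(?P<lang>[^"]*)" "(?P<path>[^"]*)"$')

# Very rough expectation: zh-* Accept-Language mostly arrives from these.
EXPECTED_FOR_ZH = {"CN", "TW", "HK", "SG", "MY"}

def scan(log_path: str, mmdb_path: str = "GeoLite2-Country.mmdb") -> Counter:
    suspects = Counter()
    with geoip2.database.Reader(mmdb_path) as reader, open(log_path) as log:
        for line in log:
            m = LOG_LINE.match(line.strip())
            if not m or not m["lang"].lower().startswith("zh"):
                continue
            try:
                country = reader.country(m["ip"]).country.iso_code
            except geoip2.errors.AddressNotFoundError:
                continue
            if country not in EXPECTED_FOR_ZH:
                suspects[m["ip"]] += 1  # zh header, unexpected country
    return suspects

if __name__ == "__main__":
    for ip, hits in scan("access_custom.log").most_common(20):
        print(f"{ip}\t{hits} mismatched requests")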

One can begin applying country-code-based rate limits or outright blocks, but these AI scraper companies/institutes then use other means to evade being limited or cut off from your resources. I have seen AI scrapers use vast swaths of AWS datacenter IP ranges, then move to various smaller cloud providers that allow rapid spin-up/shutdown of instances, each getting its own unique IP address. Now they're seemingly using the most roundabout way possible to grab data: proxies. :(
 
Right. You're playing whack-a-mole with country bans if your server protection isn't smart enough to behaviorally analyze what's going on per IP address, which is technically possible if you do it at the PHP level.
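
As a minimal sketch of the kind of per-IP behavioral signal being described (in Python rather than PHP for brevity, and the window sizes and thresholds are invented): a human tends to move Index -> Forum -> Thread, so each request's Referer usually names a page that same IP fetched recently, while scraper floods mostly don't.

```python
# Sketch: per-IP behavioral scoring based on referrer coherence.
from collections import defaultdict, deque

WINDOW = 50        # remember the last N paths per IP
MIN_REQUESTS = 20  # don't judge an IP on too few requests

recent_paths = defaultdict(lambda: deque(maxlen=WINDOW))
stats = defaultdict(lambda: [0, 0])  # ip -> [coherent, total]

def record(ip: str, path: str, referer_path: str | None) -> None:
    """Feed one request into the per-IP model."""
    if referer_path is not None and referer_path in recent_paths[ip]:
        stats[ip][0] += 1
    stats[ip][1] += 1
    recent_paths[ip].append(path)

def looks_like_scraper(ip: str, max_incoherence: float = 0.9) -> bool:
    """True once almost no request follows from a previously seen page."""
    coherent, total = stats[ip]
    if total < MIN_REQUESTS:
        return False
    return (total - coherent) / total > max_incoherence
```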
 
Be careful, I recently read on the forum that someone blocked all IPs from Singapore and mistakenly included Google's IPs from that location as well.

That's why it's necessary to find the IP block(s) of Google's crawler and whitelist them (there's a sketch of the check below).
Their crawler does not fully adhere to robots.txt and will trip many of the rules you set to try to defeat bots.

I found this out the hard way. Hopefully nobody else has to :)
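
A rough sketch of the verification Google documents - reverse-DNS the IP, then forward-confirm. Google also publishes its crawler ranges as JSON at https://developers.google.com/search/apis/ipranges/googlebot.json if you'd rather match on IP blocks:

```python
# Sketch: verify a claimed Googlebot IP the way Google documents it -
# reverse-DNS the IP, check the host ends in googlebot.com or google.com,
# then forward-resolve that hostname and confirm it maps back to the IP.
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
    except socket.herror:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward confirm
    except socket.gaierror:
        return False
    return ip in addrs

if __name__ == "__main__":
    # Result depends on live DNS; treat as illustrative only.
    print(is_real_googlebot("66.249.66.1"))
```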
 
Had pages and pages of bots trying to DDoS me, all from Amazon, Singapore and China.
I already don't bother with Google indexing because I get stalked by lowlifes who should know better.
 
Seems that both my site and Cloudflare users have seen the bot count double in the last hour or so.
Not happy about it.

Looks like I need to get this next-generation bot protection coded up sooner rather than later.
 
Ultimately, everything was resolved by installing a free anti-bot that operates on a whitelist system (IP, URL, ASN, bot). Anyone not on the list solves the captcha. There are blacklists too, but I haven't used them yet.
I can already say that even the most persistent bots are still stuck at the captcha-solving stage.
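
Not that add-on specifically, but the general shape of a whitelist-first gate like the one described, as a sketch - every list entry below is a placeholder, not the add-on's actual data:

```python
# Sketch of a whitelist-first gate: known-good IPs, ASNs, UAs and URLs
# pass straight through; everyone else gets sent to the captcha.
from ipaddress import ip_address, ip_network

ALLOWED_NETS = [ip_network("66.249.64.0/19")]  # e.g. a verified crawler range
ALLOWED_ASNS = {15169}                         # e.g. Google's ASN
ALLOWED_UAS = ("Googlebot", "bingbot")         # verify these separately!
OPEN_PATHS = ("/robots.txt", "/sitemap.xml")

def needs_captcha(ip: str, asn: int, user_agent: str, path: str) -> bool:
    if path.startswith(OPEN_PATHS):
        return False
    if any(ip_address(ip) in net for net in ALLOWED_NETS):
        return False
    if asn in ALLOWED_ASNS:
        return False
    if any(ua in user_agent for ua in ALLOWED_UAS):
        return False
    return True  # not on any list: solve the captcha first
```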
 
Got the same situation. I've set the whole forum so that it's unreadable without logging in, and the guest log still shows bots reading threads. At peaks, the ratio of legit users to bots is 1:50.

I have no idea how they are being shown as able to read threads they have no access to.
 
Got the same situation. I've set the whole forum so that it's unreadable without logging in, and the guest log still shows bots reading threads. At peaks, the ratio of legit users to bots is 1:50.

I have no idea how they are being shown as able to read threads they have no access to.
If you've turned viewing access off for Guests, limiting it to logged-in users only, then when a Guest subsequently goes to view something, the online list should show that ⚠️ icon.
[screenshot: the online users list showing the ⚠️ icon]

If guests are still able to view threads after you have explicitly disabled access, you need to check which node permissions are allowing/overriding it.
 
Ah, I thought they were actually viewing the content if that icon was there.

In which case they're all hitting viewing errors.

I'll try a mixture of Cloudflare and that, then. Even with everything set to private at the node level, I'm seeing a 1:20 ratio.
 
Just had a thought about these AI-related guest explosions on various websites: most scrapers are after the content on a discussion board... so, where's the content on just about every forum? Threads.

Apart from using a WAF as a site-wide deterrent, does an add-on exist that intentionally serves a captcha of choice (using whatever has been configured under ./admin.php?options/groups/basicBoard/#captcha) to non-logged-in users who aren't friendly indexing bots when they attempt to view threads?

I figure this would be the most user-friendly and straightforward approach to filtering out the mass scraping by AI bots. It saves everyone from having to ban heaps of IP ranges, too.
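
The gist of such a gate, as a hypothetical sketch - the helper functions here are placeholders standing in for whatever the platform actually provides, not XenForo APIs:

```python
# Sketch: serve the board's configured captcha to guests hitting thread
# pages unless they verify as a friendly indexing bot. The helpers passed
# in (is_logged_in, is_verified_crawler, ...) are placeholders.
THREAD_PREFIX = "/threads/"

def gate_thread_view(request, is_logged_in, is_verified_crawler,
                     serve_captcha_page, serve_thread):
    if not request.path.startswith(THREAD_PREFIX):
        return serve_thread(request)    # not a thread: no gate
    if is_logged_in(request):
        return serve_thread(request)    # members browse freely
    if is_verified_crawler(request.ip, request.user_agent):
        return serve_thread(request)    # e.g. DNS-verified Googlebot
    return serve_captcha_page(request)  # everyone else: captcha first
```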

Edit: It seems that someone has already made an add-on for what I was describing: https://xenforo.com/community/resources/thread-view-captcha-ai-scraper-bot-protection.9947/ - the price is iffy, but hopefully it gives a bit of 'peace of mind'.
 
Anubis would be capable of doing this, but it takes server skills and root access to set up.
I want to like Anubis, but I think it's possible to build something smarter that's also less complicated.
Recently I came up with the right database to power something better, so I'll be writing a prototype soon.
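
For context, the Anubis-style approach is essentially a proof-of-work challenge served to the browser. A simplified sketch of the idea - not Anubis's actual implementation, and the difficulty value is made up:

```python
# Simplified sketch of an Anubis-style proof-of-work challenge: the
# server hands the client a random challenge; the client must find a
# nonce so that sha256(challenge + nonce) starts with N zero hex digits.
# Cheap for the server to verify, mildly expensive to compute at scale.
import hashlib
import secrets

DIFFICULTY = 4  # required leading zero hex digits; made-up default

def new_challenge() -> str:
    return secrets.token_hex(16)

def solve(challenge: str) -> int:
    """What the client-side JavaScript would do, shown here in Python."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * DIFFICULTY):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)

if __name__ == "__main__":
    c = new_challenge()
    n = solve(c)         # a real client does this in the browser
    print(verify(c, n))  # the server only ever does this cheap check
```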

Edit: It seems that someone has already made an add-on for what I was describing: https://xenforo.com/community/resources/thread-view-captcha-ai-scraper-bot-protection.9947/ - the price is iffy, but hopefully it gives a bit of 'peace of mind'.

Marked unmaintained after a single release, with no ratings. Makes you wonder how good it is. Big question...
 
Anubis would be capable of doing this, but it takes server skills and root access to set up.
I want to like Anubis, but I think it's possible to build something smarter that's also less complicated.
Recently I came up with the right database to power something better, so I'll be writing a prototype soon.



Marked unmaintained after a single release, with no ratings. Makes you wonder how good it is. Big question...
Anubis seems like a nuclear option, but at least it appears to be modestly configurable. Looks like Anubis will be a decent weekend project to mess about with. One website I manage may actually benefit greatly from it.

Have to wonder if Anubis is going to play nice with the nginx bad bot blocker at: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/

Edit: After spending the past hour checking out the Anubis source code, I'm questioning some of it. I've seen a few IRC references (1, 2, 3) within the code for an HTTP-based filtering system. While there's nothing wrong with reusing old code, some of it just feels "hmmm". The DNSBL support seems limited to DroneBL's RBL and doesn't appear to have been expanded past that - which is a limiting factor if you rely on a DNSBL. Additionally, it would also appear (from what I see, anyway) that Anubis is a bit on the "pay me if you want to edit the templates" plane. Hopefully I'm wrong about that last claim.
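
For anyone curious, a DNSBL check like the DroneBL one is just a reversed-octet DNS lookup; a minimal sketch:

```python
# Sketch: query an IPv4 address against a DNSBL such as DroneBL.
# Convention: reverse the octets, append the zone, do an A lookup.
# Any answer (typically 127.0.0.x) means "listed"; NXDOMAIN means not.
import socket

def dnsbl_listed(ip: str, zone: str = "dnsbl.dronebl.org") -> bool:
    reversed_ip = ".".join(reversed(ip.split(".")))
    query = f"{reversed_ip}.{zone}"
    try:
        socket.gethostbyname(query)  # resolves only if the IP is listed
        return True
    except socket.gaierror:
        return False

if __name__ == "__main__":
    print(dnsbl_listed("127.0.0.2"))  # 127.0.0.2 is the standard test entry
```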
 
Anubis seems like a nuclear option, but at least it appears to be modestly configurable. Looks like Anubis will be a decent weekend project to mess about with. One website I manage may actually benefit greatly from it.

Yes, that's why I don't like it. I wish the common case were that it didn't interrupt a legitimate user's experience often, if at all. But it doesn't seem to work that way.

Edit: After spending the past hour checking out the Anubis source code, I'm questioning some of it. I've seen a few IRC references (1, 2, 3) within the code for an HTTP-based filtering system. While there's nothing wrong with reusing old code, some of it just feels "hmmm". The DNSBL support seems limited to DroneBL's RBL and doesn't appear to have been expanded past that - which is a limiting factor if you rely on a DNSBL.

Thanks for prompting me to look into the source code. It seems to be rich in well-updated banlists. :)

Additionally, it would also appear (from what I see, anyway) that Anubis is a bit on the "pay me if you want to edit the templates" plane. Hopefully I'm wrong about that last claim.

They are asking $500/month for the ability to customize it and not have it greet your users with some weird anime furry character.
I'm sure you could compile your own version, but keeping it up to date is going to be a pain.
 
Read this today...
https://www.techspot.com/news/110432-alibaba-bytedance-moving-ai-training-offshore-bypass-china.html

Alibaba and ByteDance are among the companies routing training jobs for their latest large language models to data centers in countries such as Singapore and Malaysia, according to The Financial Times, citing people with direct knowledge of the deployments. These sources say there has been a steady shift toward offshore clusters since April, when Washington moved to tighten controls on Nvidia's H20 accelerator, a chip designed specifically for the Chinese market.
 