Crazy amount of guests

Thankfully, using Anubis in a blanket-coverage configuration also nabs these residential proxy clients. Good clients also run through the check, which on average takes anywhere between 3 and 15 seconds (depending on machine age).
This is very interesting! If that's true, Anubis seems to be possibly the only tool currently able to filter them out. Could you explain a little more how that works? Does this mean a legitimate client will have to wait between 3 and 15 seconds before gaining access to the URL it requested?
 
Rather impossible for @CTS:


So as far as I know he doesn't have access to .htaccess (pun intended).
Oh, I seem to have skimmed over that when I was catching up on this thread. That's unfortunate that such a feature is not offered on their cloud platform, let alone the ability to apply more extensible rules via .htaccess or the like.


This is very interesting! If that's true, Anubis seems to be possibly the only tool currently able to filter them out. Could you explain a little more how that works? Does this mean a legitimate client will have to wait between 3 and 15 seconds before gaining access to the URL it requested?
It operates in a similar fashion to how Cloudflare's WAF check functions (just without a manual Turnstile UI bit). In the case of Anubis, it does not use any captcha, but operates on the basis of a challenge that has to be solved by a client's machine automatically. There is no user interaction at all, unless the user clicks the "show more details" button.

You have the full capability to set who gets screened via advanced settings such as cookie verification (e.g. user is logged in --> bypass the check), how harshly to screen them (e.g. I want the IP range 10.0.0.0/24 to have a challenge level of 3 and 10.10.0.0/24 a level of 16), which sections of your website are to be screened (e.g. I want /threads/* to be screened by Anubis and nothing else), etc.

And then you can set varying levels of challenge difficulty. Levels 1 to 3 are generally easily solvable by most bots and clients, including these AI scrapers - they take milliseconds to complete. Levels 4 to 5 begin to really slow down clients connecting to the service, though it ultimately depends on the CPU speed of the client machine. The majority of bots, including these AI/LLM scrapers, have yet to get past a level of 5; I've seen very few get past a level of 4, and it's very uncommon. Once you get into levels of 6 and higher, a challenge can generally take up to a minute to solve - which is not ideal for legitimate clients. A level of 16 effectively becomes 'impossible' to solve - excellent to use as a shun list.
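To make the difficulty scaling concrete, here's a rough sketch of how an Anubis-style SHA-256 proof-of-work challenge behaves. This is not Anubis's actual code - it just assumes "difficulty" means the number of leading zero hex digits required, so expected work grows roughly 16x per level while the server verifies the answer with a single hash:

Code:
import hashlib
import secrets
import time

def solve_challenge(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) starts with
    `difficulty` zero hex digits; expected attempts are about 16**difficulty."""
    target = "0" * difficulty
    nonce = 0
    while not hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith(target):
        nonce += 1
    return nonce

challenge = secrets.token_hex(16)  # would normally be issued by the server
for difficulty in range(1, 6):
    start = time.perf_counter()
    nonce = solve_challenge(challenge, difficulty)
    print(f"difficulty {difficulty}: nonce {nonce} found in {time.perf_counter() - start:.3f}s")

That asymmetry is also why a very high level works as a shun list: a level of 16 under this scheme would demand on the order of 16^16 hashes from the client, while costing the server next to nothing to verify.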

My only worry is that if Anubis becomes too widely used, these AI/LLM botnets will be configured to wait out the challenge and continue their ingestion hell.
 
Actually, this is a predicament our website is facing: too many AI crawlers are frantically scraping website content for training. We've also been under this kind of attack. We have only dozens of real visitors, but over 10,000 bots, and that number is still growing. Another of our websites, with over 300,000 'visitors,' is simply unmanageable. In the end, we used AI to write a small altcha verification program, which successfully blocked these crawlers.
 
Yep, Cloudflare, fail2ban, or something equivalent is now a prerequisite for running a website.

Not a single one of the 32 servers I manage is free of this problem, in addition to lots of people probing for vulnerabilities and trying to break passwords.

So is it the case that everyone has to click a captcha?
I personally work hard to avoid that, because it's not user-friendly.

I also think that, in the future, if AI can figure out how to generate a working captcha system, it can definitely figure out how to solve one.

What you probably want is more like a proof-of-work system such as Anubis. The idea is to increase the computational cost on the attacker's side to the point that they are deterred from scraping the site.
 
What you probably want is more like a proof-of-work system such as Anubis.
As a side note, someone on GitHub posted this as a proof of concept:


It's a way to trip up scraper bots that get past other filtering/blocking, using CSS rather than JavaScript (so it would still work in cases where a human visitor has JS disabled).

I found this linked under an issue posted for Anubis. And having read about what Anubis does...for the bots doing an end run around our other filtering/blocking, I like it. I just wish I had a spare server I could test this on. I don't really want to try this on production servers. And I also wonder how much of a load it would put on the system when there are typically 3,000+ legit users visiting (typically 33% logged in, the other 66% human guests).
 
Interesting.
I think a lot of scrapers are using real browsers via something like chromedriver. The Chinese bot farms tend to use mobile phones because they're cheap. I wonder how effective it is.
 
Not sure if this was identified already, but I found this in my logs previously. This mob, Bucklog, was one of the offenders, literally hitting me thousands of times. They don't hit your domain, they hit your server IP directly, so Cloudflare or AS blocking won't work. You have to block the CIDR ranges for their two servers in your server firewall, so it drops those IPs immediately and doesn't consume your PHP/DB resources.

170.39.217.0/24
185.177.72.0/24

It took me a while, because I had the AS blocked at Cloudflare, but they were still hitting the server. So then I thought they were routing through a non-proxied (grey-cloud) sub-domain on Cloudflare, but that wasn't it either. Basically, after sifting through lots of logs, I found they were hitting the server IP directly, and they are known for being nasty.
  • Owner/ASN: Bucklog SARL (AS211590), hosted in France (often listed as Paris area or Vélizy-Villacoublay).
  • Reputation: This entire /24 (185.177.72.0/24) is heavily flagged across threat intel sources (AbuseIPDB, CleanTalk, CrowdSec, SOCRadar, etc.) for spam, brute-force attacks, hacking attempts (e.g., probing /info.php.bak, common vuln paths), reconnaissance/scanning (e.g., Next.js metadata probing), and general malicious activity. It's been active in reports since mid-2025 and continues into 2026. High confidence of abuse—many sources treat the whole subnet as noisy/malicious background noise or bot/scanner traffic.
I had more than this doing it, but this was the main offender, showing thousands of visitors on my site at times. They were hitting me, then stopping, then hitting me, then stopping. The CPU and DB loads were going insane. Again, it was not just this one; I had others in the thousands as well, but some of them came via the domain, so I could block them at the edge in CF, and some were doing the same thing direct to the server IP and had to be blocked at the server firewall instead so they'd be dropped immediately.
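For anyone wanting to confirm how hard those ranges are hitting them before adding the firewall drops, a quick sketch like this (assuming a combined-format access log at a hypothetical path) will count hits per offending CIDR:

Code:
import ipaddress
from collections import Counter

# The two Bucklog ranges mentioned above; extend the list as needed.
BAD_NETWORKS = [ipaddress.ip_network(c) for c in ("170.39.217.0/24", "185.177.72.0/24")]

def offending_hits(logfile):
    """Count access-log hits whose source IP falls inside a listed CIDR."""
    counts = Counter()
    with open(logfile) as fh:
        for line in fh:
            ip_str = line.split(" ", 1)[0]  # combined log format: client IP is the first field
            try:
                ip = ipaddress.ip_address(ip_str)
            except ValueError:
                continue  # skip malformed lines
            for net in BAD_NETWORKS:
                if ip in net:
                    counts[str(net)] += 1
                    break
    return counts

for net, hits in offending_hits("/var/log/nginx/access.log").items():  # hypothetical path
    print(f"{net}: {hits} hits")

The actual drop still belongs in the server firewall (ipset/nftables or similar) so the packets never reach PHP or the database.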
 
It took me a while, because I had the AS blocked at Cloudflare, but they were still hitting the server. So then I thought they were routing through a non-proxied (grey-cloud) sub-domain on Cloudflare, but that wasn't it either. Basically, after sifting through lots of logs, I found they were hitting the server IP directly
With our clients who want to use Cloudflare, we often limit connectivity to just Cloudflare's IP ranges (and perhaps any static client ones) and drop everything else at the firewall, or do the same in our Nginx layer (using ngx_http_geo_module) if they need something more nuanced. As you say, it's well worth killing this traffic before it hits any seriously expensive computation layers.
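If it helps anyone, here's a rough sketch of that allowlist idea. Cloudflare publishes its edge ranges as plain text at https://www.cloudflare.com/ips-v4 and /ips-v6; the generated nft commands are just illustrative and assume an existing inet filter/input chain whose default policy drops web traffic, so adjust for your own setup:

Code:
import urllib.request

# Cloudflare's published edge IP ranges (plain text, one CIDR per line).
SOURCES = ("https://www.cloudflare.com/ips-v4", "https://www.cloudflare.com/ips-v6")

def cloudflare_ranges():
    ranges = []
    for url in SOURCES:
        with urllib.request.urlopen(url, timeout=10) as resp:
            ranges += [line.strip() for line in resp.read().decode().splitlines() if line.strip()]
    return ranges

# Print allow rules for 80/443 from Cloudflare only; review before applying anything.
for cidr in cloudflare_ranges():
    family = "ip6" if ":" in cidr else "ip"
    print(f"nft add rule inet filter input {family} saddr {cidr} tcp dport {{ 80, 443 }} accept")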

To be honest, other than a big burst of scraping from Vietnam/China/Singapore a week or two back, it's been much quieter - but we did introduce a bit more caching, so it might just be that the load isn't high enough to warrant looking into.
 
Bad news: we should be prepared for a huge wave of a new category of AI-powered scraping bots and spambots in the possibly very near future. Background: two weeks ago an open source project named OpenClaw started to get a lot of hype within the AI nerd scene. It was first released in Nov. last year, got renamed a couple of times, and finally, at the beginning of Feb '26, suddenly got a lot of traction. It is basically a kind of orchestrator that can work with many different AI models and integrates them into our digital life by automating a lot of tasks through AI agents that are powered by add-ons, so-called skills. The idea behind it is to act as an autonomous digital assistant - very much as if you had a human personal assistant. The difference to AI as it was commonly used up until now is (somewhat oversimplified) fourfold:

1. It makes not only AI but automated AI workflows easy and simple to use.
2. It can run on basically any device, locally as well as on a VPS or similar, as OpenClaw itself is just an orchestrator taking up just a couple of MB of space. With a basic Mac Mini you can even run the whole model (e.g. Ollama) locally.
3. It has - in contrast to the usual use of ChatGPT and such - a memory of what it did and what its basic character is, and this way it is able to work in iterations and improve itself.
4. It is able to communicate actively via things like mail, Telegram, WhatsApp and even voice, for both input and output.

There is way more to it, but these are the basic relevant elements. Within the last two weeks the thing has been improved massively - thousands of commits to the repository and thousands of new skills added to the list of options. It is the fastest-growing and fastest-developing AI project of all time.

[attached screenshot]

People quickly claimed to run their whole company on it and to replace (or not need) loads of staff because of that. Obviously nobody knows whether all that is true, and there's a lot of hype and made-up claims, but it is clear that this thing already seems to be a milestone for the use of AI. The author (a guy from Austria who has coded the thing alone with AI) announced yesterday evening that he will work for OpenAI in the future and that OpenClaw will be transferred into a foundation (and shall stay open source).

The bad news for us is: the agent model of OpenClaw massively depends on the ability to access all kinds of information on the web. It is able to use all possible means of authentication, can act on behalf of its "owner" and is even able to create and use accounts on its own, autonomously. It has the need and the ability to scrape information from the web and to overcome all sorts of bot detection. As this is needed for its purpose, this gets developed further constantly.

This alone would be bad enough - but it gets worse: as is to be expected, shady marketing people quickly jumped on the train to (mis)use the tool for their practices. They claim, for example, to have set up their farms to create and use Reddit accounts automatically and post their messages - with a claimed rejection rate of just 0.5%.

[attached screenshot]


[attached screenshot]
In the meantime, these abilities can be bought. And - as is to be expected - there's already a skill to use networks of residential proxies:

[attached screenshot]

While the whole thing is still in its early stages, the speed of development and adoption seems crazy. So it is probably only a question of time until this tool hits forums as well, to a massive extent - by normal users as well as by spammers and scrapers (and those will be first). Plus one can assume that this is just the beginning, and that more and more tools of this category will be developed in the near future, further lowering the skill needed to use something like this. In fact, yesterday an AI company announced that it has integrated OpenClaw for its customers to the point that it simply runs in a browser tab - no setup by the user needed at all.

So we had better prepare for the wave that is probably going to hit us any time soon.
 
As a side note, someone on GitHub posted this as a proof of concept:


It's a way to trip up scraper bots that get past other filtering/blocking, using CSS rather than JavaScript (so it would still work in cases where a human visitor has JS disabled).

I found this linked under an issue posted for Anubis. And having read about what Anubis does...for the bots doing an end run around our other filtering/blocking, I like it. I just wish I had a spare server I could test this on. I don't really want to try this on production servers. And I also wonder how much of a load it would put on the system when there are typically 3,000+ legit users visiting (typically 33% logged in, the other 66% human guests).
The CSSWAF approach has a serious issue: https://github.com/yzqzss/csswaf/issues/5 - a bot runner can acquire a cookie and apply it to other clients/bot machines, completely evading the process. That's a big no-go when it comes to WAFs.
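For what it's worth, the usual mitigation for that class of replay (not something csswaf does today - just a generic sketch with a hypothetical issue/verify pair) is to bind the pass-cookie to the client that actually solved the check, e.g. an HMAC over the client IP plus an expiry, so a copied cookie fails verification elsewhere:

Code:
import hashlib
import hmac
import time

SECRET = b"server-side secret, never sent to clients"  # hypothetical key

def issue_token(client_ip, ttl=3600):
    """Mint a pass-token bound to the solving client's IP, with an expiry."""
    expires = int(time.time()) + ttl
    sig = hmac.new(SECRET, f"{client_ip}|{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"

def verify_token(token, client_ip):
    """Reject tokens that are expired or presented from a different IP."""
    try:
        expires_str, sig = token.split(".", 1)
        expires = int(expires_str)
    except ValueError:
        return False
    if expires < time.time():
        return False
    expected = hmac.new(SECRET, f"{client_ip}|{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

# A cookie minted for 203.0.113.7 will not verify when replayed from 198.51.100.9.

Residential proxies blunt even that, since the replaying bot can present a matching IP, but it at least kills the "solve once, share the cookie everywhere" pattern described in that issue.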

This alone would be bad enough - but it gets worse: as is to be expected, shady marketing people quickly jumped on the train to (mis)use the tool for their practices. They claim, for example, to have set up their farms to create and use Reddit accounts automatically and post their messages - with a claimed rejection rate of just 0.5%.
You said something that is legitimately music to my ears, and if it wasn't obvious: What happens when a social media platform is so severely astroturfed, and so severely spammed-out? The platform begins to become a toxic hellhole, and people go elsewhere.

Where is that, you might ask? It could be a niche community website or another social media platform.

Also worth mentioning - you shouldn't allow the forum to be accessed directly via the server IP address. Make a separate vhost for that which serves a 503/404, which will consume fewer resources.
Agreed. In this day and age, one should never permit direct IP access to a website. Virtual hosts (apache2) and server {} blocks (nginx) exist for a reason, and they need to be used for that very purpose. On my nginx installations, I just throw error 444 at direct-IP and non-vhost accesses. There's no need for a legitimate client to be snooping around.
 
You said something that is legitimately music to my ears, and if it wasn't obvious: What happens when a social media platform is so severely astroturfed, and so severely spammed-out? The platform begins to become a toxic hellhole, and people go elsewhere.

Where is that, you might ask? It could be a niche community website or another social media platform.
That would be the most positive outcome one could think of. However, I do not think this is realistic, and if it happens, clearly not long term. These bots never sleep; they don't get stopped by captchas, two-factor, or any currently known mechanism. They write like humans and behave like humans, so it is hard to detect them at all, and they are able to learn, so they adapt to whatever mechanisms you invent - that is in fact part of their purpose. Neither they nor "their humans" have any morals or would accept any boundaries or TOS. Certainly not the spammers, but also not the "normal" users who use these bots - and there is absolutely no reason why they should only focus on big platforms. They can (and probably will) infiltrate any platform regardless of size or topic as long as it promises them a benefit, and the bar for being considered beneficial is pretty low, since running such a bot costs basically nothing and resources scale up endlessly.

Until now, dealing with scrapers has basically meant dealing with guests (as is the topic of this thread), and that has become annoying enough thanks to residential proxies. This will change: in the future, more and more scrapers will likely act as members. You cannot identify them beforehand, and they are able to sign up faster than you are able to identify and ban them. A whole new level in the endless game of whack-a-mole.

That they are also able to spam your forum via the same mechanisms is just the icing on the cake. So I am way less optimistic than you, sadly.
 
They write like humans and behave like humans, so it is hard to detect them at all and they are able to learn,
Playing devil's advocate for a moment - are they a problem then? If they appear indistinguishable from a legitimate user, presumably advertisers won't know the difference (although presumably the click-through rate will be zero?). Now granted, it's not going to make for an enjoyable forum experience (although maybe it'll be high-quality AI slop), but it depends on whether you're running a site for enjoyment or pure profit, I guess!?

I'm hoping that, at least for a while, it'll be possible to spot the bots, given forums are a little more discussion-focused than some social media, which is a bit more "here is something interesting" with everyone basically saying yea or nay to it.

I do think it'd be interesting to know if the XF team are looking at any ways of exposing more metrics and data about users to help identify those who might be legitimate and those who are automated. Common times online, posting frequency - all that kind of data strikes me as something that, whilst perhaps a bit invasive, would be useful to study.
 
Also worth mentioning - you shouldn't allow the forum to be accessed directly via the server IP address. Make a separate vhost for that which serves a 503/404, which will consume fewer resources.
It's a combination: IP-only traffic hitting the server and driving up load, combined with others hitting the forum, which can be managed. Combined, it's painful. The forum is not exposed via IP-only access.
 
Bad news: we should be prepared for a huge wave of a new category of AI-powered scraping bots and spambots in the possibly very near future. Background: two weeks ago an open source project named OpenClaw started to get a lot of hype within the AI nerd scene. It was first released in Nov. last year, got renamed a couple of times, and finally, at the beginning of Feb '26, suddenly got a lot of traction. It is basically a kind of orchestrator that can work with many different AI models and integrates them into our digital life by automating a lot of tasks through AI agents that are powered by add-ons, so-called skills. The idea behind it is to act as an autonomous digital assistant - very much as if you had a human personal assistant. The difference to AI as it was commonly used up until now is (somewhat oversimplified) fourfold:
Hopefully Cloudflare quickly finds a pattern in its requests that it can block.
 
I do think it'd be interesting to know if the XF team are looking at any ways of exposing more metrics and data about users to help identify those who might be legitimate and those who are automated. Common times online, posting frequency - all that kind of data strikes me as something that, whilst perhaps a bit invasive, would be useful to study.
Basically, this would be user profiling. I'd assume the data is already there, and with an add-on it would be usable as well. With an AI it would probably be relatively easy to detect anomalies. However, I am not keen on spying on my users and would prefer to avoid it.
 
I know; being suitably ancient, I'd rather not know whether my users are humans or canines! Much like all the age-verification mess, alas, it seems some kind of "solution" is perhaps required to keep things looking presentable. There's always someone around to spoil everyone's fun, isn't there! Sigh.
 
It would appear that there's a new round of ResiProxies + datacenters being used by AI training bots. However, this one is very different from the previous iterations I have observed.

Apart from another large slew of IP addresses hitting the server, I now see falsified search-engine referrals from Google, Baidu, and Bing, among many others, using Traditional and Simplified Chinese terms that do not match anything in the content I host. A small sample of them:

Code:
https://www.bing.com/search?q=%E7%9A%84&form=QBLH&sp=-1&lq=0&pq=de&sc=12-2&qs=n&sk=&cvid=828BF07B60C944A8856F8F9916A6AA65
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=%E7%9A%84&fenlei=256&rsv_pq=0xd095a50700039a90&rsv_t=c807Vf7KGBIyvzMNvEzbwaCY17SN%2F4bIrJpIMwjQO1dHFj1e5ZLhlvyJ37wI&rqlang=en&rsv_dl=tb&rsv_enter=1&rsv_sug3=3&rsv_sug1=3&rsv_sug7=100&rsv_btype=i&prefixsug=%25E7%259A%2584&rsp=2&inputT=917&rsv_sug4=917
https://www.google.com/search?q=%E7%9A%84&sca_esv=6a48bb689016dc57&source=hp&ei=eOIuaYS-DIn1kPIP_uaI4Q4&iflsig=AOw8s4IAAAAAaS7wiKW94PVnXwhvBqDqlzxVsvfCGBMC&ved=0ahUKEwiE2pq5-p6RAxWJOkQIHX4zIuwQ4dUDCBg&uact=5&oq=%E7%9A%84&gs_lp=Egdnd3Mtd2l6IgPnmoQyERAuGIAEGLEDGNEDGIMBGMcBMgsQABiABBixAxiDATIOEAAYgAQYsQMYgwEYigUyCxAAGIAEGLEDGIMBMggQABiABBixAzILEAAYgAQYsQMYgwEyCxAAGIAEGLEDGIMBMggQABiABBixAzILEAAYgAQYsQMYgwEyERAuGIAEGLEDGNEDGIMBGMcBSKsEUABY3gJwAHgAkAEAmAH2AaAB9gGqAQMyLTG4AQPIAQD4AQGYAgGgAvwBmAMAkgcDMi0xoAeOB7IHAzItMbgH_AHCBwMyLTHIBwQ&sclient=gws-wiz
https://www.google.com/search?q=%E7%9A%84&sca_esv=6a48bb689016dc57&source=hp&ei=eOIuaYS-DIn1kPIP_uaI4Q4&iflsig=AOw8s4IAAAAAaS7wiKW94PVnXwhvBqDqlzxVsvfCGBMC&ved=0ahUKEwiE2pq5-p6RAxWJOkQIHX4zIuwQ4dUDCBg&uact=5&oq=%E7%9A%84&gs_lp=Egdnd3Mtd2l6IgPnmoQyERAuGIAEGLEDGNEDGIMBGMcBMgsQABiABBixAxiDATIOEAAYgAQYsQMYgwEYigUyCxAAGIAEGLEDGIMBMggQABiABBixAzILEAAYgAQYsQMYgwEyCxAAGIAEGLEDGIMBMggQABiABBixAzILEAAYgAQYsQMYgwEyERAuGIAEGLEDGNEDGIMBGMcBSKsEUABY3gJwAHgAkAEAmAH2AaAB9gGqAQMyLTG4AQPIAQD4AQGYAgGgAvwBmAMAkgcDMi0xoAeOB7IHAzItMbgH_AHCBwMyLTHIBwQ&sclient=gws-wiz
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=%E7%9A%84&fenlei=256&rsv_pq=0xd095a50700039a90&rsv_t=c807Vf7KGBIyvzMNvEzbwaCY17SN%2F4bIrJpIMwjQO1dHFj1e5ZLhlvyJ37wI&rqlang=en&rsv_dl=tb&rsv_enter=1&rsv_sug3=3&rsv_sug1=3&rsv_sug7=100&rsv_btype=i&prefixsug=%25E7%259A%2584&rsp=2&inputT=917&rsv_sug4=917
Rolling in with the UA of Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36

Additionally, Anubis seems to have more than 2,500 Redis-stored keys (via a DBSIZE check) active in the past 30 minutes as of this writing. This go-around, though, there's an actual failure rate on Anubis: it's a bit less than 90% effective at filtering. A good portion are being rate limited because of persistent hammering of the web server (error 429), which is what caught my attention.

Most common UA getting past Anubis: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/141.0.7390.37 Safari/537.36 - and that UA is now filtered at the nginx-bad-bot-blocker level.

Most common accept-lang is: zh-CN,zh;q=0.9,en;q=0.8. I'm inches away from adding a severe rate limit to clients with that sort of accept-lang bit at this point in time.

It began at 10:33 AM, and at 11:11 AM Central US time the AI-scraper botnet stopped slamming the server.


Is it possible it's that new AI LLM thingy going around as mentioned above?

Edit: Actually, now I wonder if I can get Anubis to read the client's accept-lang and force a much stricter challenge...
 
Damn, you had to ban Linux users huh :(

Is it possible it's that new AI LLM thingy going around as mentioned above?

If they are using real browsers, then yes, they can get around almost anything. The more people who use Anubis, the higher the CPU tax will be on these bot farms, and we could theoretically make the job too expensive if tons of people use something like Anubis - so it's a good thing your site is making them pay the tax. I think the only way it could work is if everyone has to pay the tax, though.

We may join you in helping deliver 1 of the 1000 tiny cuts needed soon.

[attached screenshot]

I notice they are getting more clever and more distributed too.
They no longer drive up the number of guests on my site as much - less volume and more frequent rotation.
I have yet to go in there and look for more subnets to ban. I'm thinking of writing something that automatically bans those...
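In case it's useful as a starting point, here's a rough, untested sketch of that idea - it assumes a combined-format access log at a hypothetical path and only prints candidate /24s rather than banning anything, so the output can be reviewed before it goes anywhere near a firewall:

Code:
import ipaddress
from collections import Counter

THRESHOLD = 500  # hits per /24 before it's considered a ban candidate; tune to taste

def ban_candidates(logfile):
    """Aggregate access-log hits per IPv4 /24 and return the noisiest subnets."""
    per_subnet = Counter()
    with open(logfile) as fh:
        for line in fh:
            ip_str = line.split(" ", 1)[0]  # combined log format: client IP is the first field
            try:
                ip = ipaddress.ip_address(ip_str)
            except ValueError:
                continue  # skip malformed lines
            if ip.version == 4:
                subnet = ipaddress.ip_network(f"{ip}/24", strict=False)
                per_subnet[str(subnet)] += 1
    return [(net, hits) for net, hits in per_subnet.most_common() if hits >= THRESHOLD]

for net, hits in ban_candidates("/var/log/nginx/access.log"):  # hypothetical path
    print(f"{net}  {hits} hits")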
 