Crazy amount of guests

Holy crap, last couple of days/weeks I had about 8k visitors (scrapers) but today that was 32+k :confused:
That's quite a bunch! I usually have 2-2.5k, of which about 1k come through. Yesterday evening, however, it was 4.8k, so once more a bit of a wave seems to be going on. I don't care, as still 1k came through and the rest was deflected by ASN and country blocking and the checks of proxycheck.io.

The forum here is also a pretty good indicator most of the time. At the moment 31 users and 15k guests:

Bildschirmfoto 2026-04-10 um 16.20.25.webp
 
You need ASN blocks / a better logic engine to do better.
The amount of bots has definitely ticked up in the last week or so and i've had to periodically go in and block ASNs/some networks


I thought of something recently.
People using USA residential proxies are probably located in Asia.
It's possible that the requests to these residential proxies complete in a much longer time than usual vs what you'd expect in the USA.

Apache or Nginx could log the total transmit time and you could write something to periodically analyze those logs and tally up what % of the times went slow ( just in case someone's on wifi which periodically glitches out ) vs went about the average for USA connections and start profiling from there.
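
A minimal sketch of that tally in Python, under a few assumptions: the access log uses a custom format where each line starts with the client IP and ends with the request time in seconds (Nginx exposes this as `$request_time`), and the thresholds and sample lines are made-up values for illustration only.

```python
from collections import defaultdict

# All three thresholds are assumptions for illustration; tune them on real traffic.
SLOW_SECONDS = 1.0   # a request at or above this duration counts as "slow"
MIN_REQUESTS = 5     # skip IPs with too few requests to judge
SLOW_SHARE = 0.8     # flag an IP when at least this share of its requests was slow


def slow_share_by_ip(lines):
    """Map each IP to (request_count, share_of_slow_requests).

    Expects lines shaped like '<ip> ... <request_time_seconds>'.
    """
    totals = defaultdict(int)
    slow = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            seconds = float(fields[-1])
        except ValueError:
            continue  # last field is not a number, so skip the line
        ip = fields[0]
        totals[ip] += 1
        if seconds >= SLOW_SECONDS:
            slow[ip] += 1
    return {ip: (count, slow[ip] / count) for ip, count in totals.items()}


def consistently_slow_ips(lines):
    """Return IPs that are slow most of the time, not just during one glitch."""
    stats = slow_share_by_ip(lines)
    return sorted(
        ip for ip, (count, share) in stats.items()
        if count >= MIN_REQUESTS and share >= SLOW_SHARE
    )


# Hypothetical log lines: one IP is always slow, the other has a single hiccup.
sample_log = (
    ['198.51.100.7 "GET /threads/1" 200 1.84'] * 5
    + ['203.0.113.9 "GET /threads/1" 200 0.12'] * 4
    + ['203.0.113.9 "GET /threads/2" 200 2.40']
)

print(consistently_slow_ips(sample_log))
```

Looking at the share of slow requests per IP, rather than a single slow request, is what keeps the occasional wifi glitch from being flagged.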

If not, there is probably some latency fingerprinting that could be done but that's more complex to pull off.

I'm not pissed off enough to do this computer science experiment yet.
 
People using USA residential proxies are probably located in Asia.
I wouldn't buy that automatically. Do you have any proof or indication of that, or what brings you to that idea?

It's possible that the requests to these residential proxies complete in a much longer time than usual vs what you'd expect in the USA.

Apache or Nginx could log the total transmit time and you could write something to periodically analyze those logs and tally up what % of the times went slow ( just in case someone's on wifi which periodically glitches out ) vs went about the average for USA connections and start profiling from there.

If not, there is probably some latency fingerprinting that could be done but that's more complex to pull off.
I've read about this approach, not specifically regarding Asia as the source behind residential proxy requests but in general, as residential proxies typically have a higher latency than direct connections. Cloudflare already wrote about that in June 2024 and I linked that article further up this thread. But they also said that this alone would lead to a high number of false positives:

we start by comparing direct vs proxied requests and looking for network level discrepancies. Revisiting Figure 1, we notice that a request routed through residential proxies (red dotted line) has to traverse through multiple hops before reaching the target, which affects the network latency of the request.

Based on this observation alone, we are able to characterize residential proxy traffic with a high true positive rate (i.e., all residential proxy requests have high network latency). While we were able to replicate this in our lab environment, we quickly realized that at the scale of the Internet, we run into numerous exceptions with false positive detections (i.e., non-residential proxy traffic with high latency). For instance, countries and regions that predominantly use satellite Internet would exhibit a high network latency for the majority of their requests due to the use of performance enhancing proxies.
 
That's quite a bunch! I usually have 2-2.5k, of which about 1k come through. Yesterday evening, however, it was 4.8k, so once more a bit of a wave seems to be going on. I don't care, as still 1k came through and the rest was deflected by ASN and country blocking and the checks of proxycheck.io.
Seems to be getting even worse. In the last 24h there were way more, and I had ~600 coming through that shouldn't have.

Bildschirmfoto 2026-04-12 um 16.28.37.webp

Among the well-known ASNs, AS14061 (DigitalOcean) was way nastier than usual (alongside the usual suspects and two or three new ones). proxycheck.io still struggles with residential proxies and lets them through, as you can see from the peaks in the stats - these are clearly not normal visitors:

Bildschirmfoto 2026-04-12 um 16.29.41.webp

Peak countries today:

Bildschirmfoto 2026-04-12 um 16.28.50.webp

Mexico has become a favorite of the scrapers over the last few weeks and constantly battles Iraq for a position in the top 5.

If I look at the "clean" statistics of proxycheck.io for the last 30 days I end up with 75% green...
Bildschirmfoto 2026-04-12 um 16.38.15.webp

whereas IP Threat Monitor, which uses my manual ASN, IP and country blocks along with the results of proxycheck.io, states pretty much the opposite: 85% of IP addresses blocked.

Bildschirmfoto 2026-04-12 um 16.37.57.webp

So while proxycheck.io is a good foundation, its detection mechanisms are far from sufficient. Given the positive experiences that @Anthony Parsons has had with Cloudflare challenges, I may finally bite the bullet and give them a try for guest access as well.
 
What fun, look at all those lovely new visitors ... I'm sure plenty will be registering their accounts any moment now.
Total: 45,345 (members: 58, guests: 45,287)
At least the site is still nice and responsive (so points for the XF codebase*), but it does grind my gears. All residential proxies this time round as well so much harder to just say "goodbye" to rubbish networks.

* Well and running it on decent hardware I suppose :)
 
I finally solved most of this issue by blocking (challenging) a whole bunch of countries in Cloudflare.

(That is quite easy on our site because 98% of our users are Dutch/Belgian.)
 
Hey smallwheels, thanks for the tip on proxycheck.io. I didn't know there were services like that.

I had a battalion of residential proxies hit one of the store sites i manage on 04-02, and they were all USA.
I ran 40 IP addresses i knew to be fraudulent through proxycheck.io: most were recently delisted, a few are currently listed as active, and 2 were not listed.

That's a pretty good ratio of identification!

So i ran 40 known legit IPs and only found one that had been delisted recently.

This has a high enough false positive rate from that small sample that i could only use it in combination with some other technology.
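
The numbers from that sample can be worked through in a short Python sketch. It assumes that a "recently delisted" entry counts as a hit, and it adds a Wilson score interval (a standard approximation that is not part of the original test) to show how wide the uncertainty is with only 40 IPs per group.

```python
import math


def wilson_interval(hits, total, z=1.96):
    """Approximate 95% Wilson score interval for the proportion hits/total."""
    p = hits / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Counts taken from the post; "delisted recently" is assumed to count as a hit.
fraud_total, fraud_not_listed = 40, 2
legit_total, legit_flagged = 40, 1

detection_rate = (fraud_total - fraud_not_listed) / fraud_total
false_positive_rate = legit_flagged / legit_total
fp_low, fp_high = wilson_interval(legit_flagged, legit_total)

print(f"detection rate:      {detection_rate:.1%}")
print(f"false positive rate: {false_positive_rate:.1%}")
print(f"plausible FP range:  {fp_low:.1%} to {fp_high:.1%}")
```

One flagged IP out of 40 is 2.5%, but with a sample this small the upper end of the interval stretches past 10%, which is a reason to combine the result with other signals before blocking on it.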

Proxycheck is awfully cheap for the bandwidth it saves, i'm going to try rolling it into a store protection algo that considers this and a bunch of other factors. This would have stopped most of the madness last time.


How are you hooking it up to your firewall mechanism to stop the madness?


Thank you for the backstory about Cloudflare's research. I have not personally done any of these tests, but i know a few things from experience about high latency connections:
  • they take a very long time to ramp up download speed
  • as their line is saturated, the effect of latency to the same host can get very exaggerated - i forget what this effect is called - but this can be used to benchmark latency.
  • there may be other ways to fingerprint, i've not done extensive research yet
  • if you can differentiate them from a mobile or satellite provider using some paid APIs, you may have invented one piece of proxycheck.io

But i think my store client and form client can afford proxycheck.io, i can avoid rocket science this time
 
This also helps.


 
Hey smallwheels, thanks for the tip on proxycheck.io. I didn't know there were services like that.
You have to thank @Osman for it - it is integrated in his IP Threat Monitor add-on and that's where I stumbled upon it. There are a couple of those IP reputation services, but most of them are vastly more expensive. I don't know whether there is a massive difference in detection rates.
This has a high enough false positive rate from that small sample that i could only use it in combination with some other technology.
I didn't see false positives, but a lot of false negatives. As shown in my post, proxycheck.io flagged about a quarter of all requesting IPs in the last 30 days, whereas, together with my other mechanisms, I blocked about 85% of the requesting IPs. So clearly it only works together with other mechanisms.
How are you hooking it up to your firewall mechanism to stop the madness?
It is fully integrated as a core part of the IP Threat Monitor add-on. Basically every requesting IP is sent to the API to be checked, except those that run directly into a country block. Country blocking is done locally via the MaxMind DB to save on API requests. Also, IPs that have already been blacklisted some time in the past stay blacklisted and are possibly not checked again.
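
The order of checks described here can be sketched in Python roughly as follows. This is not the add-on's actual code: `IpGate`, `country_of` and `api_is_threat` are hypothetical names, and the two lookups are injected as plain functions standing in for the local country database and the reputation API.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class IpGate:
    country_of: Callable[[str], str]      # stand-in for a local country DB lookup
    api_is_threat: Callable[[str], bool]  # stand-in for the remote reputation API
    blocked_countries: Set[str]
    blacklist: Set[str] = field(default_factory=set)
    api_calls: int = 0

    def decide(self, ip: str) -> str:
        # 1. Local country block first, so these IPs never cost an API request.
        if self.country_of(ip) in self.blocked_countries:
            return "blocked:country"
        # 2. Previously blacklisted IPs stay blocked without a new check.
        if ip in self.blacklist:
            return "blocked:blacklist"
        # 3. Everything else is sent to the reputation API.
        self.api_calls += 1
        if self.api_is_threat(ip):
            self.blacklist.add(ip)
            return "blocked:api"
        return "allowed"


# Hypothetical lookups using documentation IP ranges and a fake country code.
gate = IpGate(
    country_of=lambda ip: "XX" if ip == "192.0.2.1" else "NL",
    api_is_threat=lambda ip: ip == "198.51.100.5",
    blocked_countries={"XX"},
)

for ip in ["192.0.2.1", "198.51.100.5", "198.51.100.5", "203.0.113.8"]:
    print(ip, gate.decide(ip))
print("API calls:", gate.api_calls)
```

Putting the two local checks in front means only unknown IPs from allowed countries use up API quota.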

So to some degree it is even good that not so many residential proxies are detected, as IP Threat Monitor (at least as I understand it) does not release blacklisted IPs (but does release temporarily blocked IPs that ran into the rate limiting). Which means: over time one will overblock IPs that were once used by a residential proxy. It may be that the reason lies within proxycheck.io, as the default set of status codes does not seem to distinguish between a residential proxy and other threats.
One can however create a bunch of rules within proxycheck.io on their webpage and this way get more granular API answers, making it possible to deal differently with IPs depending on the answer provided. I haven't played around with this so far, as making use of it would mean fiddling with the code of IP Threat Monitor, which I have no interest in doing.
 
A quick update: Once more the scrapers are heavily rotating the datacenter providers they use. Today I see a massive amount of requests coming from loads of IPs from AS26548 (PureVoltage Hosting Inc.), so it's probably worth blocking that ASN. I had them in my blocking list already, as they had shown up in the past, but not frequently and never in the volume seen today.
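
A rough Python sketch of how such a rotation could be spotted automatically. It assumes each request has already been mapped to an ASN number elsewhere; the function name `asn_spikes`, the thresholds and all counts are invented for illustration.

```python
from collections import Counter


def asn_spikes(todays_asns, baseline, factor=10, min_requests=100):
    """Return {asn: count} for ASNs whose request count today is far above baseline.

    todays_asns: iterable of ASN numbers, one per request (lookup done elsewhere).
    baseline: dict of typical daily request counts per ASN; missing ASNs count as 0.
    """
    counts = Counter(todays_asns)
    return {
        asn: count
        for asn, count in counts.items()
        if count >= min_requests and count > factor * baseline.get(asn, 0)
    }


# Made-up numbers: AS26548 was seen before, but only at a trickle.
baseline = {14061: 400, 26548: 20}
today = [14061] * 450 + [26548] * 5000 + [64500] * 30

print(asn_spikes(today, baseline))
```

Comparing against a per-ASN baseline rather than a fixed limit is what separates "always noisy" networks from ones that suddenly ramp up.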
 
Since I put managed challenge on my search, excluding logged in users, I haven't had the issue. They gave up on me. My only repeat offender now is BingBot, which keeps getting caught on search queries.
 
Could you expand in detail how you did that?
I think he did already a few posts back:

Code:
(
    not http.cookie contains "xf_session="
    and not http.cookie contains "xf_user="
)
and not cf.client.bot

I'm going to try the below one, as I think session is used for guests too, so just to exclude logged in users:

Code:
(not http.cookie contains "xf_user=" and not cf.client.bot)

Managed Challenge. After it has been running for close to 24hrs: 1.5M events with only 2k solved. I also have a prior rule for (cf.client.bot) to Skip managed rules, which was about 50k events. Go into the CF settings for the site, Security > Settings > Challenge Passage, and change the managed challenge time to something like a day, so it doesn't upset anyone but still stops everything non-human that one time.

Not sure if it's totally correct, but it's working. If there is an issue in the above, the brains trust here will tell me. Took my proxycheck account down to a few dollars monthly.

View attachment 335759

Oh, and my server is locked down to only accept traffic via Cloudflare, no direct IP access other than my static IP I have.
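
The allow-list logic behind such a lockdown looks roughly like this in Python. In practice it would normally live in the firewall or web server config rather than in application code. The two networks are only example entries and should be replaced with the full, current list that Cloudflare publishes; the admin address is a hypothetical placeholder from a documentation range.

```python
import ipaddress

# Example entries only; replace with the full, current list that Cloudflare publishes.
CLOUDFLARE_NETWORKS = [
    ipaddress.ip_network(net) for net in ("173.245.48.0/20", "104.16.0.0/13")
]
# Hypothetical static admin IP that may reach the origin directly.
ADMIN_IPS = {ipaddress.ip_address("203.0.113.10")}


def origin_should_accept(peer_ip: str) -> bool:
    """True if the connecting peer is in a listed proxy network or is the admin IP."""
    addr = ipaddress.ip_address(peer_ip)
    if addr in ADMIN_IPS:
        return True
    return any(addr in net for net in CLOUDFLARE_NETWORKS)


for ip in ["104.16.1.1", "203.0.113.10", "8.8.8.8"]:
    print(ip, origin_should_accept(ip))
```

Without a lockdown like this, scrapers that learn the origin IP can bypass the Cloudflare rules entirely.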

If you wanted to be less restrictive, you could just target the real issue where from my logs these bots are hitting primarily /search with a username.

Code:
(http.request.uri.path contains "/search/" or http.request.uri.path contains "/register") and not http.cookie contains "xf_user=" and not cf.client.bot

Obviously that would have to be in the expression builder, as it won't let you build that via the WYSIWYG interface.

I guess an even less restrictive solution, solely to stop search abuse, is:
Code:
(http.request.uri.path contains "/search/" and not http.cookie contains "xf_user=")
using a managed challenge with a 24hr setting.
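
To compare which requests each of the three expressions would challenge, here is a small Python model of them. It only imitates the substring matching of `contains` on the path and cookie header; it is not Cloudflare code, and the function names and sample requests are made up.

```python
def is_guest(cookie_header: str) -> bool:
    """Mirror of: not http.cookie contains "xf_user=" """
    return "xf_user=" not in cookie_header


def rule_all_guests(path: str, cookie: str, verified_bot: bool) -> bool:
    # Broadest rule: every guest request that is not a verified bot.
    return is_guest(cookie) and not verified_bot


def rule_search_and_register(path: str, cookie: str, verified_bot: bool) -> bool:
    # Only /search/ and /register, guests only, verified bots exempt.
    return ("/search/" in path or "/register" in path) and is_guest(cookie) and not verified_bot


def rule_search_only(path: str, cookie: str, verified_bot: bool) -> bool:
    # Narrowest rule; note it has no verified-bot exemption.
    return "/search/" in path and is_guest(cookie)


# (label, path, cookie header, is a verified bot)
requests = [
    ("guest on a thread", "/threads/123/", "xf_session=abc", False),
    ("guest on search", "/search/42/?q=foo", "xf_session=abc", False),
    ("member on search", "/search/42/?q=foo", "xf_session=abc; xf_user=1%2Cxyz", False),
    ("verified bot on search", "/search/42/?q=foo", "", True),
]

for label, path, cookie, bot in requests:
    print(
        f"{label:24}",
        rule_all_guests(path, cookie, bot),
        rule_search_and_register(path, cookie, bot),
        rule_search_only(path, cookie, bot),
    )
```

Because the narrowest expression does not include `not cf.client.bot`, verified crawlers that run search queries get challenged as well.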

I'm not even sure it's search abuse. Going by the previous discussion here about AIs scraping data to strip the anonymity of usernames and piecing it together with other data online to find real names, my logs show that this recent influx is mainly targeting /search/username and grabbing content per username. So MAYBE the less restrictive block is all that's needed to stop the real issue, which is quite recent: Governments / Corporations building their own databases and stripping online anonymity away via complex AI logic?

Managed Challenge on them. If my understanding is correct, then for any real human in a browser the CF managed challenge will auto-solve, and it's nothing more than a few seconds for the user. Everything else gets presented with a captcha to solve, and it seems the CF system works effectively, since bots aren't solving it. Correct me if wrong, that is my understanding.
 
Could you expand in detail how you did that?
The specific expression for search is: (http.request.uri.path contains "/search/" and not http.cookie contains "xf_user=")

I found the abuse was hitting /search/query, which is understandable: to me that is either a good way to bring a website to a crawl OR a fast way to scrape all content by user, which is what I found in my logs, primarily by username. Sometimes the query is blank too. Regardless, a managed challenge on /search/ stopped my issue. They have tried here and there, but they give up after a thousand attempts or so.

Even others, like Facebook, Apple and SEMRush are trying to run search queries... which is just BS.

This brought my uniques from 1M+ daily to 150k daily.

Screenshot 2026-04-20 064334.webp
 
To expand, below are two of the bad queries where they keep probing my site but then give up. Look at the search query. By username from Brazil. Empty one from Japan... but because they get caught, their software seems to give up because they aren't getting through.

Screenshot 2026-04-20 064956.webp
Screenshot 2026-04-20 064736.webp
 