Crazy amount of guests

thomas1 · Jan 25, 2026

We have completely blocked Bytespider, ByteDance, and BytePlus. They're just a waste of bandwidth and server resources.

Robf23 · Jan 25, 2026

Cloudflare has solved all the problems for us. It’s also filtering VPNs effectively, so even the spoof registrations have dropped significantly.

ES Dev Team · Jan 25, 2026

Found more, and it looks like these bots saturated the available TCP/IP ports numerous times last night.

Code:

#bytedance bots - unlabeled 01/25/2026 - DS
Deny from 45.78.192.0/18
Deny from 101.47.0.0/18
Deny from 101.47.112.0/20
Deny from 101.47.128.0/17

ES Dev Team · Jan 25, 2026

thomas1 said:
We have completely blocked Bytespider, ByteDance, and BytePlus. They're just a waste of bandwidth and server resources.

How do you do it?

Of course I have these agent strings banned, but since last night the ByteDance bots no longer identify themselves as bots in the user-agent string.
And a few new IP address ranges appeared while I was sleeping, so I didn't catch them.

I use Apache .htaccess to ban those ranges, then fail2ban turns that into an iptables ban. Not the most efficient, but it works, and it's nice to just edit a ban file without having to restart processes etc.

A big problem on my server is that they always hit http:// first and this redirects to https://
This means 2 TCP/IP connections per hit, and this botnet is huge, so process limits etc. are easily overwhelmed.

I have some TCP/IP tuning that raises the available ports to 2048. Despite bumping up many Apache settings, it still chokes at around 1000 connections total.

I am using the default mod_php and mpm_prefork and might have to move to the slower FPM. I'm not happy about it, but...
The show must go on!

Jja · Jan 25, 2026

ES Dev Team said:
How do you do it?

You won't like it - Cloudflare

beerForo · Jan 25, 2026

Cloudflare
Major AI Bots Blocked
GPTBot (OpenAI - ChatGPT)
ClaudeBot and Claude-Web (Anthropic - Claude)
CCBot (Common Crawl - used by many AI models)
Meta-ExternalAgent and Meta-ExternalFetcher (Meta - AI training)
Bytespider (ByteDance - TikTok/Doubao)
Amazonbot (Amazon)
Applebot (Apple)
Anthropic-AI
Google-Extended (Used for Bard/Gemini training)
Google-CloudVertexBot (Google Vertex AI)
PerplexityBot and Perplexity-User (Perplexity AI)
DuckAssistBot (DuckDuckGo)
TikTokSpider (ByteDance)
ImagesiftBot

smallwheels · Jan 25, 2026

Jja said:
You won't like it - Cloudflare

Just wondering how much they may have improved. A couple of weeks ago people reported here on the forum that Cloudflare bot protection would not work against the latest strategies of the scrapers (residential proxies). The huge waves of massive requests from single IP blocks that happened last year seem to be gone now for the most part; at least on my forum I haven't seen anything like that in a while. What I do see, however, is a permanent stream of requests from residential proxies from all around the world. I can identify them easily, as my normal audience is 99% from Germany, Austria and Switzerland. A couple of other countries occur as well, but not the countries I see and not in the amounts I see. So 99% of the requests that don't come from the DACH region are scrapers in my case.

The second pattern is that each IP typically makes only a single request, usually targeting an older thread or a post within it, an attachment or a user profile. A normal visitor sends a couple of requests for each page he visits, so it is easy to identify the bots, but only in hindsight. Cloudflare would have the possibilities to do this better, but to do it really well they would have to know about the content they act as a proxy for, which they hopefully don't.
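The "one request per IP" pattern can be measured after the fact from an access log. The following is only a minimal Python sketch: the function name `single_hit_share` and the sample addresses (taken from the documentation ranges) are made up for illustration, and how the client IPs are extracted from the log is left open.

```python
from collections import Counter

def single_hit_share(client_ips):
    """Return the fraction of distinct IPs that made exactly one request."""
    hits = Counter(client_ips)
    if not hits:
        return 0.0
    singles = sum(1 for count in hits.values() if count == 1)
    return singles / len(hits)

# Hypothetical sample: one IP per logged request, in log order.
sample_ips = ["203.0.113.1", "203.0.113.1", "198.51.100.7", "192.0.2.9"]
share = single_hit_share(sample_ips)
print(f"{share:.0%} of distinct IPs made a single request")
```

A high share on its own proves nothing, but tracked over time and split by country it gives a rough number for the pattern described above.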

So when you write

Robf23 said:
Cloudflare has solved all the problems for us.

I wonder if that is really the case, or if you just don't see the scrapers any more because they now act like a swarm of mosquitoes and no longer like an elephant.

Robf23 said:
It’s also filtering VPNs effectively, so even the spoof registrations have dropped significantly.

I have zero spam issues on my forum, despite not using Cloudflare. I've been blocking malicious IP ranges and ASNs pretty radically for quite a while now, and it seems that most of the automated spam registration attempts in fact come from Russia, directly or indirectly. It seems to be only a relatively small number of different actors, but they use IPs from all over the world, including a lot of hosters that also seem to trace back to Russia in one way or another. The manual attempts seem a bit more widely spread but are often from India or Pakistan.

Ozzys Spaminator catches the bots reliably, a couple of countries are not allowed to register anyway, and the occasional bad guy that gets around that gets caught by Xons Registration and Multiaccount Blocker. There is not much for it to do, however: maybe one or two over the last six months.

The residential proxies, however, are indeed a problem, as they use normal dial-up connections and the computers of regular home users, who often won't know about it. They even use mobile phones and in fact act like a botnet of the kind used for DDoS 20 years ago. Each request comes from a different machine, and somewhere in the middle there's a spider that orchestrates this distributed scraping. Pretty hard to detect if you have a very international forum, and pretty hard to get rid of if you don't want to block private client networks / dial-ups to a massive extent, creating massive collateral damage. About half of the requests by residential proxies on my forum come from the US, by the way, from all major providers of private internet access as well as from a lot of smaller ones.

So when you say Cloudflare solved your bot-problem I ask: How do you know that it is solved?

smallwheels · Jan 25, 2026

beerForo said:
Cloudflare
Major AI Bots Blocked
GPTBot (OpenAI - ChatGPT)
ClaudeBot and Claude-Web (Anthropic - Claude)
CCBot (Common Crawl - used by many AI models)
Meta-ExternalAgent and Meta-ExternalFetcher (Meta - AI training)
Bytespider (ByteDance - TikTok/Doubao)
Amazonbot (Amazon)
Applebot (Apple)
Anthropic-AI
Google-Extended (Used for Bard/Gemini training)
Google-CloudVertexBot (Google Vertex AI)
PerplexityBot and Perplexity-User (Perplexity AI)
DuckAssistBot (DuckDuckGo)
TikTokSpider (ByteDance)
ImagesiftBot

This is worth absolutely nothing. These are all bots that don't hide - you can simply block them yourself via .htaccess (and most of them even via robots.txt) within minutes. No need for Cloudflare here. Those are for sure not the problem.

smallwheels · Jan 25, 2026

ES Dev Team said:
Of course I have these agent strings banned, but since last night the ByteDance bots no longer identify themselves as bots in the user-agent string.
And a few new IP address ranges appeared while I was sleeping, so I didn't catch them.

I would block them proactively. Those are server ranges, so they should not visit your forum anyway. The IPs you posted belong to AS150436 and this belongs to Byteplus Pte. Ltd. (which is basically Bytedance). If you want to block the whole block based on IPs instead of simply the ASN you can look here:

ASN-Blocklist - 150436

and end up with:

Deny from 45.78.192.0/18
Deny from 69.5.0.0/20
Deny from 69.5.16.0/21
Deny from 69.5.24.0/23
Deny from 69.5.26.0/23
Deny from 69.5.28.0/23
Deny from 69.5.30.0/23
Deny from 71.18.227.0/24
Deny from 98.96.226.0/24
Deny from 98.98.103.0/24
Deny from 101.45.255.0/24
Deny from 101.47.0.0/19
Deny from 101.47.32.0/24
Deny from 101.47.33.0/24
Deny from 101.47.34.0/23
Deny from 101.47.36.0/22
Deny from 101.47.40.0/21
Deny from 101.47.48.0/20
Deny from 101.47.64.0/20
Deny from 101.47.80.0/21
Deny from 101.47.88.0/22
Deny from 101.47.92.0/23
Deny from 101.47.95.0/24
Deny from 101.47.96.0/23
Deny from 101.47.98.0/24
Deny from 101.47.128.0/18
Deny from 128.1.127.0/24
Deny from 128.1.169.0/24
Deny from 128.1.235.0/24
Deny from 129.227.102.0/24
Deny from 145.223.128.0/18
Deny from 150.5.128.0/17
Deny from 156.59.33.0/24
Deny from 163.7.0.0/17
Deny from 163.7.160.0/20
Deny from 163.7.176.0/20
Deny from 163.7.192.0/18
Deny from 187.42.0.0/17
Deny from 202.52.224.0/21
Deny from 202.52.252.0/22
Deny from 207.166.160.0/19
Deny from 216.19.0.0/18
Deny from 2401:4c20::/38

To be sure a quick check with bgp.tools does not hurt:

AS150436 Byteplus Pte. Ltd. - bgp.tools

AS150436 (Byteplus Pte. Ltd.)'s is a 3 year old BGP network that is peering with 75 other networks and has 9 upstream carriers

bgp.tools

You'll see that it probably won't hurt simply blocking the whole thing.
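A list like this can be sanity-checked before it goes into .htaccess. The following is a minimal Python sketch, not something provided by the blocklist site or by Apache; the function names `collapse_ranges` and `is_blocked` are made up for illustration. It uses the standard `ipaddress` module to merge adjacent ranges and to test whether an address is covered.

```python
import ipaddress

def collapse_ranges(cidrs):
    """Merge adjacent or overlapping CIDR blocks (IPv4 and IPv6 handled separately)."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    v4 = [n for n in nets if n.version == 4]
    v6 = [n for n in nets if n.version == 6]
    # collapse_addresses() refuses mixed IP versions, hence the split.
    return list(ipaddress.collapse_addresses(v4)) + list(ipaddress.collapse_addresses(v6))

def is_blocked(ip, networks):
    """True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Seven consecutive entries copied from the list above.
sample = [
    "101.47.0.0/19", "101.47.32.0/24", "101.47.33.0/24", "101.47.34.0/23",
    "101.47.36.0/22", "101.47.40.0/21", "101.47.48.0/20",
]
merged = collapse_ranges(sample)
for net in merged:
    print(f"Deny from {net}")
```

These seven entries merge into the single range 101.47.0.0/18, which is one of the ranges posted earlier in the thread. Fewer lines mean less for Apache to evaluate per request, and the membership check makes it easy to confirm that an address seen in the logs is already covered.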

philmckrackon · Jan 26, 2026

They ignore robots.txt. Use .htaccess.

Post in thread 'How to block Robot ByteDance' https://xenforo.com/community/threads/how-to-block-robot-bytedance.231581/post-1749652

chillibear · Jan 26, 2026

ES Dev Team said:
I am using the default mod_php and mpm_prefork and might have to move to the slower FPM. I'm not happy about it, but...
The show must go on!

ES Dev Team said:
A big problem on my server is that they always hit http:// first and this redirects to https://
This means 2 TCP/IP connections per hit, and this botnet is huge, so process limits etc. are easily overwhelmed.

Could it be worth using a combination of Apache for your secure stuff and general serving and then something a little more lightweight such as Nginx to just sit and do 301/302 redirects on port 80 and nothing more? Granted that's not getting rid of unwanted traffic, but it might avoid tying up those (larger because of mod_php) Apache processes on really mundane stuff.
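To illustrate how little a redirect-only listener has to do, here is a minimal Python sketch using only the standard library. It is not a replacement for Nginx and says nothing about performance; the names `https_location`, `RedirectOnly` and `serve`, the fallback host `forum.example` and port 8080 are all assumptions for the example.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def https_location(host_header, path, default_host="forum.example"):
    """Build the https:// target for a plain-http request."""
    host = host_header or default_host
    if host.startswith("["):
        host = host.partition("]")[0] + "]"   # bracketed IPv6 literal, drop any :port
    else:
        host = host.split(":", 1)[0]          # hostname or IPv4, drop any :port
    return f"https://{host}{path}"

class RedirectOnly(BaseHTTPRequestHandler):
    """Answers every GET/HEAD with a 301 and does nothing else."""

    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", https_location(self.headers.get("Host"), self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    do_HEAD = do_GET

    def log_message(self, format, *args):
        pass  # keep the redirect listener quiet

def serve(port=8080):
    """Run the redirect-only listener (blocks until interrupted)."""
    ThreadingHTTPServer(("", port), RedirectOnly).serve_forever()
```

No PHP, no rewrite rules and no file access are involved, which is the point of the suggestion: the first of the two connections is answered by something far lighter than a mod_php Apache process. A real setup would pin the hostname instead of echoing the Host header back.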

smallwheels said:
What I do see, however, is a permanent stream of requests from residential proxies from all around the world.

Alas it's now an easy to purchase "service", there are several such as this around now. I would assume (given even I've thought about doing it) that Cloudflare have subscriptions to these services and use those subscriptions to identify the IP addresses in use and at least weight that in their analysis. I can't see another really easy way, except for large scale access pattern analysis (as you've already mooted), to identify compromised "home" IP addresses (well, not really compromised, since I assume these are either paid-for lines or ones where the home user is being paid for their use - probably against the T&C). Alas I doubt it'll get better as more of the world hooks up to faster home connections.

ES Dev Team · Jan 27, 2026

smallwheels said:
I would block them proactively. Those are server ranges, so they should not visit your forum anyway. The IPs you posted belong to AS150436 and this belongs to Byteplus Pte. Ltd. (which is basically Bytedance). If you want to block the whole block based on IPs instead of simply the ASN you can look here:

ASN-Blocklist - 150436

and end up with:

Deny from 45.78.192.0/18
Deny from 69.5.0.0/20
Deny from 69.5.16.0/21
Deny from 69.5.24.0/23
Deny from 69.5.26.0/23
Deny from 69.5.28.0/23
Deny from 69.5.30.0/23
Deny from 71.18.227.0/24
Deny from 98.96.226.0/24
Deny from 98.98.103.0/24
Deny from 101.45.255.0/24
Deny from 101.47.0.0/19
Deny from 101.47.32.0/24
Deny from 101.47.33.0/24
Deny from 101.47.34.0/23
Deny from 101.47.36.0/22
Deny from 101.47.40.0/21
Deny from 101.47.48.0/20
Deny from 101.47.64.0/20
Deny from 101.47.80.0/21
Deny from 101.47.88.0/22
Deny from 101.47.92.0/23
Deny from 101.47.95.0/24
Deny from 101.47.96.0/23
Deny from 101.47.98.0/24
Deny from 101.47.128.0/18
Deny from 128.1.127.0/24
Deny from 128.1.169.0/24
Deny from 128.1.235.0/24
Deny from 129.227.102.0/24
Deny from 145.223.128.0/18
Deny from 150.5.128.0/17
Deny from 156.59.33.0/24
Deny from 163.7.0.0/17
Deny from 163.7.160.0/20
Deny from 163.7.176.0/20
Deny from 163.7.192.0/18
Deny from 187.42.0.0/17
Deny from 202.52.224.0/21
Deny from 202.52.252.0/22
Deny from 207.166.160.0/19
Deny from 216.19.0.0/18
Deny from 2401:4c20::/38

To be sure a quick check with bgp.tools does not hurt:

AS150436 Byteplus Pte. Ltd. - bgp.tools

AS150436 (Byteplus Pte. Ltd.)'s is a 3 year old BGP network that is peering with 75 other networks and has 9 upstream carriers

bgp.tools

You'll see that it probably won't hurt simply blocking the whole thing.

Thank you so much. Very useful. I recently learned the whois command in Linux to get IP blocks, and it has helped me put together lists more rapidly than using a program that analyzes fail2ban's ban lists. This is more powerful.

chillibear said:
Could it be worth using a combination of Apache for your secure stuff and general serving and then something a little more lightweight such as Nginx to just sit and do 301/302 redirects on port 80 and nothing more? Granted that's not getting rid of unwanted traffic, but it might avoid tying up those (larger because of mod_php) Apache processes on really mundane stuff.

You know, this isn't a bad idea. You get the strengths of both engines at what I guess is a very low added expense.
You are still getting 2 connections, but the first one you can drop a lot quicker.

chillibear said:
Alas it's now an easy to purchase "service", there are several such as this around now. I would assume (given even I've thought about doing it) that Cloudflare have subscriptions to these services and use those subscriptions to identify the IP addresses in use and at least weight that in their analysis. I can't see another really easy way, except for large scale access pattern analysis (as you've already mooted), to identify compromised "home" IP addresses (well, not really compromised, since I assume these are either paid-for lines or ones where the home user is being paid for their use - probably against the T&C). Alas I doubt it'll get better as more of the world hooks up to faster home connections.

Ugh, awful. How is that even legal.

Large scale automated analysis and subnet banning is possible with PHP. I know because I watched another guy build that in PHP with a database. I just didn't like how it performed, so it's an idea I would not copy verbatim.

What's disappointing with Cloudflare is that they do not seem to be ahead of malicious traffic despite having an enormous information advantage. XenForo sites the size of mine experience the same giant influx of residential proxies, with nearly identical numbers, at different times. And outside of those influxes, the number of concurrent guests is about the same over a 24 hour period.

I would like to know that there's a better alternative waiting for me, but that doesn't seem to be the case. Cloudflare seems to require manual cultivation of ban lists for high traffic sites, just like my system does, or other periodic tuning. Correct me if I'm wrong.

Suzanne O · Jan 27, 2026

Hit the Under Attack button when on Cloudflare. It lets them know that your site is under a DDoS attack.

Alvin63 · Jan 27, 2026

philmckrackon said:
They ignore robots.txt. Use .htaccess.

Post in thread 'How to block Robot ByteDance' https://xenforo.com/community/threads/how-to-block-robot-bytedance.231581/post-1749652

That definitely helped me thanks to your info

Although I believe that can't be done on Cloud xenforo service ....... is that what you're on @ES Dev Team?

ES Dev Team · Jan 27, 2026

Nah, I run my own Ubuntu server on AWS and use fail2ban, which uses iptables for banning (very fast).
I use .htaccess because it's easy and it can send hints to fail2ban.

chillibear · Jan 27, 2026

ES Dev Team said:
Ugh, awful. How is that even legal.

It certainly raises an eyebrow doesn't it. Well there are half a dozen companies offering proxy services. The one I linked to actually uses https://pawns.app/ to supply the end user client devices and bandwidth - in essence end-users get $0.20 per GB of bandwidth they supply. Hell if you're on an unlimited line, why not I can hear many people saying (although I wonder if the T&C for residential lines might prohibit that strictly speaking). The others I'm aware of are https://www.nimbleway.com/pricing, https://netnut.io/static-residential-proxies/ and https://asocks.com/en/ourproxy/, but I imagine there are plenty of others. The whole using end-user connections is an interesting one and certainly has its legitimate uses. I use Global Ping for instance to debug routing issues a couple of times a year.

ES Dev Team said:
ubuntu server on AWS

FreeBSD on our own hardware here split over a couple of DCs.

smallwheels · 2026-02-03T08:17:24+0000

smallwheels said:
What I do see, however, is a permanent stream of requests from residential proxies from all around the world. I can identify them easily, as my normal audience is 99% from Germany, Austria and Switzerland. A couple of other countries occur as well, but not the countries I see and not in the amounts I see. So 99% of the requests that don't come from the DACH region are scrapers in my case.

As I was pretty fed up with residential proxies, I've now set up pretty excessive geoblocking for the countries these users come from, plus blocking of selected ASNs from countries that I could not or did not want to block completely. For my forum I can state that currently between 30 and 50% of the IPs visiting come from residential proxies (and to a small extent from other bogus sources), so they are basically scrapers. The share goes up and down a bit but seems to have climbed considerably overall over the last few days. That seems quite a lot. On top of that, an unknown amount is filtered out by other triggers before the measuring point, so in fact it is even worse.

When you look at the guest numbers here at the XenForo forums you can see that as well: the number of guest users shown is on the rise again. Over the last weeks it was "down" to noticeably below 2.000 guests at any given time, yesterday afternoon CET there were

and right now in the early morning there are

A member:guest-ratio of 1:400 seems pretty insane for a forum like this. So there seems to be another big wave in process at the moment. I don't recall exactly what the highest numbers were I saw on here a couple of months ago for quite a while - I think it might have been between 7.000 and 8.000.

The top countries where the proxy visits to my forum come from vary a bit, but the US is always within the top five, often at number one. Other countries that regularly rank in the top 10 on my forum are e.g. Brazil, Argentina, India and Vietnam. In total I've currently geo-blocked ~75 countries.
The networks of the big providers of private internet access in the US seem to be pretty poisoned, so bad news for those of you with a large regular audience from the US. There seems to be not much you can do effectively against them if you can't block those providers, because you would then block your regular users as well. Recent research studies state that only about 10% of residential proxies are discovered by services like Cloudflare and others.
In my case I do have a blind spot in the countries within central Europe that I cannot block due to regular visitors from there, and where I also cannot reliably identify residential proxies for the most part at the moment. So I don't see the number of residential proxies from there, and it comes on top of the numbers given further up. I identified a bunch from France but cannot say anything about Germany. One of the sellers of residential proxies claims to have 14 million residential proxies available in Germany alone, which seems way too high and not plausible at all, given that there are only about 41 million private households in Germany.

The selling of those private proxies seems a pretty shady business when you look a bit closer. Research states that often the same actors sell under different brands, and that they love to resell on top of that to be able to hide better. Also, there effectively seem to be only four different pools of IPs on the market for all offerings of residential proxies, and only part of those IPs are used with the consent of the private owners of the line. The rest comes from tricking owners through hidden TOS clauses in some free software and apps, from hidden proxy functions in SDKs that developers use and thus unknowingly ship with their software, from infections of IoT devices with weak security, or from devices that are already sold with a hidden proxy function on board (with IoT devices meaning anything from fridges to IP cameras to TV sticks). Especially with cheap, Android based TV sticks from Chinese factories this seems to be pretty common.
Consensual sharing of one's own internet access shrinks as the level of education and the economic strength of a country rise; therefore a lot of the proxies with owner consent are in developing countries and far fewer in developed countries. Some sellers also create their own infrastructure, renting servers from hosting companies or from resellers of hosting companies, or even buying huge amounts of mobile phones (as proxies on cellular are the most expensive variant in terms of cost per GB for buyers).

While there are no doubt legitimate uses of these proxies imaginable (and the sellers refer to those excessively), research states that they make up only a very small fraction of the market. All the more so as, for those legitimate uses, there are less shady and cheaper alternatives in almost every case. As the use of those proxies is paid per GB of traffic and the traffic is pretty expensive, there is barely any reason why an honest buyer would deal with those shady companies and use this infrastructure.

lazy llama · 2026-02-03T10:42:16+0000

The bots seem to come in waves, each adapted to work around rules that have been implemented to combat them.
Our current load of 3000+ guest bots seems to be making requests to the XenForo image proxy, with thousands of queries for images which don't exist in our cache, e.g.

Code:

165.22.177.180 - - [03/Feb/2026:09:42:48 +0000] "GET /forums/proxy.php?image=https%3A%2F%2Fassets.rebelmouse.io%2FeyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpbWFnZSI6Imh0dHBzOi8vYXNzZXRzLnJibC5tcy8yMTE2MTg0Ny9vcmlnaW4uanBnIiwiZXhwaXJlc19hdCI6MTYyMTM1MDk1MH0.lVlxgvI6iHOD1y2TY0TvgL8hPZgkujy1HCwpSoA1DxQ%2Fimg.jpg%3Fwidth%3D1200%26coordinates%3D0%252C40%252C0%252C40%26height%3D600&hash=736cce0227fae27ea58e52c7297641b9&return_error=1 HTTP/2.0" 404 5 "https://www.ourdomain.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
68.183.128.243 - - [03/Feb/2026:09:42:48 +0000] "GET /forums/proxy.php?image=https%3A%2F%2Fsistemaplastics.com%2F%2Fimages%2Fresizer_cache%2Fassets%2Fproducts%2FMICROWAVE%2FCOLOURED_Microwave%2F21117_EasyEggs_MicrowaveColoured_Purple_Wrap_Vent_258_350_90.jpg&hash=89adef1c16fc02b4de0170b25fd1e9b2&return_error=1 HTTP/2.0" 404 5 "https://www.ourdomain.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
157.245.134.15 - - [03/Feb/2026:09:42:48 +0000] "GET /forums/proxy.php?image=https%3A%2F%2Fblog.datawrapper.de%2Fcow-milk-and-vegan-milk-alternatives%2F..%2Fimg%2Ffavicon.ico&hash=7fe76782378136df9e33a8a3366f698e&return_error=1 HTTP/2.0" 404 5 "https://www.ourdomain.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"
5.36.154.175 - - [03/Feb/2026:09:42:48 +0000] "GET /forums/proxy.php?image=https%3A%2F%2Fwww.independent.ie%2Fbusiness%2Fbrexit%2Ff4511%2F39838816.ece%2FAUTOCROP%2Fw1240h700%2FTim_Cullinan&hash=232e0ff6fa4db2f841f63f9d9cb2747d&return_error=1 HTTP/2.0" 404 5 "https://www.ourdomain.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"

As you can see, they're from widely different IPs, with IPs not being reused often, always with the same (impossible) referrer, and randomly mutating user-agent.
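The common features of these log lines (image proxy URL, 404 status, site root as referrer) can be turned into a simple filter. The following Python sketch assumes the combined log format shown above; the function name `is_suspect_proxy_hit` and the shortened sample line with a documentation IP are made up for illustration.

```python
import re

# Combined log format as in the excerpt above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def is_suspect_proxy_hit(line, site_root="https://www.ourdomain.com/"):
    """True for image-proxy requests that 404 and claim the site root as referrer."""
    m = LOG_RE.match(line)
    if not m:
        return False
    return (
        m["path"].startswith("/forums/proxy.php?")
        and m["status"] == "404"
        and m["referer"] == site_root
    )

# Hypothetical, shortened sample line in the same format.
sample = (
    '203.0.113.5 - - [03/Feb/2026:09:42:48 +0000] '
    '"GET /forums/proxy.php?image=https%3A%2F%2Fexample.org%2Fa.jpg&hash=abc&return_error=1 HTTP/2.0" '
    '404 5 "https://www.ourdomain.com/" "Mozilla/5.0"'
)
print(is_suspect_proxy_hit(sample))
```

Because the user-agent mutates and the IPs rotate, the filter deliberately ignores both and keys only on the parts of the request that stay constant.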

Previously I've used Cloudflare rules to challenge requests for certain URLs - for example lots of requests to the style-chooser - where referrers were blank, as well as challenging the usual suspect countries and ASNs.

chillibear · 2026-02-03T11:06:33+0000

smallwheels said:
A member:guest-ratio of 1:400 seems pretty insane for a forum like this. So there seems to be another big wave in process at the moment.

Yep can confirm that in our traffic patterns and load over the last week or two. Evidently time for an AI refresh or something!

smallwheels said:
There seems not much you can do effectively against them if you can't block those providers b/c you would block your regular users as well then.

Realistically there are only three options that spring to mind:

1.) More detailed user profiling - ie a "normal" user will request HTML, JS, Images, send cookies and so forth that make sense in a normal user journey. Direct scrapers may fall down on this, but given those mobile phone farms you see in action it's clear some of this is going to be genuine browsers probably doing all that anyway. If it's a real browser being driven then it'll be nearly impossible I think to identify if it's a human or good simulation driving that browser. I think on this front we just have to hope that for them it's not worth the effort of doing a "really good job" and we can spot the mistakes.
2.) Signing up to some of the proxy services and sending "test" traffic through them to identify the IP addresses and publishing them. Not that I really want to sign up with any of these rather dodgy seeming companies, but I don't really see a way of identifying the IPs on their books otherwise. Someone may already do this and I really should properly look.
3.) Data sharing (ala stopforumspam) type services, but this feels quite a nuanced issue and I'd worry about high false positives. For instance this week we've had a load of traffic from Vietnam, but I know we do have two or three legitimate members there. So evidently any blacklisting for me would need to have holes punched for their IPs, which may well change. So it's a bit of a moving target. Still an "abnormal for my forum" type data feed is a possibility I guess.
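The first option, profiling, can be sketched in its crudest form: flag clients that fetch pages but never any of the static assets a browser would load with them. This is only an illustrative Python sketch; the function name `pages_without_assets`, the suffix list and the sample requests are assumptions, and browser caching or assets served from another host will produce false positives.

```python
from collections import defaultdict

# Assumed set of static asset suffixes; adjust to what the site actually serves.
ASSET_SUFFIXES = (".js", ".css", ".png", ".jpg", ".gif", ".webp", ".svg", ".ico", ".woff2")

def pages_without_assets(requests):
    """requests: iterable of (ip, path). Return IPs that fetched pages but no static assets."""
    seen = defaultdict(lambda: {"pages": 0, "assets": 0})
    for ip, path in requests:
        is_asset = path.split("?", 1)[0].lower().endswith(ASSET_SUFFIXES)
        seen[ip]["assets" if is_asset else "pages"] += 1
    return sorted(ip for ip, c in seen.items() if c["pages"] and not c["assets"])

# Hypothetical sample using documentation IP ranges.
sample_requests = [
    ("203.0.113.10", "/threads/a.1/"),
    ("203.0.113.10", "/js/xf/core.js?_v=1"),
    ("198.51.100.4", "/threads/b.2/"),
    ("198.51.100.4", "/members/c.3/"),
]
print(pages_without_assets(sample_requests))
```

A real browser driven by a phone farm would pass this check, so it only catches the scrapers that do not bother to do a "really good job".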

smallwheels said:
One of the sellers of residential proxies claims to have 14 million residential proxies available in Germany alone, which seems way too high and not plausible at all, given that there are only about 41 million private households in Germany.

Yep. Given there are about 140M IP addresses currently associated with Germany, having 10% of those compromised like this seems off. Especially when you look, as you rightly point out, at the number of households (and the population is about double that), it seems too boastful.
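Putting the quoted figures side by side, as a quick arithmetic sketch in Python (the 14 million, 41 million and 140 million are the numbers from the posts above, not independently checked):

```python
claimed_proxies = 14_000_000    # seller's claim for Germany
households = 41_000_000         # private households in Germany, as quoted
german_ips = 140_000_000        # IP addresses associated with Germany, as quoted

share_of_ips = claimed_proxies / german_ips          # fraction of all German IPs
share_of_households = claimed_proxies / households   # fraction of all households

print(f"{share_of_ips:.0%} of IPs, {share_of_households:.0%} of households")
```

So the claim would mean one in ten German IP addresses, or roughly one in three households, which is why it reads as boastful.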

smallwheels · 2026-02-03T11:40:17+0000

chillibear said:
Realistically there are only three options that spring to mind:

More detailed user profiling - ie a "normal" user will request HTML, JS, Images, send cookies and so forth that make sense in a normal user journey. Direct scrapers may fall down on this, but given those mobile phone farms you see in action it's clear some of this is going to be genuine browsers probably doing all that anyway. If it's a real browser being driven then it'll be nearly impossible I think to identify if it's a human or good simulation driving that browser. I think on this front we just have to hope that for them it's not worth the effort of doing a "really good job" and we can spot the mistakes.

While I absolutely agree that behavioural pattern matching is probably the most promising way to go, I think this is pretty tough (and in most cases too tough) for a hobby admin. But it could be worth checking whether there are already one or more open source projects around the topic. One could even use AI for that.

chillibear said:
Signing up to some of the proxy services and sending "test" traffic through them to identify the IP addresses and publishing them. Not that I really want to sign up with any of these rather dodgy seeming companies, but I don't really see a way of identifying the IPs on their books otherwise. Someone may already do this and I really should properly look.

I know of at least one IP reputation service that does that and would assume that others do that, too. However, as I am using said service, I can safely say that so far they only identify a fraction of a fraction of the residential proxies. Barely noticeable.

chillibear said:
Data sharing (ala stopforumspam) type services, but this feels quite a nuanced issue and I'd worry about high false positives. For instance this week we've had a load of traffic from Vietnam, but I know we do have two or three legitimate members there. So evidently any blacklisting for me would need to have holes punched for their IPs, which may well change. So it's a bit of a moving target. Still an "abnormal for my forum" type data feed is a possibility I guess.

Could be - information sharing in an automated way could be interesting. It is however challenging, as most of the hosts are short lived. And a huge project to implement.

I would like to add a 4.) and a 5.) to your points, when targeting from a different perspective: How to make the usage of such proxies unattractive. Both have the precondition that you are able to identify a relevant amount of those, which is a bit of a pitfall.

We know two things:

• the scrapers are scraping our forums because they want the content, probably for the training of AI models
• scraping via residential proxies costs per GB of traffic and it is expensive (and often slow)

So my number 4 would be:

4.) poison the content. Send identified scrapers into a huge mess of false, half-true or completely made-up information and let them scrape it to poison the AI models that they are feeding, effectively rendering them useless. It is important to mix true and false information and to include references linking to the real world so that it is not too obvious. Also include huge pictures and graphics to make the traffic expensive. One could create a repository for this stuff ("bogopedia") where people could have fun adding this kind of thing - could even be done using XenForo.

When searching for an existing solution in that direction I stumbled upon this as a possible implementation to use and they do mention a couple more:

https://anubis.techaro.lol/docs/admin/honeypot/overview/

5.) keep them busy. This is what Cloudflare says they do, I think in a blog post from mid last year: create a labyrinth of links for the bots to keep them occupied
