XF 2.3 AI crawlers causing xf_session_activity table to reach limit

bottiger

Active member
So my forum is now getting hammered by AI crawlers, which fill up xf_session_activity and stop the forum from loading. I've seen over 10,000 unique IPs.

Changing the table type to InnoDB is just a temporary band-aid.

I already have Cloudflare's AI crawler blocking enabled, but it doesn't work. The user agents all spoof real browsers and the IPs are all residential proxies.

The only thing that can stop them is enabling JavaScript verification for all users, but this greatly annoys users.

Does anyone have a solution?
 
Obviously, there are many fancier and more intelligent solutions possible, but XF lacks even the plain basics, so an easy-to-access toolkit for them would already improve the situation dramatically.
I disagree. If you are not on their cloud, then XF shouldn’t be responsible for your hosted site. It’s up to you as the owner to configure it to mitigate issues, or hire someone who knows how to.
 
One tiny feature enhancement that would help me would be the ability to sort the Current visitors page by IP. It would be easier to see which IP ranges are hammering the forum at any point in time. Of course, the ability to block IPs, IP ranges, or user agents from this page would make it a bit easier, but I see that this is probably handled at the server level and not the software level.
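Until something like that exists in the admin panel, the grouping can be approximated offline. This is only a minimal sketch in Python, assuming you already have a plain list of visitor IPv4 addresses (from a log or an export); the function name `count_by_network` and the /24 grouping are illustrative choices, not anything XenForo provides.

```python
from collections import Counter
import ipaddress


def count_by_network(ips, prefix=24):
    """Count IPv4 addresses per enclosing network of the given prefix length.

    Returns (network, count) pairs, busiest network first.
    """
    counts = Counter()
    for ip in ips:
        # strict=False lets us pass a host address and get its network back
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        counts[str(net)] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = ["3.0.36.93", "3.0.36.10", "13.213.84.21"]
    for network, hits in count_by_network(sample):
        print(network, hits)
```

A wider prefix (for example 16) groups more aggressively, at the cost of lumping unrelated hosts together.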
 
I disagree. If you are not on their cloud, then XF shouldn’t be responsible for your hosted site. It’s up to you as the owner to configure it to mitigate issues, or hire someone who knows how to.
I didn't say XF would be responsible for my (self-)hosted site. Where did you read that? They are responsible insofar as they should provide a safe application, which they do. They could go beyond the current level by providing insights and tools. It is more in the sense that, if you bought a car in the olden days, you'd typically get a set of tools with it. The manufacturer is not responsible for repairing your car on the road, but he can make it easier for you to deal with it yourself. Following your argument, one could say: a car manufacturer should not deliver a car with a speedometer or a fuel gauge, let alone an oil pressure gauge. It is not his responsibility how fast you drive, whether you run out of fuel, or whether you wreck your engine. It is your responsibility as the owner and driver to deal with all that. No doubt, it is. Yet cars still come with speedometers, fuel gauges, and sometimes even oil pressure gauges as standard from the factory.
 
Yep, if XenForo could include a best-practices document, then people who buy it for self-hosting could have something other than a bad out-of-the-box experience in the future. They can either do this or field a lot of support requests for an issue that's getting worse.

My guest count went up to 3,000 today, but I see 13,000 guests on the XenForo site, so given that we have the same size of content, I'd say that's bad.


Wonder how you other Cloudflare users are faring.

Lately I notice a ton of Chinese bots which are using a very slow scraping method with a very large number of IP addresses. It is clearly a click farm using a massive number of devices, because it's making my Google Analytics skyrocket, unlike previous bot/scraper attacks.

I'm a bit motivated to essentially make an all-local equivalent of the things you can do with Cloudflare, but a little more intelligent, so that operators of independent websites can still operate independent websites today.
 
Insta-ban on certain pages that attract bots (honeypot)
I think Cloudflare now has something similar to that, but without directly calling it a honeypot.


The AI Labyrinth adds invisible links on your webpage with specific Nofollow tags to block AI crawlers that do not adhere to the recommended guidelines and crawl without permission. AI crawlers that scrape your website content without permission will be stuck in a maze of never-ending links, and their details are recorded and used by all Cloudflare customers who choose to block AI bots.
These links do not impact your search engine optimization (SEO) or your website's appearance, and are only seen by bots. AI bots that respect no-crawl instructions will safely ignore this honeypot.

I feel like most general things you can slap onto a XenForo install are going to be inadequate soon.
Agreed, and I've never wanted to do this. Rejecting bots, crawlers, and bad actors is the job of the server (firewall, fail2ban, etc.), not the software running on it (XenForo or anything else).

XenForo could provide some newer tools to get a better view of traffic, but the actual blocking and mitigating of the bad actors does not belong there...or even in htaccess for that matter, as it's then taxing the web server process. I've had fail2ban since I set up my own servers several years ago, and I'm trying to get through Cloudflare yet again as they add new options, change the names of others, change the locations in their menus, and redo their dashboard, each time making it harder to find what I'm looking for.

Lately I notice a ton of Chinese bots
On a somewhat related note, I long ago blocked the Baidu bot using the ASN. I found more Baidu bots today, and found two new ASNs I didn't already have in my block list. I don't see why they should have so many, unless they find ways to get new ASNs assigned to them so they can circumvent those of us blocking them.
 
Yeah, the question is: how well does the AI Labyrinth work? I haven't seen anyone actually use it. I'd say it sounds like a good idea.

I don't find fail2ban taxing on my server at all. It uses iptables to block traffic, which filters packets inside the Linux kernel and is incredibly fast. Amazingly, it has fended off big DDoS attacks from thousands of IP addresses on 2-core servers.

I also filter the Apache logs so they don't include CSS, JS, PNG, GIF, JPG, etc. My logs are therefore 1/20th the size, which reduces the amount of work fail2ban has to do and allows me to set stricter rules :)
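For anyone who cannot change the web server's logging config, the same idea can be applied after the fact. The following is a rough Python sketch, not the filter used above: `is_static` and the `STATIC_EXTENSIONS` set are hypothetical names, and the extensions beyond css/js/png/gif/jpg are my own assumption about what "etc." covers.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of static asset extensions; adjust to match your site.
STATIC_EXTENSIONS = {
    ".css", ".js", ".png", ".gif", ".jpg", ".jpeg",
    ".webp", ".ico", ".svg", ".woff", ".woff2",
}


def is_static(request_target):
    """Return True if the request path ends in a static asset extension."""
    # Drop the query string so "/main.css?v=3" is still recognised
    path = urlsplit(request_target).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in STATIC_EXTENSIONS


if __name__ == "__main__":
    for target in ["/styles/main.css?v=3", "/threads/hello.123/", "/index.php"]:
        print(target, is_static(target))
```

Dropping every request for which `is_static` returns True leaves mostly the hits that reach PHP, which are the ones worth rate-limiting.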

I have a nice report of which IP addresses in the AWS Singapore block are scraping the site. I used AI to write the analysis script and to write a denylist of the top 90% of offenders.

Here's what the analyzer looks at: it counts hits per day and which common response codes the client got:

[Attached image: 1757534922749.webp]

Note: before I put some rules in against these scrapers, they were getting in ~2,000 hits per day.
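The attached scripts are not reproduced in this thread, so the following is only a minimal Python sketch of that kind of analysis, not the actual analyzer. It assumes the Apache/nginx "combined" log format; `LOG_RE` and `summarize` are illustrative names.

```python
import re
from collections import Counter, defaultdict

# Assumes the "combined" log format:
# IP identity user [day/month/year:time zone] "request" status size ...
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) '
)


def summarize(lines):
    """Return {ip: {"hits_per_day": Counter, "statuses": Counter}}."""
    report = defaultdict(
        lambda: {"hits_per_day": Counter(), "statuses": Counter()}
    )
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that do not look like access log entries
        entry = report[match["ip"]]
        entry["hits_per_day"][match["day"]] += 1
        entry["statuses"][match["status"]] += 1
    return dict(report)


if __name__ == "__main__":
    import sys

    # Usage: python analyze.py < access.log
    for ip, data in summarize(sys.stdin).items():
        print(ip, dict(data["hits_per_day"]), dict(data["statuses"]))
```

Sorting the result by total hits gives a quick ranking of the heaviest clients.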

Here's something you can copy and paste into an Apache .htaccess to kill the top 90% of this botnet.
(Unfortunately, AWS Singapore has a super scattered network of addresses, so nothing groups well.)

Deny from 3.0.36.93
Deny from 3.0.37.233
Deny from 3.0.45.179
Deny from 3.0.47.233
Deny from 3.0.150.19
Deny from 3.0.166.8
Deny from 3.0.208.184
Deny from 3.1.28.195
Deny from 3.1.35.115
Deny from 3.1.41.121
Deny from 3.1.55.166
Deny from 3.1.140.69
Deny from 3.1.156.23
Deny from 3.1.175.59
Deny from 3.1.176.125
Deny from 3.1.232.118
Deny from 13.213.84.21
Deny from 13.213.93.162
Deny from 13.213.127.104
Deny from 13.213.171.53
Deny from 13.213.177.145
Deny from 13.213.193.94
Deny from 13.213.209.156
Deny from 13.213.233.255
Deny from 13.214.0.180
Deny from 13.214.22.82
Deny from 13.214.103.223
Deny from 13.214.116.108
Deny from 13.214.244.23
Deny from 13.215.52.199
Deny from 13.215.112.129
Deny from 13.215.124.152
Deny from 13.215.168.159
Deny from 13.215.188.31
Deny from 13.215.229.29
Deny from 13.215.231.211
Deny from 13.228.16.28
Deny from 13.228.74.91
Deny from 13.228.88.208
Deny from 13.228.118.185
Deny from 13.228.137.154
Deny from 13.228.227.67
Deny from 13.229.12
Deny from 13.229.16.251
Deny from 13.229.27.234
Deny from 13.229.31.250
Deny from 13.229.33.21
Deny from 13.229.165.132
Deny from 13.229.166.109
Deny from 13.250.82.126
Deny from 13.250.87.217
Deny from 13.250.132.236
Deny from 13.250.138.232
Deny from 13.250.142.18
Deny from 13.250.202.73
Deny from 13.250.203.189
Deny from 13.250.227.154
Deny from 13.250.252.231
Deny from 13.251.4.217
Deny from 13.251.39.246
Deny from 13.251.50.26
Deny from 13.251.83.43
Deny from 13.251.147.94
Deny from 13.251.241.103
Deny from 18.136.0.106
Deny from 18.136.10.12
Deny from 18.136.16.225
Deny from 18.136.36.22
Deny from 18.136.62.21
Deny from 18.136.83.96
Deny from 18.136.115.59
Deny from 18.136.128.0
Deny from 18.136.160.164
Deny from 18.136.192.146
Deny from 18.136.246.89
Deny from 18.138.13.230
Deny from 18.138.67.243
Deny from 18.138.91.172
Deny from 18.138.160.105
Deny from 18.138.162.156
Deny from 18.138.168.48
Deny from 18.138.206.198
Deny from 18.139.19.113
Deny from 18.139.55.40
Deny from 18.139.77.112
Deny from 18.139.81.237
Deny from 18.140.46.18
Deny from 18.140.93.125
Deny from 18.140.103.95
Deny from 18.140.106.216
Deny from 18.140.141.71
Deny from 18.140.152.11
Deny from 18.140.180.181
Deny from 18.141.87.4
Deny from 18.141.118.94
Deny from 18.141.136.97
Deny from 18.141.242.15
Deny from 18.141.253.176
Deny from 18.142.25.163
Deny from 18.142.89.160
Deny from 18.142.97.33
Deny from 18.142.147.98
Deny from 18.142.153.169
Deny from 18.142.191.196
Deny from 18.142.233.159
Deny from 18.143.56.174
Deny from 18.143.72.152
Deny from 18.143.82.121
Deny from 18.143.128.13
Deny from 18.143.238.96
Deny from 18.143.245.207
Deny from 44.225.224.39
Deny from 46.137.206.146
Deny from 46.137.229.112
Deny from 46.137.241.255
Deny from 46.137.250.236
Deny from 47.128.161.133
Deny from 47.128.189.246
Deny from 47.130.17.238
Deny from 47.130.30.119
Deny from 47.130.41.114
Deny from 47.130.45.182
Deny from 47.130.48.168
Deny from 47.130.57.161
Deny from 47.130.61
Deny from 47.130.63.172
Deny from 47.130.69.52
Deny from 47.130.73.217
Deny from 47.130.76.37
Deny from 47.130.79.23
Deny from 47.130.80.38
Deny from 47.130.86.31
Deny from 52.74.58.47
Deny from 52.74.82.10
Deny from 52.74.129.116
Deny from 52.74.142.177
Deny from 52.74.177.159
Deny from 52.74.185.192
Deny from 52.74.204.10
Deny from 52.74.206.17
Deny from 52.74.247.162
Deny from 52.74.253.55
Deny from 52.76.105.201
Deny from 52.76.111.236
Deny from 52.76.247.228
Deny from 52.77.43.229
Deny from 52.77.55.177
Deny from 52.77.143.193
Deny from 52.77.167.177
Deny from 52.77.171.255
Deny from 52.77.190.52
Deny from 52.77.201.81
Deny from 52.220.55.39
Deny from 52.220.69.218
Deny from 52.220.91.113
Deny from 52.220.99.112
Deny from 52.220.106.52
Deny from 52.220.113.184
Deny from 52.220.130.120
Deny from 52.220.174.74
Deny from 52.220.206.48
Deny from 52.220.240.16
Deny from 52.221.7.10
Deny from 52.221.20.215
Deny from 52.221.80.233
Deny from 52.221.86
Deny from 52.221.108.21
Deny from 52.221.140.211
Deny from 54.68.116.64
Deny from 54.151.168.53
Deny from 54.151.185.30
Deny from 54.151.249.108
Deny from 54.151.255.242
Deny from 54.169.18.221
Deny from 54.169.131.194
Deny from 54.169.145.89
Deny from 54.169.160.27
Deny from 54.169.227.4
Deny from 54.179.3.9
Deny from 54.179.63.7
Deny from 54.179.79.203
Deny from 54.251.56.172
Deny from 54.251.77.221
Deny from 54.251.100.82
Deny from 54.251.125.148
Deny from 54.251.197.106
Deny from 54.251.227.252
Deny from 54.251.253.210
Deny from 54.251.255.21
Deny from 54.254.5.110
Deny from 54.254.24.54
Deny from 54.254.35.214
Deny from 54.254.37.200
Deny from 54.254.54.29
Deny from 54.254.108.206
Deny from 54.254.110.64
Deny from 54.254.148
Deny from 54.254.208.62
Deny from 54.254.218.15
Deny from 54.255.26.49
Deny from 54.255.66.232
Deny from 54.255.157.237
Deny from 54.255.218.6
Deny from 122.248.208.194
Deny from 122.248.249.153
Deny from 175.41.131.8
Deny from 175.41.150.136

This returns a 403 response, a 4xx code that fail2ban picks up from the logs. fail2ban then blocks those addresses at the OS networking level.
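How a list like this could be derived from per-IP hit counts is sketched below in Python. This is an assumption about the method, not the script that produced the list above: it reads "top 90%" as "the busiest IPs that together account for 90% of hits", and `top_offenders` and `deny_lines` are hypothetical helper names.

```python
import ipaddress


def top_offenders(hits_by_ip, share=0.9):
    """Return the busiest IPs that together account for `share` of all hits."""
    total = sum(hits_by_ip.values())
    selected, running = [], 0
    ranked = sorted(hits_by_ip.items(), key=lambda kv: kv[1], reverse=True)
    for ip, hits in ranked:
        if running >= share * total:
            break  # the selected IPs already cover the requested share
        selected.append(ip)
        running += hits
    return selected


def deny_lines(ips):
    """Format IPs as Apache 'Deny from' directives, sorted numerically."""
    return [f"Deny from {ip}" for ip in sorted(ips, key=ipaddress.ip_address)]


if __name__ == "__main__":
    hits = {"3.0.36.93": 80, "13.213.84.21": 15, "52.74.58.47": 5}
    print("\n".join(deny_lines(top_offenders(hits))))
```

Note that `Deny from` is the older Apache 2.2 access syntax; on Apache 2.4 it works only while mod_access_compat is loaded.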
 
XenForo could provide some newer tools to get a better view of traffic, but the actual blocking and mitigating of the bad actors does not belong there...or even in htaccess for that matter, as it's then taxing the web server process. I've had fail2ban since I set up my own servers several years ago, and I'm trying to get through Cloudflare yet again as they add new options, change the names of others, change the locations in their menus, and redo their dashboard, each time making it harder to find what I'm looking for.

Speaking of..
Attached is a log analyzer for nginx and Apache.
This allows me to see bots that I didn't normally see before when just reading the logs.

If your nginx/Apache configuration filters out CSS/JS/JPEG/etc., then the numbers coming out of these scripts reflect only hits that result in PHP execution, so they give you a great view into who the scrapers are. :)

I used these scripts to produce the above report and made the Deny from list based on that.
 


but I see 13,000 guests on the XenForo site
While Cloudflare offers many security features to protect websites, including XenForo.com, it's not a silver bullet against all types of bot attacks. Bot swarms can be very sophisticated, and sometimes they can overwhelm even the best defenses. This issue is a global crisis similar to the coronavirus pandemic.
 
Yeah, the question is: how well does the AI Labyrinth work? I haven't seen anyone actually use it. I'd say it sounds like a good idea.
It's very new, and I haven't heard of anyone using it yet either. I have always liked the honeypot idea though, as a bot isn't smart enough to avoid a "special" link, and we can safely say that anything visiting that link shouldn't be doing so, and block them because of it.
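The honeypot idea can be tried locally against an access log. The Python sketch below is purely illustrative and is not how Cloudflare's AI Labyrinth works: it assumes a hidden link pointing at a hypothetical path `/trap-link/` that no human would visit, and a "combined" format log.

```python
import re

# Hypothetical trap URL: linked invisibly and disallowed in robots.txt.
TRAP_PATH = "/trap-link/"

# Captures the client IP and the request path from a combined-format line
REQUEST_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)'
)


def honeypot_visitors(lines, trap_path=TRAP_PATH):
    """Return the sorted set of IPs that requested the trap path."""
    caught = set()
    for line in lines:
        match = REQUEST_RE.match(line)
        if match and match["path"].startswith(trap_path):
            caught.add(match["ip"])
    return sorted(caught)


if __name__ == "__main__":
    import sys

    # Usage: python honeypot.py < access.log
    for ip in honeypot_visitors(sys.stdin):
        print(ip)
```

The resulting addresses could then be fed to whatever does the actual blocking, such as a firewall rule or a fail2ban jail.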

Bot swarms can be very sophisticated, and sometimes they can overwhelm even the best defenses. This issue is a global crisis similar to the coronavirus pandemic.
There has been a really large jump in bots lately. I'd say, over the past few months?

It makes me wonder why they are visiting our forums, though. What purpose do they have in visiting so many threads? These bots, on one of our busiest forums, are digging up threads both recent and from the distant past, which were locked years ago. (We have content back to 2002.)

One useless bit of information. I posted this Cloudflare screenshot elsewhere (staff area of one of our forums) before I started round two of working on whacking these bots.

[Attached image: 1757601633172.webp]

US, UK, Canada, Germany...perhaps Australia...these are indicative of the forums' memberships. The rest? Brazil really stands out here. #2 on the list. That is quite a spike.
 
It's very new, and I haven't heard of anyone using it yet either. I have always liked the honeypot idea though, as a bot isn't smart enough to avoid a "special" link, and we can safely say that anything visiting that link shouldn't be doing so, and block them because of it.
Cloudflare was literally founded as a derivative of "Project Honeypot", so they should probably be familiar with the concept of a honeypot... ;)

There has been a really large jump in bots lately. I'd say, over the past few months?
The rise started last year in my opinion, and it got pretty massive in late spring this year. If you look at this forum, reports about massive amounts of bots started in about June or July this year.
It makes me wonder why they are visiting our forums, though. What purpose do they have in visiting so many threads? These bots, on one of our busiest forums, are digging up threads both recent and from the distant past, which were locked years ago. (We have content back to 2002.)
The common assumption is that these are scrapers trying to feed AI models. Did you really not come across that? It has been written basically everywhere on this forum, left and right. It is even in the title of the thread that you are posting to.
 