Crazy amount of guests

There are a few options, but check out https://iplists.firehol.org/, which can help you integrate blocklists into a custom solution. It is almost always just scrapers and people training AI models. As others have indicated, Cloudflare is very good at handling automated traffic, but there are some things you can do manually if you have your own VPS/root access.
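For the manual route, here is a minimal sketch of what integrating such a list could look like, assuming the plain-text layout the FireHOL lists use (one IP or CIDR per line, `#` for comments). The function names and the sample entries are made up for illustration; a real list would be downloaded from the site above.

```python
import ipaddress


def parse_netset(text):
    """Parse a blocklist: one IP or CIDR per line, '#' starts a comment."""
    networks = []
    for line in text.splitlines():
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue  # blank line or pure comment
        networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


# Hypothetical sample using documentation address ranges, not real offenders.
SAMPLE = """\
# sample blocklist
192.0.2.0/24
198.51.100.7
"""

blocklist = parse_netset(SAMPLE)
print(is_blocked("192.0.2.55", blocklist))   # inside the /24
print(is_blocked("203.0.113.9", blocklist))  # not listed
```

A linear scan like this is fine for a few thousand entries. For the bigger lists you would want the lookup in the firewall (ipset or similar) rather than in application code.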
 
One common trend we've seen so far is that all the problems come from members using their iPhones on Safari. Nobody else has a problem.

These users are probably on Apple's iCloud Private Relay, which is basically a VPN but limited to Safari on Apple devices. I see a lot of my own users on it as well. IP Threat monitor has an option to let those pass and, as the IP lists are publicly available, it should be possible to integrate something like that in CF, too.
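A rough sketch of the "let those pass" idea, assuming the relay egress ranges come as a CSV with the CIDR in the first column (that is how I understand Apple's published list to be laid out; check before relying on it). All names and sample ranges here are hypothetical, and this is not how IP Threat monitor itself does it.

```python
import csv
import io
import ipaddress


def load_ranges_from_csv(csv_text):
    """Read networks from the first column of a CSV (assumed layout)."""
    ranges = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row and row[0].strip():
            ranges.append(ipaddress.ip_network(row[0].strip(), strict=False))
    return ranges


def in_any(ip, networks):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def should_block(ip, blocklist, relay_ranges):
    """Relay addresses pass even if a blocklist also covers them."""
    if in_any(ip, relay_ranges):
        return False
    return in_any(ip, blocklist)


# Hypothetical sample rows (documentation ranges, not Apple's real ones).
RELAY_CSV = """\
192.0.2.0/28,US,US-CA,EXAMPLE,
2001:db8:1234::/64,DE,DE-BE,EXAMPLE,
"""
relay_ranges = load_ranges_from_csv(RELAY_CSV)
blocklist = [ipaddress.ip_network("192.0.2.0/24")]

print(should_block("192.0.2.5", blocklist, relay_ranges))    # relay address, passes
print(should_block("192.0.2.200", blocklist, relay_ranges))  # blocklisted, not relay
```

The design choice is simply that the allowlist is checked first, so a broad blocklist entry cannot lock out relay users by accident.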
 
I think one of the biggest issues not being discussed here is: how much traffic does AI send a site nowadays? If you block the known entities from recommending your site because they can't access it, are you losing out?

Just putting this out there, which is why I have all AI bots set to allow on CF: a lot of people are now turning to AI in place of Google. I get better answers asking Claude or ChatGPT than I often do asking Google. You ask them a specific question and there are no nonsense listings in return, just what fits your question.
 
I think one of the biggest issues not being discussed here is: how much traffic does AI send a site nowadays? If you block the known entities from recommending your site because they can't access it, are you losing out?
Cloudflare gathers stats on this in the form of crawl-to-refer ratios. Here are their stats from the past week:

[Attached image: Cloudflare crawl-to-refer ratio stats, 1778574836212.webp]

So Anthropic and OpenAI are pretty horrendous...
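For anyone unsure what that ratio means: as I understand it, it is simply how many pages a bot crawls for every visitor it refers back. A tiny sketch with made-up numbers (these are not Cloudflare's figures):

```python
def crawl_to_refer_ratio(crawl_requests, referral_visits):
    """Pages crawled per visitor referred back; infinite if nothing is referred."""
    if referral_visits == 0:
        return float("inf")
    return crawl_requests / referral_visits


# Hypothetical counts for two imaginary crawlers over the same period.
print(crawl_to_refer_ratio(50_000, 10))     # heavy crawler, almost no referrals
print(crawl_to_refer_ratio(50_000, 5_000))  # same crawl volume, many referrals
```

The higher the number, the more a crawler takes compared to what it sends back.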
 
I think one of the biggest issues not being discussed here is: how much traffic does AI send a site nowadays? If you block the known entities from recommending your site because they can't access it, are you losing out?

Just putting this out there, which is why I have all AI bots set to allow on CF: a lot of people are now turning to AI in place of Google. I get better answers asking Claude or ChatGPT than I often do asking Google. You ask them a specific question and there are no nonsense listings in return, just what fits your question.
The best strategy surely depends on how exclusive and how high-quality your forum's content is and on how eager you are for growth or visitors. In general, AI providers grab your content and give nothing back in exchange. Some are worse than others, but overall you will lose: if an AI has grabbed your content and serves it, people asking questions have no need to visit your forum, even if there is a backlink.

Personally, I do not care too much about new users, and I clearly want to protect my forum content, as it is pretty high quality and a lot of it is exclusive and cannot be found anywhere else. How silly would it be to give that advantage away, and for free at that? On top of that, it would enable the AIs to grab all kinds of personal information that forum users may post, with all the potentially negative effects this may have.

So I do block scrapers and most AI agents. Depending on the AI and how well structured its bots are, it is sometimes possible to let the search bot through while blocking the training bot. When in doubt, I'd rather block completely. Furthermore, I've set hard limits on what guests can see: they have never been able to see some parts of the forum, but most of it was freely accessible. I changed this a while ago, and now a guest can only see the first post of a thread, and on top of that even more areas of the forum are not accessible to guests.
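A minimal sketch of that search-bot versus training-bot split, matching on user-agent tokens. The tokens listed are the ones OpenAI and Anthropic have documented as far as I know, but the table and the `decide` function are only an illustration, so verify the names against the vendors' current docs. User agents can be spoofed, so this only covers bots that identify themselves honestly.

```python
# Assumed policy table; check the token names against current vendor docs.
SEARCH_BOT_TOKENS = ("OAI-SearchBot", "Claude-SearchBot")
TRAINING_BOT_TOKENS = ("GPTBot", "ClaudeBot")


def decide(user_agent):
    """Return 'allow', 'block', or 'default' for a request's User-Agent."""
    ua = user_agent.lower()
    if any(token.lower() in ua for token in SEARCH_BOT_TOKENS):
        return "allow"    # search crawler: let it index
    if any(token.lower() in ua for token in TRAINING_BOT_TOKENS):
        return "block"    # training crawler: keep it out
    return "default"      # everything else follows the normal rules


# Simplified, hypothetical user-agent strings.
print(decide("Mozilla/5.0 (compatible; GPTBot/1.0)"))         # block
print(decide("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))  # allow
print(decide("Mozilla/5.0 (iPhone; CPU iPhone OS) Safari"))   # default
```

For the "when in doubt, block" approach you would add any unclear vendor's tokens to the training list instead of leaving them at the default.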

So far this has tremendously boosted registrations and, at least in the first months, had no negative impact on search engine ranking. I have not checked this recently because I don't care too much. I have a working community and I gain new users; what else could I want?

Surely, running a non-commercial forum helps, but I am part of the "lock them out as well as possible" cohort.
 
Apart from Cloudflare (cost-effective), is there any other way someone has come across for blocking AI guest spam?
Did you read the thread you are posting to? There are loads of options mentioned including experiences with them.

Funnily enough, not too long ago you claimed in this very thread that CF would easily solve the bot issue:

Go for Cloudflare free DNS and set up your site with Security Rules.
Give a managed challenge to all unverified bots (90% of your issue will be resolved).

and were advised to read the thread properly to understand the issue and the options available:

Maybe you should have read more than just the opening post of the thread, and also the 11 pages following it so far. Then you'd have realized that your misguided "advice" does not work at all.

It seems you still have not done so (plus, obviously, "SPAM" means active messaging, something AI scraping bots don't do, so if you have a guest SPAM problem you could simply disallow guest posting on your forums).

So did your advice not work out as you claimed earlier?
 
In general, AI providers grab your content and give nothing back in exchange. Some are worse than others, but overall you will lose: if an AI has grabbed your content and serves it, people asking questions have no need to visit your forum, even if there is a backlink.
I agree with all of this.

Side note: I did see that Cloudflare is adding a service (I believe it's in a closed beta) where you can offer your data to AI scrapers for a payment. I guess one of the HTTP return codes, 402 Payment Required, is the mechanism they use, and from there they've found a way to implement payment.

But I agree about the moral implications of AI. They are essentially stealing all of our members' content, without permission, without license, without payment, to fuel their AI arms race, which essentially pads the profits of mega-corporations and keeps shareholders happy. The benefits are for them, not for society or end users like us.

In the fields I'm interested in, I have yet to see any AI answer be accurate. Some are so wildly off base that they are a joke. I refuse to use AI. I removed any AI apps from my devices. I turn it off in software where I can. I'm a grown-ass adult who learned how to think, write, research, do my own work, etc. on my own without needing assistance from a machine that is inherently faulty. I'll do things by hand, rather than rely on AI crutches.

Not only that, feeding AI scrapers posts from forums, Reddit, etc. is such a fatally flawed concept that garbage in/garbage out really applies here in full force. We don't see misinformation or inaccuracies as often here in XF's support forums, but go out to the general-interest forums. Someone asks a question. They might get ten answers to their question...most of them wild-assed guesses or completely incorrect information. One post might get it right. (Automotive forums are a good example of this, especially when I have the same question and get so many of the same inaccurate responses on completely different forums.) AI has no critical thinking skills, no common sense, no decision-making ability: it's a machine that spits out the garbage that it's fed. Why should I trust it when so much garbage is being fed into it?

So yeah...I'm blocking AI scrapers on forums. Any of them that I can. And if I can keep our content out of the hands of companies doing this through questionable and dishonest means, I'm going to try anything possible to stop them. Ideally I would require everyone to pass a CF challenge, but whitelist them once they log in. But CF itself is flawed to the point that even when I put such things in place, CF finds some other way, beyond my control, to block innocent legitimate users. Which is what is happening right now.
 
Anubis is great if you're not using CF; they're both reverse proxies, and you really only need one. CF, though, sits at your DNS, so anything you block at CF uses zero of your server resources. Anubis, on the other hand, sits on your server, and whilst it reduces server load significantly, attacks still use some of your server resources until they are blocked. CF also handles caching, SSL, the list goes on and on.
I think if I were to start tinkering with Anubis, I'd rent a separate VPS to serve as a dedicated proxy to keep the IP of the backend obscured. I'm sure it can be done relatively cheaply, around $10/month.
 
Just reporting in.
Fail2ban and some ASN blocks (plus a PHP script I use to identify them easily from our Apache logs) are still kicking ass over here.
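This is not the PHP script mentioned above, just a Python sketch of the same general idea: count hits per network in an Apache access log to see which ranges are hammering the site. It assumes the common/combined log format and groups IPv4 addresses by /24 as a crude stand-in for an ASN lookup (a real ASN mapping needs an external database). The names, the threshold and the sample lines are all hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Client address, identd, user, [timestamp], "request", status, size.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+')


def top_networks(lines, prefix=24, min_hits=3):
    """Return (network, hits) pairs for IPv4 networks with at least min_hits."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not an access-log line in the expected format
        try:
            addr = ipaddress.ip_address(match.group(1))
        except ValueError:
            continue  # hostname logged instead of an IP
        if addr.version != 4:
            continue  # IPv6 would need a different prefix length
        hits[ipaddress.ip_network(f"{addr}/{prefix}", strict=False)] += 1
    return [(str(net), n) for net, n in hits.most_common() if n >= min_hits]


# Hypothetical log lines using documentation address ranges.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/May/2025:10:00:00 +0000] "GET /threads/1/ HTTP/1.1" 200 5120 "-" "ExampleBot/1.0"',
    '192.0.2.11 - - [10/May/2025:10:00:01 +0000] "GET /threads/2/ HTTP/1.1" 200 4096 "-" "ExampleBot/1.0"',
    '192.0.2.12 - - [10/May/2025:10:00:02 +0000] "GET /threads/3/ HTTP/1.1" 200 6144 "-" "ExampleBot/1.0"',
    '198.51.100.1 - - [10/May/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

for network, count in top_networks(SAMPLE_LOG):
    print(network, count)
```

The networks that come out on top are the candidates to look up and, if they belong to a hosting provider or scraper, block as a range.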

We never cross 10k guests anymore. Xenforo.com is using Cloudflare, and it sounds like they haven't taken any tuning tips from this thread. They can peak at 2-2.75x our guest count; otherwise the difference is that we have roughly 10% fewer guests at any given moment. I'm assuming they are still running a pretty stock configuration.

We are very close to implementing Anubis. Anubis can refer bans to Fail2ban, which in turn refers them to Linux's iptables, a very fast mechanism.
We may modify the Anubis integration so that it refers directly to iptables for maximum block speed.

[Attached image: 1778607069113.webp]

I think the combination will be powerful.
It would be awesome if we could push the IP ban up to the provider's firewall. But that's a job for later. The move from AWS to Hetzner is about to save us a lot of money on bandwidth, so maybe we don't need to go that far.
 
How silly would it be to give that advantage away, and for free at that? On top of that, it would enable the AIs to grab all kinds of personal information that forum users may post, with all the potentially negative effects this may have.
Don't we all do this now with search bots? We don't control what they do with our sites' information, and we don't control how they use our data or how much or how little they recommend our site. I agree that nobody wants their info with the majority of AI bots. But the main players? The handful of big dogs?
Apart from Cloudflare (cost-effective), is there any other way someone has come across for blocking AI guest spam?
Guest AI spam is super easy: place all guest posts into moderation AND ensure your server is locked down to CF IPs. That was my biggest issue, which I found through this thread. I was applying managed challenges, but they were bypassing CF and hitting my IP directly, so they could do whatever they wanted. I have guest posting on, and I get maybe one human spam post daily, no AI spam. XF has options to stop most of this; you just have to use them correctly.
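A sketch of what "locked down to CF IPs" can look like at the firewall, assuming iptables and web traffic on ports 80 and 443. The three ranges below are only a sample from Cloudflare's published IPv4 list; the full, current list should be taken from Cloudflare's own IP ranges page. The helper name is made up, and the script only prints the rules rather than applying them.

```python
import ipaddress

# Sample only: a few of Cloudflare's published IPv4 ranges.
CLOUDFLARE_SAMPLE = ["173.245.48.0/20", "103.21.244.0/22", "104.16.0.0/13"]


def origin_lockdown_rules(cidrs, ports=(80, 443)):
    """Build iptables commands: accept web traffic from the given ranges, drop the rest."""
    port_list = ",".join(str(p) for p in ports)
    rules = []
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)  # raises ValueError on a malformed range
        if net.version != 4:
            continue  # IPv6 ranges would need ip6tables instead
        rules.append(
            f"iptables -A INPUT -p tcp -m multiport --dports {port_list} "
            f"-s {net} -j ACCEPT"
        )
    # Anything else reaching the web ports did not come through the proxy.
    rules.append(
        f"iptables -A INPUT -p tcp -m multiport --dports {port_list} -j DROP"
    )
    return rules


for rule in origin_lockdown_rules(CLOUDFLARE_SAMPLE):
    print(rule)
```

Order matters here: the ACCEPT rules have to come before the final DROP, and SSH and other ports are untouched because the rules only match the web ports.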
I think if I were to start tinkering with Anubis, I'd rent a separate VPS to serve as a dedicated proxy to keep the IP of the backend obscured. I'm sure it can be done relatively cheaply, around $10/month.
Bandwidth is what becomes a problem with this for larger sites. With a VPS acting as a proxy for the entire site, everything passes through it, so you will have huge bandwidth to the proxy and then again between the proxy and the server. It probably works for a small site.
 
Don't we all do this now with search bots?
Not really. If the internet were a library, the search bots would be what feeds the library's index so that visitors can find the right book. The AI bots, by contrast, feed a person sitting in front of the library who answers people's questions directly because he has read all the books in the library, and who charges money for it (while not even having bought or paid for the books himself).
We don't control what they do with our sites' information, and we don't control how they use our data or how much or how little they recommend our site. I agree that nobody wants their info with the majority of AI bots. But the main players? The handful of big dogs?
I don't care too much whether these are big dogs or small dogs. Only allowing big dogs while locking out small dogs (so using "size" as a criterion) has only one effect: it creates and feeds monopolies, and this way we'll end up in the same situation with Google being by far the most (and often only) relevant search engine.

In the end it is a question of trust whom I let read my forums. With search engines that was relatively easy, all the more so as the potential damage was small. With AI bots the potential damage is very high and there is barely any company that can be trusted. Clearly not the anonymous bots, but not the ones that identify themselves either.
 
The AI bots, by contrast, feed a person sitting in front of the library who answers people's questions directly because he has read all the books in the library, and who charges money for it (while not even having bought or paid for the books himself).
Just gonna say, you should patent that analogy. Pretty good. The only spanner I will throw into that is that search engines ARE selling your information, and they use it to sell ads and placements. Two business models going about it differently, is all.
this way we'll end up in the same situation with Google being by far the most (and often only) relevant search engine.
JMO, I thought Google was the only relevant search engine already. :)
 
I will throw into that is that search engines ARE selling your information, and they use it to sell ads and placements. Two business models going about it differently, is all.
In fact there was a bit of an analogy to the current situation a couple of years ago: publishers sued Google back then because Google was aggregating the content of their sites into its own news.google.com offering without paying for the content. This led to some court ruling and to Google stopping this practice and/or having to pay for the content in some countries.

Basically, what Google did back then was cross the border from symbiosis to parasitism. This was made even worse by the monopoly Google had in search: it was impossible for the publishers to simply block Google, as they would have made themselves invisible by doing so.

Today, this is even worse, as the AI models are not only mirroring the content but outright creating clones: from books and novels in the style of a certain author (even using his plots), through the voices of famous actors, to deepfakes.

The latest thing that I read (I have not checked whether it is true) was that Anthropic would buy tons of rare and obscure used physical books in areas of special interest (preferably as many of the available copies as possible), scan the content, integrate it into their model and destroy the physical copies to gain exclusivity over the content. If true, this would indeed be yet another level.
 