Crazy amount of guests

For anyone still fighting this issue (and for the non-believers): fix it at the source, as I outlined on 31 Mar (https://xenforo.com/community/threads/crazy-amount-of-guests.233649/page-17#post-1777817) and as above, and the problem is solved.

See the results below: the rubbish is gone, requests are down, caching is up and now steady at around 65% daily, and served data is down. IT WORKS! The server is locked down to Cloudflare's IPs only, so everything has to go through Cloudflare; there is no direct IP access. That took my traffic from 1M daily uniques down to my real traffic, around 110k-140k daily now. Very steady, with no more chasing IPs, ASNs, etc. to ban. Tuned server, locked to CF, /search/ behind a managed challenge.
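For anyone wanting to replicate the lock-down, the gist is an allow-list of Cloudflare's published IP ranges at the firewall or web server, so requests that bypass the proxy never reach the forum at all. Here is a minimal sketch of the check in Python; the ranges below are a partial, possibly outdated snapshot, so pull the current list from cloudflare.com/ips before relying on it:

```python
import ipaddress

# Partial snapshot of Cloudflare's published IPv4 ranges (assumption:
# may be incomplete/outdated; fetch the current list from
# https://www.cloudflare.com/ips-v4 before using in production)
CLOUDFLARE_RANGES = [ipaddress.ip_network(c) for c in (
    "173.245.48.0/20",
    "103.21.244.0/22",
    "141.101.64.0/18",
    "104.16.0.0/13",
    "172.64.0.0/13",
)]

def from_cloudflare(ip: str) -> bool:
    """True if the connecting IP belongs to one of the allow-listed ranges."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUDFLARE_RANGES)
```

In practice you would enforce this in the firewall (or with nginx allow/deny rules) rather than in application code, and refresh the list on a schedule, since Cloudflare does change its ranges from time to time.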

That is a huge improvement in my eyes.

This is the way. Welcome back to the ways of sanity!

The sooner people wise up to this mass AI-scraping garbage, the better. I think the best way to put it is like this: it's your data, on your servers, and it shouldn't be a smorgasbord for abusive AI botnets that may (or will) profit off a community you've shaped and built over the years, driving away future growth.

I'm all for legitimate indexing of content. However, I'm not for AI entities mass-ingesting content and completely ignoring rate limits via what is effectively a botnet. That's not even counting the fact that these AI networks don't give a source or link-back to where the content was ingested from.



In other news, it looks like these forums are also being bombarded with scrapers tonight...
 
I think the best way to put it is like this: it's your data, on your servers, and it shouldn't be a smorgasbord for abusive AI botnets that may (or will) profit off a community you've shaped and built over the years, driving away future growth.
The thing is: with the current situation, even as a forum user it is worth thinking about whether a forum is somewhat protected against AI scrapers, and whether you want your posts to become part of a commercial AI system. You may not want that, either for reasons of privacy or so as not to feed commercial systems (that will then charge you to use them) with your knowledge for free. So one might limit what one posts on an unprotected forum, or possibly stop posting altogether on such a forum.

So basically, protection against AI scraping bots has become part of responsible forum administration today, out of respect for the users and to protect their privacy and content. I think any forum admin should take this into consideration, independently of performance or other reasons. It could even be used as a marketing point for one's own forum towards the users.

Even a forum like the XF forums here contains loads of more or less private data that people post over time. I do not feel comfortable with the ignorance that XF shows towards the scraping bots on this forum.
 

You're creating problems you can't overcome!
If your forum is public, bots will scan it and use it to train their AI, and there's nothing you can do about it, because it's exactly like a real user learning something from your forum and then applying it to their own knowledge.
 
If your forum is public, bots will scan it and use it to train their AI, and there's nothing you can do about it
Obviously wrong, which you would know if you had read the thread you are posting to. On top of that, you can limit the amount of content that guests are able to see with various methods, plus (only effective in theory or in hindsight) you can add to your TOS that scraping the forum content and using it in AI models is not allowed.
because it's exactly like a real user learning something from your forum and then applying it to their own knowledge.
I've not had real users coming to my forum in thousands of parallel sessions, hiding their identity using residential proxies, scraping my forum at scale and then offering the knowledge they gained from my forum publicly and commercially at scale, exploiting the privacy of its users. So no, it is clearly not the same thing.
 
The specific expression for search is: (http.request.uri.path contains "/search/" and not http.cookie contains "xf_user=")
Maybe it is a dumb question or goes over the top: the "xf_user" cookie is used by every XenForo forum, so do you specifically check whether it is valid for your forum domain (and possibly what its content is), or just whether it exists? Otherwise, at least in theory, one could simply create a cookie with that name and bypass any restrictions, or anyone who is logged in to any XenForo forum would pass the check.

As it seems to work for you, everything should be fine, at least for the moment. I am just wondering how easy it would be to bypass or trick the checks.
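To make the concern concrete: the expression quoted above is a plain substring test on the Cookie header, not a validity check. A rough Python model of the matching (my own approximation, not Cloudflare's actual engine) shows why existence alone is enough to pass:

```python
def rule_matches(path: str, cookie_header: str) -> bool:
    """Approximates the Cloudflare rule:
    (http.request.uri.path contains "/search/"
     and not http.cookie contains "xf_user=")
    True means the request gets challenged."""
    return "/search/" in path and "xf_user=" not in cookie_header

# A cookieless guest hitting search is challenged:
rule_matches("/search/12345/", "")                  # True
# Any client that sends *any* xf_user cookie, valid or not, slips past:
rule_matches("/search/12345/", "xf_user=anything")  # False
```

So the check is indeed existence-only: a forger needs nothing more than a cookie with the right name. Validating the cookie's contents would have to happen at the origin, where XenForo can actually verify the session.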
 
At the moment, most of the bad scrapers aren't smart enough to add those cookies in. They often don't even use valid referrers for their requests.
But they could inject xf_user cookies in the future. Or they could even register a forum account and appear even more like a legit user, as spammers do at the moment.

It becomes an escalating war. For each indicator that they're not a legit user we can add rules. They'll adapt and we'll need new mechanisms to keep them at bay. At some point, the cost of scraping may rise to the point where it's not economical for bad scrapers to continue scraping, but their ability to adapt will also improve as they use agents to evolve their tools.

Cloudflare have a new blog post today which touches on whether sites should filter based on whether a request comes from a bot or a human, or whether we should instead try to determine intent. It's a long read, but it goes to the heart of the issues we're seeing in this thread and relates to providing an open Web for everyone.

They also have a bunch of stats on the 'legit scrapers', so you can see the crawl-to-refer ratio for some of the AI services.
For some businesses who provide content (e.g. medical journals) they're looking into mechanisms to monetise scraping: pay-per-crawl, if you like, so that they have some control over, and recompense from, the legit scrapers.
 
Cloudflare have a new blog post today which touches on whether sites should filter based on whether a request comes from a bot or a human, or whether we should instead try to determine intent. It's a long read, but it goes to the heart of the issues we're seeing in this thread and relates to providing an open Web for everyone.
It is an interesting read, but one should take it with loads of grains of salt. The issue with Cloudflare is that they clearly see the problems, they see them at scale, they employ clever people, and thus they are able to predict the future to a degree and to provide technical solutions. But they often (if not most of the time) have a very limited perspective, shaped by their business model and by trying to be the jack of all trades. They are not: they are as much part of the problem as they are part of the solution. And often the people writing these articles take a lot of things as a given which aren't, for many reasons: because they are too young, because they are too tech-friendly, and because they follow the perspective that Cloudflare as a company and business sets.

What this guy basically says is:
• it does not matter whether a bot or a human visits your site; what matters are the intentions. He gives the example of someone booking a concert ticket via an AI bot.

He loses me here already, for a simple reason: I run a forum for humans. AI use is explicitly not allowed on my forum b/c it serves exchange between humans. I neither want nor need AI bots on my forum. End of story.

Everything he makes up as an argument does not count for me. I don't care about advertising, revenue, growth, or anything that might be of interest to a commercial platform or a marketplace/eCommerce platform. I simply want humans in (the right ones, of course) and bots out.

The whole technical storm that he conjures up ignores the basics of what he is talking about. The web started with things like Lynx; it was text-based. Even today this is relevant, as e.g. blind persons use text-to-speech to access web pages. He never seems to have heard of these things and wants webpages to become AI-bot-friendly. Well, they already would be, if marketing people had not decided decades ago that the web had to become more fancy. In the olden days a webpage was something that sat on a server; it had barely even pictures, and it was small. Today a webpage can easily be over 50MB, 90% of that is noise, and large parts of the content and the code are pulled in from countless servers throughout the web. What once was simple has become complex, yet the amount of information transferred has stayed the same, or rather it has degraded. And to serve all that ******** you need CDNs like Cloudflare.
Search engines, the first bots on the net, were developed to deal with webpages that were made for humans. Over time that changed, and webpages are now made for bots: for Google, to achieve ranking. The whole SEO industry makes a living from that, and humans suffer b/c webpages are now full of ******** that does not in any way serve humans but exists solely in the hope of improving the SEO ranking. It is obvious that something went badly wrong at some point, and not just once.

But now he wants webpages to become AI-friendly. Why? The AI bros could easily take over what's left of Zuckerberg's second-life platform and create a web for AIs if they like. Nothing against it, but I won't participate. Leave me alone with your AI BS: I run a platform for humans and don't care about your AI, so leave me alone.

Then he moves on to a pretty relevant and interesting topic: authentication and reputation management. But again with a limited perspective, and so, due to that perspective, he misses the point once more. He focuses on what Cloudflare has done and tries to achieve, but I don't care about the interests of CF. He quietly tries to push CF's initiative for bot authentication, which is an interesting idea but in the end something that a) is not accepted in the market and b) serves the interest of CF in stabilising their de-facto monopoly.

So in my eyes it is no doubt an interesting read, but of not much use. The most interesting aspect of the article comes towards the end, but it is only mentioned, not solved: the consequence of the ongoing changes through AI bots will be that the web becomes less open and less accessible, b/c page owners raise barriers to protect their platforms, and they are forced to do so. This (and at this point I agree with him) cannot be desirable. But it will happen if there is no regulating factor for AI. The AI bros do not respect any conventions or unenforced barriers, so without hard regulation the foreseeable outcome is a web that is far more closed than today's.

We can count ourselves lucky that we run "just" forums: on other social media platforms the bot issue has become so bad, and the disinformation through bot postings and AI slop so high (namely on Twitter/X), that there is now an initiative within the EU for a closed social platform that is only accessible to humans, and only after verification via passport. Not my favourite solution (but probably one that works). Someone has to tell those AI guys that their behaviour is unacceptable, but people are too much in a gold rush, too eager for money, to do that.


They also have a bunch of stats on the 'legit scrapers', so you can see the crawl-to-refer ratio for some of the AI services.
For some businesses who provide content (e.g. medical journals) they're looking into mechanisms to monetise scraping: pay-per-crawl, if you like, so that they have some control over, and recompense from, the legit scrapers.
The bot radar of CF has its issues: they claim about 30% of traffic comes from bots, whereas we see vastly different numbers on our forums. They only count bots that identify themselves as bots, while we suffer from immense anonymous scraping traffic from data centres and residential proxies that they don't take into account. Also, they count all bot traffic equally: Google's search engine bot counts the same as any scraper training an LLM, and I've not found a way to apply a useful filter to their data.

What is depressing is that a whopping 75% of AI crawler requests get a code 200 response, according to the Cloudflare radar. But again: this includes Google bot traffic, for whatever reason, so in the end it is a random number. Also, if you trust their data, less than 3% of webpages regulate AI bot access through robots.txt (3xx/10,000), so it seems the AI-bot issue is still unknown to the majority of people running webpages.

I follow the radar and the blog posts of CF with great interest, but I do not fully trust their data, and I do not trust their initiatives, as they obviously have their own good as the highest priority.
 
@smallwheels AI bots will always be one step ahead of you no matter what you do, so you can only resign yourself.
You can set up blocks on Cloudflare to somewhat limit the consumption of your server resources, but you can never fully prevent AI bots from scanning your site. So the only thing you can do is accept it (the present and future of the web) and resign yourself.
 
Maybe it is a dumb question or goes over the top: the "xf_user" cookie is used by every XenForo forum, so do you specifically check whether it is valid for your forum domain (and possibly what its content is), or just whether it exists?
I don't believe they're that focused; however, you can simply add your URL as an AND condition on the referrer. That is much easier to spoof than a cookie, though. You're thinking that these people are targeting YOU, personally, but they aren't. They are broad-brushing parameters into software and catching whatever falls within those parameters. Targeting individual sites is time-consuming and complex, and, like chasing IPs or ASNs, it becomes a never-ending loop: as they do x, the site does y; now they have to do y, and the site does z. That is not what these people are about. Set and forget is their motive; then they deal with the data they have captured.
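For illustration, the referrer condition would make the quoted expression look something like this (example.com is a placeholder for your own domain; as noted above, the Referer header is trivial to fake, so treat this as a speed bump, not a gate):

```
(http.request.uri.path contains "/search/"
 and not http.cookie contains "xf_user="
 and not http.referer contains "example.com")
```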
 