disallow AI training using the forum's content

GeorgeS

Hello,


Since the AI hype started, I have noticed more than once that companies crawl our forum in a way that works almost like a DDoS. We are a small hobby project, and those requests (which often do not follow the robots.txt directives) eat up our bandwidth.

Amaz*n, Faceb**k, Huaw*i - I could block them more or less successfully, but for a few days now we have been getting a new wave of requests (yesterday they crossed the 10,000-query mark, so I had to take the forum offline for the night). What frustrated me most was that at least one of them ignored the robots.txt settings.

The requests originate from two American ISPs (we host almost exclusively German content). The user-agent string looks like that of a normal browser, e.g.

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Safari/537.36

but in such volume it clearly is not. Any idea how to stop it?

There is no clear identification I could filter on. The IPs come from five or six ranges, like 277.1xx.xxx.xxx (an invalid example, I know, just to show what I can identify), and I don't want to blacklist those whole ranges.
 
Bytespider went crazy for a while, PetalBot too, and at one point I had to block Applebot because it was everywhere in the logs. Then there are smaller companies like AhrefsBot that will hammer your server at random. You have to decide whether you get any value from these bots. Setting a crawl rate is probably a safe option, assuming the bots obey it. After I moved to a more powerful server, I removed most of the blocks except for a few. In the end the content is provided by the community... if it were a blog I would be more worried about it. Forums can be full of garbage and inaccurate information - good luck to the AI companies separating the good stuff from the bad.
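On the crawl rate: a minimal robots.txt sketch below. Crawl-delay is a non-standard directive - Bing and Yandex honor it, Googlebot and most AI crawlers ignore it - and the bot names are just the ones mentioned in this thread.

Code:
# Ask well-behaved crawlers to slow down (Crawl-delay is non-standard;
# Bing and Yandex honor it, Googlebot and most AI crawlers do not)
User-agent: *
Crawl-delay: 10

# Shut out the crawlers named in this thread entirely
User-agent: Bytespider
User-agent: PetalBot
User-agent: AhrefsBot
Disallow: /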
 
I cannot find a definitive list of IPs used by ByteDance, and they seem to use Amazon AWS for their spider, which means you would end up blocking non-Bytespider IPs as well. In almost all cases that should not matter.
 
Use Cloudflare rate-limiting rules to catch the abusive requests.
For example, set a maximum of 30 requests per minute for all PHP requests.
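Roughly like this in the rule builder - the matching expression uses Cloudflare's Rules language, the dashboard labels may differ slightly by plan, and the thresholds are just the example values above:

Code:
# Cloudflare rate-limiting rule (Security -> WAF -> Rate limiting rules)
# If incoming requests match:
(http.request.uri.path contains ".php")
# Then: counting per IP, 30 requests per 60 seconds -> action: Block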
 
Thank you for all the replies.

I don't mind if the content of the database shows up in Google or Bing, but I refuse to give bandwidth to someone from China (with IPs located in the US) who can't make real use of what we provide, and who behaves like a bot but doesn't identify itself in any way.

For bots that do not behave, I use the blocking rules that are possible in .htaccess, but this time I really had to block the IP ranges that created the extreme traffic.
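For anyone who wants to do the same, a minimal .htaccess sketch for Apache 2.4 - the bot names are only examples and the CIDR ranges are documentation placeholders, not the ranges I actually blocked:

Code:
# Refuse known bot user agents (names are examples, not a complete list)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|PetalBot|Amazonbot) [NC]
RewriteRule .* - [F,L]

# Block whole IP ranges (placeholder documentation ranges)
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.0/24
</RequireAll>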

@James: I will look into the Cloudflare option, but I needed a fast solution this time.
 
Well, he wants to block spiders owned by Chinese companies that may be operating from the USA and elsewhere, so region blocking is not an option for him.
 
All the legit AI bots identify themselves, so the OP cannot say with any certainty this was an AI bot. More likely than not, it's a run-of-the-mill site scraper.

In addition to robots.txt, you should also create an ai.txt file in the base directory and add this to it:

Code:
User-Agent: *
Disallow: /
Disallow: *

Even better is to use a dynamic robots.txt generator, which logs hits, IP addresses, and user agents. It can serve custom robots.txt files based on the crawler, as well as a default robots.txt for unknowns.
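The idea is simple enough to sketch in PHP if you want to roll your own - rewrite /robots.txt to a script like the one below. The bot list, log path, and default policy here are assumptions, not the behavior of any particular product:

Code:
<?php
// robots.php - serve a per-crawler robots.txt and log who fetched it.
// Rewrite /robots.txt to this script, e.g. in .htaccess:
//   RewriteRule ^robots\.txt$ robots.php [L]

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$ip = $_SERVER['REMOTE_ADDR'] ?? '';

// Log every fetch (example path - keep the log outside the web root)
file_put_contents(
    __DIR__ . '/../logs/robots.log',
    date('c') . "\t" . $ip . "\t" . $ua . "\n",
    FILE_APPEND
);

header('Content-Type: text/plain; charset=utf-8');

// Known AI crawlers get shut out completely (example list, not exhaustive)
if (preg_match('/GPTBot|CCBot|Bytespider|PetalBot|ClaudeBot/i', $ua)) {
    echo "User-agent: *\nDisallow: /\n";
} else {
    // Default policy for everyone else
    echo "User-agent: *\nCrawl-delay: 10\n";
}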
 
Worth mentioning here that Apple confirmed this week that they are using Applebot for their own AI training.

ai.txt seems to be only a proposed standard. Do any existing AI companies actually support it?
 
The first two appear to be IPs belonging to Alibaba, in Hong Kong and Seoul; the third would come from Japan, also from Alibaba Cloud.
[Attachment: IMG_2569.webp]

The bots are being picked up as Singapore by Imunify360

Edit: I've just double-checked, and it's actually an AWS range they are using that is in Singapore.

[Attachment: IMG_2570.webp]
 
Thanks, Rusty Snippets -


I will give this a try, and will release Huawei from the PHP die() block for a few days.

But I suspect that companies that try to cloak their access will not follow such rules. My impression is that they are striving for more training data for their AIs, so they ignore anything that would or could block their way.
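In case it helps anyone, the die() block is roughly of this kind - matching by user agent is only for illustration, and the bot names are examples rather than my actual list:

Code:
<?php
// Early in a common include: refuse requests from unwanted crawlers.
// The bot list is an example, not the actual block list used here.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match('/Bytespider|PetalBot|Amazonbot/i', $ua)) {
    http_response_code(403);
    die('Crawling is not permitted.');
}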
 