disallow AI training using the forum's content

GeorgeS

Hello,


Since the AI hype started, I have noticed more than once that companies crawl our forum in a way that works almost like a DDoS. We are a small hobby project, and those requests (which often do not follow the robots.txt directives) eat up our bandwidth.

Amaz*n, Faceb**k, Huaw*i - I could block them more or less successfully, but for a few days now we have been getting a new wave of requests (yesterday they crossed the 10,000-query mark, so I had to take the forum offline for the night). What frustrated me most was that at least one of them ignored the robots.txt settings.

The requests originate from two American ISPs (we host almost exclusively German content). The user-agent string looks like that of a normal browser, e.g.

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.91 Safari/537.36

but in such volume it clearly is not. Any idea how to stop it?

There is no clear identification I could filter on. The IPs come from five or six ranges, like 277.1xx.xxx.xxx (an invalid example, I know, just to show what I can identify), and I don't want to blacklist those whole ranges.
 
Bytespider went crazy for a while, PetalBot too, and at one point I had to block Applebot because it was everywhere in the logs. Then there are smaller companies like AhrefsBot that will hammer your server at random. You have to decide whether you get any value from these bots. Setting a crawl rate is probably a safe option, assuming the bots obey it. After I moved to a more powerful server, I removed most of the blocks except for a few. In the end the content is provided by the community... if it were a blog I would be more worried about it. Forums can be full of garbage and inaccurate information - good luck to the AI companies separating the good stuff from the bad.
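On the crawl rate: a minimal robots.txt sketch below. Crawl-delay is a non-standard directive - Bing and Yandex honor it, Googlebot and most AI crawlers ignore it - and the bot names are just the ones mentioned in this thread.

Code:
# Ask well-behaved crawlers to slow down (Crawl-delay is non-standard;
# Bing and Yandex honor it, Googlebot and most AI crawlers do not)
User-agent: *
Crawl-delay: 10

# Shut out the crawlers named in this thread entirely
User-agent: Bytespider
User-agent: PetalBot
User-agent: AhrefsBot
Disallow: /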
 
I cannot find a definitive list of IPs used by ByteDance, and they seem to use Amazon AWS for their spider, which means you would end up blocking non-Bytespider IPs as well. In almost all cases that should not matter.
 
Use Cloudflare rate-limiting rules to catch the abusive requests.
For example, set a maximum of 30 requests per minute for all PHP requests.
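Roughly like this in the rule builder - the matching expression uses Cloudflare's Rules language, the dashboard labels may differ slightly by plan, and the thresholds are just the example values above:

Code:
# Cloudflare rate-limiting rule (Security -> WAF -> Rate limiting rules)
# If incoming requests match:
(http.request.uri.path contains ".php")
# Then: counting per IP, 30 requests per 60 seconds -> action: Block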
 
Thank you for all the replies.

I don't mind if the content of the database shows up in Google or Bing, but I refuse to give bandwidth to someone from China (with IPs located in the US) who can't make real use of what we provide, and who behaves like a bot but doesn't identify itself in any way.

For bots that do not behave, I use the blocking rules that are possible in .htaccess, but this time I really had to block the IP ranges that created the extreme traffic.
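For anyone who wants to do the same, a minimal .htaccess sketch for Apache 2.4 - the bot names are only examples and the CIDR ranges are documentation placeholders, not the ranges I actually blocked:

Code:
# Refuse known bot user agents (names are examples, not a complete list)
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Bytespider|PetalBot|Amazonbot) [NC]
RewriteRule .* - [F,L]

# Block whole IP ranges (placeholder documentation ranges)
<RequireAll>
    Require all granted
    Require not ip 192.0.2.0/24
    Require not ip 198.51.100.0/24
</RequireAll>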

@James: I will look into the Cloudflare option, but I needed a fast solution this time.
 
Well, he wants to block spiders owned by Chinese companies that may be operating from the USA and elsewhere, so region blocking is not an option for him.
 
All the legit AI bots identify themselves, so the OP cannot say with any certainty this was an AI bot. More likely than not, it's a run-of-the-mill site scraper.

In addition to robots.txt, you should also create an ai.txt file in the base directory and add this to it:

Code:
User-Agent: *
Disallow: /
Disallow: *

Even better is to use a dynamic robots.txt generator, which logs hits, IP addresses, and user agents. It can serve custom robots.txt files based on the crawler, as well as a default robots.txt for unknowns.
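The idea is simple enough to sketch in PHP if you want to roll your own - rewrite /robots.txt to a script like the one below. The bot list, log path, and default policy here are assumptions, not the behavior of any particular product:

Code:
<?php
// robots.php - serve a per-crawler robots.txt and log who fetched it.
// Rewrite /robots.txt to this script, e.g. in .htaccess:
//   RewriteRule ^robots\.txt$ robots.php [L]

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
$ip = $_SERVER['REMOTE_ADDR'] ?? '';

// Log every fetch (example path - keep the log outside the web root)
file_put_contents(
    __DIR__ . '/../logs/robots.log',
    date('c') . "\t" . $ip . "\t" . $ua . "\n",
    FILE_APPEND
);

header('Content-Type: text/plain; charset=utf-8');

// Known AI crawlers get shut out completely (example list, not exhaustive)
if (preg_match('/GPTBot|CCBot|Bytespider|PetalBot|ClaudeBot/i', $ua)) {
    echo "User-agent: *\nDisallow: /\n";
} else {
    // Default policy for everyone else
    echo "User-agent: *\nCrawl-delay: 10\n";
}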
 
Worth mentioning here that Apple confirmed this week that they are using Applebot for their own AI training.

ai.txt seems to be only a proposed standard. Do any existing AI companies actually support it?
 
The first two appear to be IPs belonging to Alibaba, in Hong Kong and Seoul; the third would come from Japan, also from Alibaba Cloud.
[Attachment: IMG_2569.webp]

The bots are being picked up as Singapore by Imunify360

Edit: I've just double-checked, and it's actually an AWS range they are using that is in Singapore.

[Attachment: IMG_2570.webp]
 
Thanks, Rusty Snippets -


I will give this a try, and will release Huawei from the PHP die() block for a few days.

But I suspect that companies that try to cloak their access will not follow such rules. My impression is that they are striving for more training data for their AIs, so they ignore anything that would or could block their way.
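In case it helps anyone, the die() block is roughly of this kind - matching by user agent is only for illustration, and the bot names are examples rather than my actual list:

Code:
<?php
// Early in a common include: refuse requests from unwanted crawlers.
// The bot list is an example, not the actual block list used here.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (preg_match('/Bytespider|PetalBot|Amazonbot/i', $ua)) {
    http_response_code(403);
    die('Crawling is not permitted.');
}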
 