Tired of AI scraping your site?

Paul B

Cloudflare has a solution: AI Audit.

Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced AI Audit, a set of tools to help websites of any size analyze and control how their content is used by artificial intelligence (AI) models. For the first time, website and content creators will be able to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it. Additionally, Cloudflare is developing a new feature where content creators can reliably set a fair price for their content that is used by AI companies for model training and retrieval augmented generation (RAG).

 
Another option is placing an ai.txt file in your domain root. I did this about a year ago; not sure how much it helps, but it's there just in case.

You can build yourself a custom ai.txt file at this link:

This is my ai.txt content. I just block everything.

Code:
User-Agent: *

# Text Permissions
Disallow: *.txt
Disallow: *.pdf
Disallow: *.doc
Disallow: *.docx
Disallow: *.odt
Disallow: *.rtf
Disallow: *.tex
Disallow: *.wks
Disallow: *.wpd
Disallow: *.wps
Disallow: *.html

# Images Permissions
Disallow: *.bmp
Disallow: *.gif
Disallow: *.ico
Disallow: *.jpeg
Disallow: *.jpg
Disallow: *.png
Disallow: *.svg
Disallow: *.tif
Disallow: *.tiff
Disallow: *.webp

# Audio Permissions
Disallow: *.aac
Disallow: *.aiff
Disallow: *.amr
Disallow: *.flac
Disallow: *.m4a
Disallow: *.mp3
Disallow: *.oga
Disallow: *.opus
Disallow: *.wav
Disallow: *.wma

# Video Permissions
Disallow: *.mp4
Disallow: *.webm
Disallow: *.ogg
Disallow: *.avi
Disallow: *.mov
Disallow: *.wmv
Disallow: *.flv
Disallow: *.mkv

# Code Permissions
Disallow: *.py
Disallow: *.js
Disallow: *.java
Disallow: *.c
Disallow: *.cpp
Disallow: *.cs
Disallow: *.h
Disallow: *.css
Disallow: *.php
Disallow: *.swift
Disallow: *.go
Disallow: *.rb
Disallow: *.pl
Disallow: *.sh
Disallow: *.sql

# Disallow
Disallow: /
 
I block via ai.txt, robots.txt, and a Cloudflare setting.

For me, it's not about how they use my content... it's that I don't want them to use it at all!
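
In case it's useful, here is roughly what the robots.txt side of that looks like. This is only a sketch: the user-agent strings below are the ones the major AI vendors have publicly documented, but the list changes over time, so verify against each vendor's current docs before relying on it.

Code:
# Sketch of a robots.txt group targeting known AI crawlers.
# Bot names taken from vendor documentation; verify before use.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Bytespider
User-agent: Amazonbot
Disallow: /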
 
All I think this will do is create a new industry.

One where OpenAI, et al., can say "I didn't know" or "We didn't have the computing power to gather content and run our models, so we had to outsource a portion of it."

See: Cambridge Analytica and Facebook.
 
OpenAI has 100% ignored ai.txt within the last 6 months.
 
That may be true, but at least they're nice enough to publish their bot names so they can easily be blocked in other ways: https://platform.openai.com/docs/bots
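
For example, a user-agent block at the web server level doesn't depend on the bot honoring anything. A hypothetical .htaccess sketch for Apache, using the bot names from OpenAI's page above (adjust the list to whatever their docs currently say):

Code:
# Hypothetical Apache .htaccess rules; bot names from OpenAI's docs.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|OAI-SearchBot) [NC]
RewriteRule .* - [F,L]

Rules like these return a 403 regardless of what robots.txt or ai.txt says, which matters when the bot ignores those files.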

...and don't forget to tell the Wayback Machine not to archive your site, or else that's just another place AI can scrape your data from.
They were scraping non-public domain books... OpenAI is going to scrape things regardless until there is regulation or laws against it.
 
And when they can't scrape it, they're just going to buy it (from themselves most likely).

With as much lobbying power as they have, the loopholes in the law will allow for it. They'll push a bill through saying AI can't scrape, but with nothing about other companies taking publicly viewable data, "compiling" it, and selling it to the highest bidder.

You're just telling OpenAI, other AI companies, and the Wayback Machine not to scrape your content. And with legislation, you'd have the government telling them not to.

But can I, as an individual, scrape it if I can view it, and then sell it to OpenAI (supposing the loophole is only closed for corporations other than OpenAI)?

And if legislation is passed in the US to prevent companies or individuals from scraping and selling data, can a corporation or individual in China, Russia, or India scrape it and sell it to a US company? If they can't sell it to a US company, can they sell it to a UK company that resells it to the US company?

The point is that it's public information once you put it on the web. There will be no way to stop it. You can't stop someone from going to the public library and xeroxing an entire book because they don't want to check it out. It's probably against the law, but nobody is going to stop you, fine you, or jail you for making a copy of a book.
 
I don't want to spread false information, so please correct me if I'm wrong, but doesn't this tool have the potential to affect your search engine rankings by blocking desirable crawlers?

Perhaps it's no longer an issue, but I recall reading articles not so long ago suggesting that only those on a paid plan had the option to whitelist crawlers.
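
For what it's worth, at the robots.txt level you can scope blocks so ordinary search crawlers are untouched. A sketch (Google-Extended is Google's documented token for AI-training use; Googlebot proper is a separate agent and keeps crawling for Search):

Code:
# Block AI-training bots only; search crawlers are not named here,
# so they fall through to the permissive catch-all group below.
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

# Everyone else (Googlebot, Bingbot, ...) may crawl normally.
User-agent: *
Disallow: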
 
This seems free so far, but I don't know what "beta" implies or what their future plans look like. Once it's all tuned and ironed out, it could end up costing...

Then again, lol, they're probably scraping this comment I'm making right now, so I don't want to give them ideas ;P
 
Cloudflare has a solution: AI Audit.

What they really have is a pretty good marketing department. There is not, and most certainly never will be, a watertight way to really control how content posted on public websites gets used (by AI crawlers or anything else).
 
People need to save their crawl logs as proof their site was impacted.
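
If you do keep logs, pulling the AI-bot requests out of an access log is simple enough. A hypothetical one-liner (the log path and bot list are examples; adjust both to your own server):

Code:
# Hypothetical example; adjust the log path and bot names to your setup.
grep -iE 'GPTBot|CCBot|ClaudeBot|PerplexityBot|Bytespider' /var/log/nginx/access.log > ai_bot_hits.log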
Seems a waste of time. You'll never be able to prove that the data in an AI model came from your site specifically. It's AI: it will have taken your data and similar data from another hundred sites, combined it, improved upon it, and technically it's no longer your data. It's already smarter than us; there's no sense in fighting it.

Place bot blocks and ai.txt directives on your site; if it listens to them, great, and if not, forget about it. Life is too short to fight something you'll never win against.

Ron Popeil was a wise man...
 