Tired of AI scraping your site?

Paul B

Cloudflare has a solution: AI Audit.

Cloudflare, Inc. (NYSE: NET), the leading connectivity cloud company, today announced AI Audit, a set of tools to help websites of any size analyze and control how their content is used by artificial intelligence (AI) models. For the first time, website and content creators will be able to quickly and easily understand how AI model providers are using their content, and then take control of whether and how the models are able to access it. Additionally, Cloudflare is developing a new feature where content creators can reliably set a fair price for their content that is used by AI companies for model training and retrieval augmented generation (RAG).

 
Another option is placing an ai.txt file in your domain root. I did this about a year ago; not sure how much it helps, but it's there just in case.

You can build yourself a custom ai.txt file at this link:

This is my ai.txt content. I just block everything.

Code:
User-Agent: *

# Text Permissions
Disallow: *.txt
Disallow: *.pdf
Disallow: *.doc
Disallow: *.docx
Disallow: *.odt
Disallow: *.rtf
Disallow: *.tex
Disallow: *.wks
Disallow: *.wpd
Disallow: *.wps
Disallow: *.html

# Images Permissions
Disallow: *.bmp
Disallow: *.gif
Disallow: *.ico
Disallow: *.jpeg
Disallow: *.jpg
Disallow: *.png
Disallow: *.svg
Disallow: *.tif
Disallow: *.tiff
Disallow: *.webp

# Audio Permissions
Disallow: *.aac
Disallow: *.aiff
Disallow: *.amr
Disallow: *.flac
Disallow: *.m4a
Disallow: *.mp3
Disallow: *.oga
Disallow: *.opus
Disallow: *.wav
Disallow: *.wma

# Video Permissions
Disallow: *.mp4
Disallow: *.webm
Disallow: *.ogg
Disallow: *.avi
Disallow: *.mov
Disallow: *.wmv
Disallow: *.flv
Disallow: *.mkv

# Code Permissions
Disallow: *.py
Disallow: *.js
Disallow: *.java
Disallow: *.c
Disallow: *.cpp
Disallow: *.cs
Disallow: *.h
Disallow: *.css
Disallow: *.php
Disallow: *.swift
Disallow: *.go
Disallow: *.rb
Disallow: *.pl
Disallow: *.sh
Disallow: *.sql

# Disallow
Disallow: /
 
I block via ai.txt, robots.txt, and a Cloudflare setting.

For me, it's not about how they use my content... it's that I don't want them to use it at all!
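
In case it's useful, here is roughly what the robots.txt side of that looks like. This is only a sketch: the user-agent strings below are the ones the major AI vendors have publicly documented, but the list changes over time, so verify against each vendor's current docs before relying on it.

Code:
# Sketch of a robots.txt group targeting known AI crawlers.
# Bot names taken from vendor documentation; verify before use.
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: CCBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: Bytespider
User-agent: Amazonbot
Disallow: /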
 
All I think this will do is create a new industry.

One where OpenAI, et al., can say "I didn't know" or "We didn't have the computing power to gather content and run our models, so we had to outsource a portion of it."

See: Cambridge Analytica and Facebook.
 
OpenAI has 100% ignored ai.txt within the last 6 months.
 
That may be true, but at least they're nice enough to publish their bot names so they can easily be blocked in other ways: https://platform.openai.com/docs/bots
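
For example, a user-agent block at the web server level doesn't depend on the bot honoring anything. A hypothetical .htaccess sketch for Apache, using the bot names from OpenAI's page above (adjust the list to whatever their docs currently say):

Code:
# Hypothetical Apache .htaccess rules; bot names from OpenAI's docs.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ChatGPT-User|OAI-SearchBot) [NC]
RewriteRule .* - [F,L]

Rules like these return a 403 regardless of what robots.txt or ai.txt says, which matters when the bot ignores those files.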

...and don't forget to tell the Wayback Machine not to archive your site, or else that's just another place AI can scrape your data from.
They were scraping non-public domain books... OpenAI is going to scrape things regardless until there is regulation or laws against it.
 
And when they can't scrape it, they're just going to buy it (from themselves most likely).

With as much lobbying power as they have, the loopholes in the law will allow for it. They'll push a bill through saying AI can't scrape, but with nothing about other companies taking publicly viewable data, "compiling" it, and selling it to the highest bidder.

You're just telling OpenAI, other AI companies, and the Wayback Machine not to scrape your content. And with legislation, you'd have the government telling them not to.

But can I, as an individual, scrape it if I can view it, and then sell it to OpenAI (supposing the loophole is only closed for corporations other than OpenAI)?

And if legislation is passed in the US to prevent companies or individuals from scraping and selling data, can a corporation or individual in China, Russia, or India scrape it and sell it to a US company? If they can't sell it to a US company, can they sell it to a UK company that resells it to the US company?

The point is that it's public information once you put it on the web. There will be no way to stop it. You can't stop someone from going to the public library and xeroxing an entire book because they don't want to check it out. It's probably against the law, but nobody is going to stop you, fine you, or jail you for making a copy of a book.
 
I don't want to spread false information, so please correct me if I'm wrong, but doesn't this tool have the potential to affect your search engine rankings by blocking desirable crawlers?

Perhaps it's no longer an issue, but I recall reading articles not so long ago suggesting that only those on a paid plan had the option to whitelist crawlers.
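
For what it's worth, at the robots.txt level you can scope blocks so ordinary search crawlers are untouched. A sketch (Google-Extended is Google's documented token for AI-training use; Googlebot proper is a separate agent and keeps crawling for Search):

Code:
# Block AI-training bots only; search crawlers are not named here,
# so they fall through to the permissive catch-all group below.
User-agent: GPTBot
User-agent: Google-Extended
User-agent: CCBot
Disallow: /

# Everyone else (Googlebot, Bingbot, ...) may crawl normally.
User-agent: *
Disallow: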
 
This seems free so far, but I don't know what "beta" implies or what their future plans look like. Once it's all tuned and ironed out, it could end up costing...

Then again, lol, they're probably scraping this comment I'm making right now, so I don't want to give them ideas ;P
 
Cloudflare has a solution: AI Audit.

What they really have is a pretty good marketing department. There is not, and most certainly never will be, a watertight way to really control how content posted on public websites gets used (by AI crawlers or anything else).
 
People need to save their crawl logs as proof their site was impacted.
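
If you do keep logs, pulling the AI-bot requests out of an access log is simple enough. A hypothetical one-liner (the log path and bot list are examples; adjust both to your own server):

Code:
# Hypothetical example; adjust the log path and bot names to your setup.
grep -iE 'GPTBot|CCBot|ClaudeBot|PerplexityBot|Bytespider' /var/log/nginx/access.log > ai_bot_hits.log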
Seems a waste of time. You'll never be able to prove that the data in an AI model came from your site specifically. It's AI: it will have taken your data and similar data from another hundred sites, combined it, improved upon it, and technically it's no longer your data. It's already smarter than us; there's no sense in fighting it.

Place bot blocks and ai.txt directives on your site; if it listens to them, great, and if not, forget about it. Life is too short to fight something you'll never win against.

Ron Popeil was a wise man...
 