Cloudflare have a new blog post today which touches on whether sites should filter based on whether a request comes from a bot or a human, or whether we should instead try to determine intent. It's a long read, but it goes to the heart of the issues we're seeing on this thread, and relates to providing an open Web for everyone.
It is an interesting read, but one should take it with a large grain of salt. The issue with Cloudflare is that they clearly see the problems, they see them at scale, and they employ clever people, so they are able to predict the future to a degree and to provide technical solutions. But they often (if not most of the time) have a very limited perspective, caused by their business model and by trying to be the jack of all trades. They are as much part of the problem as they are part of the solution. And the people writing these articles often take a lot of things as a given which aren't a given, for many reasons: because they are too young, because they are too tech-friendly, and because they follow the perspective that Cloudflare as a company and business sets.
What this guy basically says is:
• It does not matter whether a bot or a human visits your site; what matters are the intentions. He gives the example of someone booking a concert ticket via an AI bot.
He loses me here already, for a simple reason: I run a forum for humans. AI use is explicitly not allowed on my forum b/c it serves exchange between humans. I neither want nor need AI bots on my forum. End of story.
Everything he makes up as an argument does not count for me. I don't care about advertising, revenue, growth or anything that might be of interest to a commercial platform or a marketplace/eCommerce platform. I simply want humans in (the right ones, of course) and bots out.
The whole technical storm that he whips up ignores the basics of what he is talking about. The web started with things like Lynx - it was text-based. Even today this is relevant, as blind people, for example, use text-to-speech to access web pages. He never seems to have heard of these things and wants webpages to become AI-bot-friendly. Well, they already would be if, decades ago, marketing people hadn't decided the web had to become more fancy. In the olden days a webpage was something that sat on a server; it barely even had pictures, and it was small. Today a webpage can easily be over 50MB, 90% of which is noise, and large parts of the content and the code are pulled from countless servers throughout the web. What once was simple has become complex, yet the amount of information transferred has stayed the same - or rather, it has degraded. And to serve all that ******** you need CDNs like Cloudflare.
Search engines, the first bots on the net, were developed to deal with webpages that were made for humans. Over time that changed, and webpages are now made for bots - for Google, to achieve ranking. The whole SEO industry makes a living from that, and humans suffer b/c webpages are now full of ******** that in no way serves humans but exists in the hope of improving the SEO ranking. It is obvious that something went badly wrong at some point, and not just once.
But now he wants webpages to become AI-friendly. Why? The AI bros could easily take over what's left of Zuckerberg's second-life platform and create a web for AIs if they like. Nothing against it, but I won't participate. But leave me alone with your AI BS - I run a platform for humans and don't care for your AI.
Then he moves on to a pretty relevant and interesting topic: authentication and reputation management. But again with a limited perspective, and so he misses the point once more. He focuses on what Cloudflare has done and tries to achieve - but I don't care about the interests of CF. He silently tries to push CF's initiative for bot authentication - which is an interesting idea, but in the end something that a) is not accepted in the market and b) serves CF's interest in stabilizing their de-facto monopoly.
So in my eyes it is no doubt an interesting read, but of not much use. The most interesting aspect of the article comes towards the end, but is merely mentioned, not solved: the consequence of the ongoing changes through AI bots will be that the web becomes less open and less accessible, b/c page owners raise barriers to protect their platforms - and they are forced to do so. This - and at this point I agree with him - cannot be desirable. But it will happen if there is no regulating factor for AI. The AI bros do not respect any conventions or unenforced barriers, so w/o hard regulation the foreseeable outcome will be a web that is far more closed than today.
We can count ourselves lucky that we run "just" forums: on other social media platforms the bot issue has become so bad, and the disinformation through bot postings and AI slop so high (namely on Twitter/X), that there is now an initiative within the EU for a closed social platform that is only accessible to humans, and only after verification via passport. Not my favorite solution (but probably one that works). Someone has to tell those AI guys that their behaviour is unacceptable, but people are too deep in a gold rush, too eager for money, to do that.
They also have a bunch of stats on the 'legit scrapers', so you can see the crawl-to-refer ratio for some of the AI services.
For some businesses who provide content (e.g. medical journals) they're looking into mechanisms to monetise scraping - pay-per-crawl, if you like - so that they have some control over, and recompense from, the legit scrapers.
The bot radar of CF has its issues - they have been claiming that about 30% of traffic comes from bots, whereas we see vastly different numbers on our forums. They only count bots that identify themselves as bots, while we suffer from immense anonymous scraping traffic from data centers and residential proxies that they don't take into account. Also, they count all bot traffic equally: Google's search engine bot counts the same as any scraper collecting LLM training data, and I've not found a way to apply a useful filter to their data.
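The blind spot of user-agent-based counting can be shown in a minimal sketch. The bot tokens below are the real published user-agent strings of those crawlers, but the log lines and the classifier itself are made up for illustration - this is an assumption about how a self-identification-only counter works, not CF's actual method:

```python
# Minimal sketch of a User-Agent-only bot counter. A scraper that
# sends a browser-like UA from a data-center IP lands in the
# "human" bucket - which is exactly the undercounting complained
# about above.
KNOWN_BOT_TOKENS = ("Googlebot", "GPTBot", "ClaudeBot", "CCBot", "Bytespider")

def classify(user_agent: str) -> str:
    """Return 'bot' only if the UA self-identifies as a known crawler."""
    return "bot" if any(t in user_agent for t in KNOWN_BOT_TOKENS) else "human"

requests = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    # An anonymous scraper running headless Chrome looks like any visitor:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
]

counts = {"bot": 0, "human": 0}
for ua in requests:
    counts[classify(ua)] += 1

print(counts)  # the disguised scraper is counted as human traffic
```

The point: any statistic built this way is a lower bound at best, since everything that doesn't announce itself is invisible to it.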
What is depressing is that a whopping 75% of the requests from AI crawlers get a 200 response according to Cloudflare Radar - but again, this includes Google bot traffic for whatever reason, so in the end it is a random number. Also, if you trust their data, less than 3% of webpages regulate the access of AI bots through robots.txt (3xx/10.000) - so it seems that the AI-bot issue is still unknown to the majority of people running webpages.
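For anyone in that majority who wants to join the under-3%: a robots.txt along these lines blocks the self-identifying AI crawlers while leaving classic search indexing alone. The user-agent tokens are the real ones published by OpenAI, Anthropic and Common Crawl; the file layout here is just one possible example. And of course it only affects bots that actually honour robots.txt - the anonymous scrapers complained about above ignore it entirely, which is the whole regulation problem in a nutshell:

```
# Example robots.txt: refuse self-identifying AI crawlers,
# keep normal search engine indexing.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Googlebot
Allow: /
```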
I follow the radar and the blog posts of CF with great interest - but I do not fully trust their data, and I do not trust their initiatives, as they obviously have their own interests as the highest priority.