I moved them to cloudflare (...) Interestingly, fail2ban still catches plenty of stuff
Opsie, Cloudflare still lets AI bots enter even with several preventive measures on the CF side.
It would be interesting to see if and how this changes depending on the Cloudflare product (free vs. paid) one is using. According to their pricing matrix, one can probably not expect too much from the free tier, as it only detects and stops simple bots. Judging from the description, even the "Super Bot Fight Mode" in the Business and Pro levels is marketing bling rather than a solution:
Bot Mitigation
Manage good and bad bots in real-time with speed and accuracy by harnessing the data from the millions of Internet properties on Cloudflare.
Content Scraping Protection
Protect all of your content including text, images and email addresses from web scrapers with Cloudflare's ScrapeShield™ service.
Free tier:
Bot Fight Mode
For an individual website. Challenge easy-to-detect bad bots from popular cloud providers.
Business and Pro tier (paid):
Super Bot Fight Mode
Block and challenge easy-to-detect bad bots from any source. Plus, bypass bot settings using WAF Custom Rules.
Only the enterprise tier offers more than that:
Bot Management
Manage AI crawlers and bot traffic to web and mobile apps without CAPTCHAs. Stop account abuse, malicious botnets, credential and card stuffing, content scraping, and inventory hoarding.
So it possibly comes down to what they consider to be "easy-to-detect bots", and one can only hope that this is not limited to simple things like the transmitted user agent. In 2024 they introduced the BotShield against AI bots:
To help preserve a safe Internet for content creators, we’ve just launched a brand new “easy button” to block all AI bots. It’s available for all customers, including those on our free tier.
blog.cloudflare.com
However, the current scraping attacks are obviously not stopped by Cloudflare, as people report here in the forum.
Searching for "resident proxies" on cloudflare.com shows mainly one result from 2024:
Cloudflare's Bot Management team has released a new Machine Learning model for bot detection (v8), focusing on bots and abuse from residential proxies
blog.cloudflare.com
This refers to the "Bot Management" product:
For existing Bot Management customers we recommend toggling “Auto-update machine learning model” to instantly gain the benefits of ML v8 and its residential proxy detection, and to stay up to date with our future ML model updates. If you’re not a Cloudflare Bot Management customer, contact our sales team to try out Bot Management.
and this again is included only in the highest paid plan (see table higher up in this post).
Cloudflare Bot Management uses global threat intelligence and machine learning to stop attacks—delivering powerful, automated bot protection.
www.cloudflare.com
So it seems there is no such thing as a free lunch with Cloudflare. As usual: if you are not paying, you are not the customer but the product. Cloudflare does offer free tiers with some basic benefits, but they need those customers to gain data and insights at scale, which are then used (only) in their paid products. That is not surprising and in my eyes nothing to really complain about. It is just somewhat surprising that a lot of people on this forum keep falsely claiming that Cloudflare's free tier would solve the bot problem.
The interesting question is how well Cloudflare detects bot and scraping traffic. According to their pretty interesting Radar, bot traffic currently makes up slightly more than 30% of requests:
Interestingly, this went down a bit. If I remember correctly, it was up to 40% a couple of weeks ago. This includes all bots, legitimate bots like Googlebot as well as all sorts of shady ones. Judging from my own forums, they miss a fair bit of bot traffic: I have an average share of ~40% unwanted bot traffic (excluding bots like Googlebot or Bingbot, which alone visit countless times per day), and it goes above 70% on bad days. And still I do not catch all of them, mainly because I cannot reliably identify residential proxies from within central Europe. Also, as I block detected bots on their first request, the percentage of bot traffic on my forums is obviously somewhat lower than it would be if I let them through to perform as many requests as they like.
Obviously I don't know how "average" or typical my forum is compared to Cloudflare's average but judging from my numbers Cloudflare seems to miss a fair bit of bot traffic.
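To make the last point about blocking on the first request more tangible, here is a minimal sketch of the arithmetic. All numbers are hypothetical and chosen only for illustration (they are not measurements from my forum or from Cloudflare), and the assumption that every bot client would send five requests if let through is exactly that, an assumption.

```python
def bot_share(human_requests: int, bot_requests: int) -> float:
    """Return the fraction of all requests that come from bots (0.0 to 1.0)."""
    total = human_requests + bot_requests
    return bot_requests / total if total else 0.0


# Hypothetical numbers, purely for illustration.
human_requests = 6000            # requests from real visitors
bot_clients = 4000               # distinct bot clients that show up
requests_per_bot_if_allowed = 5  # assumed requests per bot if not blocked

# Case 1: every detected bot is blocked on its first request,
# so each bot client contributes exactly one logged request.
blocked_share = bot_share(human_requests, bot_clients * 1)

# Case 2: the same bots are let through and send all their requests.
allowed_share = bot_share(human_requests, bot_clients * requests_per_bot_if_allowed)

print(f"bot share when blocked on first request: {blocked_share:.1%}")
print(f"bot share when let through:              {allowed_share:.1%}")
```

So with the same set of bots, the measured share depends heavily on whether they are stopped early, which makes a direct comparison between my numbers and the Radar percentage somewhat shaky.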
Cloudflare maps the source of bot traffic like this:
Again, this is a bit misleading, as it includes all bots, not just the bad ones. They also name the source ASNs:
These do not really fit the distribution I see on my forum. In contrast, they also map the percentage of bot traffic of all traffic per country, and there you can smell the amount of residential proxies, especially in developing or smaller countries (along with the countries offering a lot of cloud datacenters and/or dodgy providers):
So overall it is somewhat unclear how comprehensively Cloudflare is able to detect bot traffic. On the one hand it seems that they miss a fair bit; on the other hand, bot protection that tackles the current bot waves is only available in the highest paid plan (Enterprise) anyway (and possibly as a paid add-on on lower-level plans). It remains somewhat vague what Cloudflare considers to be "simple bots" or "easy-to-detect bots", yet the level of protection offered by Cloudflare's free tier seems to be largely overrated when it comes to bots. All the more so as a lot of scraping bots lately claim to be able to overcome Cloudflare Turnstile and other protection mechanisms, like fingerprinting, used to identify bots.
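As an illustration of why detection that relies only on the transmitted user agent would be weak, here is a minimal sketch. To be clear, this is not how Cloudflare detects bots (that is not documented in detail); the marker list, the function name and both example user agents are made up for this example.

```python
# Hypothetical list of substrings that honest bots often put in their user agent.
KNOWN_BOT_MARKERS = ("bot", "crawler", "spider", "python-requests", "curl")


def looks_like_bot(user_agent: str) -> bool:
    """Naive check: flag empty user agents and those containing a known marker."""
    ua = user_agent.lower()
    return not ua or any(marker in ua for marker in KNOWN_BOT_MARKERS)


# A crawler that identifies itself honestly (made-up name and URL).
honest_crawler = "ExampleCrawler/1.0 (+https://example.com/bot)"

# A scraper that simply sends a browser-like user agent string.
disguised_scraper = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

print(looks_like_bot(honest_crawler))     # caught by the marker list
print(looks_like_bot(disguised_scraper))  # slips through unnoticed
```

Only the bots that announce themselves get caught this way, while anything that copies a browser's user agent passes, and that is presumably the majority of the current scraping traffic.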