Sad to hear that Cloudflare is barely working for you.
Cloudflare helps a lot, but you have to put many hours of work into the config.
I downloaded the published crawler IP JSON files from about 10 legitimate crawlers and made a Cloudflare list from those IPs. Then I made a security rule that says:
(
not ip.src in $known_good_crawler_ip_addresses
and (
http.user_agent contains "GPTBot"
or http.user_agent contains "ChatGPT-User"
or http.user_agent contains "PerplexityBot"
or http.user_agent contains "Perplexity-User"
or http.user_agent contains "OAI-SearchBot"
or http.user_agent contains "DuckDuckBot"
or http.user_agent contains "DuckAssistBot"
or http.user_agent contains "integralads"
or http.user_agent contains "Criteo"
or http.user_agent contains "AmazonAdBot"
or http.user_agent contains "Applebot"
or http.user_agent contains "bingbot"
)
)
Block the request.
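For anyone building the same list: several of those crawlers publish their ranges in the same JSON shape (a top-level "prefixes" array of ipv4Prefix/ipv6Prefix entries, e.g. Googlebot's googlebot.json and OpenAI's gptbot.json). Here's a minimal sketch that flattens that format into one CIDR list you can paste into a Cloudflare list; the exact field names are assumptions based on the files I used, so verify against each vendor's docs:

```python
import json

def extract_prefixes(doc: dict) -> list[str]:
    """Flatten a crawler IP-range document into one list of CIDR strings.

    Assumes the common published shape:
    {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    """
    cidrs = []
    for entry in doc.get("prefixes", []):
        for key in ("ipv4Prefix", "ipv6Prefix"):
            if key in entry:
                cidrs.append(entry[key])
    return cidrs

# Inline sample in the published format (real files come from each vendor's URL)
sample = json.loads("""
{"prefixes": [
  {"ipv4Prefix": "66.249.64.0/27"},
  {"ipv6Prefix": "2001:4860:4801:10::/64"}
]}
""")
print(extract_prefixes(sample))
# → ['66.249.64.0/27', '2001:4860:4801:10::/64']
```

Run it over each crawler's file and upload the combined output as the $known_good_crawler_ip_addresses list.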
About 75% of all requests with user-agents that match the above are forged.
Also turn on:
Definitely automated traffic: Managed Challenge
and
Likely automated traffic: Managed Challenge
and
Verified bots: Allow
and
Challenge all requests not from the US - (Brazil and Singapore are the two worst offenders, but the cumulative number of worldwide bad-bot requests is astounding. Only 1.16% of these requests can answer the Cloudflare challenge)
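In case it helps, that geo rule is a one-liner in the rule editor. A sketch - the field is ip.src.country in the newer ruleset syntax and ip.geoip.country in legacy firewall rules, so check which one your zone uses:

```
(ip.src.country ne "US")
```

Set the action to Managed Challenge, not Block, so legit non-US visitors can still get through.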
and
Build your own list of user-agents that should always be blocked. This will take a ton of time, but it's worth it.
and
Rate shape the useful bots that make too many requests. Some will respond well to a 429, others are just rude and won't.
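For the rate shaping, a Cloudflare rate limiting rule works; here's a sketch (the "SomeBusyBot" string and the numbers are made-up placeholders - tune them per bot):

```
(http.user_agent contains "SomeBusyBot")
```

Count by IP, allow something like 60 requests per minute, and set the action to Block; Cloudflare answers the overflow with a 429, so the well-behaved bots back off on their own.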
However, NEVER block or rate shape Googlebot, or you'll pay the price. The "Verified bots: Allow" setting has a good handle on Googlebot and will allow legit requests from Google while blocking forged Googlebot user-agents. My point is: just let Cloudflare manage Googlebot and don't do anything else.
This has reduced our "guests" from 20,000 to 6,000, and 6,000 is only about 25% above our pre-bot-storm traffic.
If anyone wants my lists, I will be glad to share. I hate bad bots.