Bot Management using robots.txt in XF Cloud

It took about two to three weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They have not visited since (so far), and it's been that long since my last post in this thread.
The same happened on our site. A few days after adding Bytespider to our robots.txt file, they stopped visiting. Well, today they’re back again. Four pages of them. Now what?
 
It is concerning to me that ByteDance's Bytespider is ignoring robots.txt. We may look at a more robust solution for this that we can implement centrally for all customers.
Is there a follow-up on this, maybe? We’re getting tons of Bytespider bots at the moment and have no way to stop them. They are ignoring our robots.txt file.
 
Is there a follow-up on this, maybe? We’re getting tons of Bytespider bots at the moment and have no way to stop them. They are ignoring our robots.txt file.
Unfortunately this particular spider chooses to ignore the robots.txt file, so the only method that works is via .htaccess, but on the cloud you do not have access to that.

 
Is there a way to block an IP range? All of their IP addresses start with 47.128.

They seem most interested in our members’ images.

EDIT: I thought of a workaround. I put the whole damn 47.128 range into severe discouragement mode. That worked.
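For anyone reading along who self-hosts (this is not possible on XF Cloud), blocking that range at the server level would be a one-liner in Nginx. The /16 mask below is an assumption based on all the observed addresses starting with 47.128:

NGINX:
# Assumption: the unwanted traffic all comes from 47.128.x.x, so block the /16.
# Place inside the server {} block; matching clients get a 403.
deny 47.128.0.0/16;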
 
It took about two to three weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They have not visited since (so far), and it's been that long since my last post in this thread.
I can report that the quote above still holds true: Bytespider is still complying and no longer sending bots.
 
I can report that the quote above still holds true: Bytespider is still complying and no longer sending bots.
Then you were lucky, I guess. I added Bytespider to our robots.txt file and also modified the PAGE_CONTAINER template on 6 June. A few days after that, they stopped visiting us. Until today, when they suddenly swarmed us again. No idea why.

But as I said, I put the IP range into severe discouragement mode and their numbers are now down. They still visit us, but there are fewer of them now, and they are all redirected to our homepage and no longer scraping images.
 
This is the default robots.txt:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

It's sufficient for most cases. If you want to add Bytespider, it changes to:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

These are far from efficient for defending against bots. At the very least, your list is missing the search-engine crawlers from Mainland China.

I'm sharing my plan here. This blocks the Sogou search-engine crawler.

NGINX:
# ========== [MUST BE THE FIRST] NEMESIS AGAINST SOGOU CRAWLER ==========
# 444: Cut TCP, GOING DARK, NO RESPONSE.
if ($http_user_agent ~* "Sogou") {
    return 444;
}

# 444 for other crawlers:
if ($http_user_agent ~* "(Sogou|360Spider|Bytespider|YisouSpider|Spider)") {
    return 444;
}

# Stop crawlers from touching these paths.
location ~* ^/(misc|data|error|files|install|internal_data|js|library|non_official_resources|src|styles|account|attachments|goto|posts|login|search|whats-new)/ {
    if ($http_user_agent ~* (spider|bot|crawl|slurp|Sogou)) {
        return 444;
    }
    try_files $uri $uri/ /index.php?$query_string;
}

# Stop crawlers from touching this PHP file.
location ~* ^/admin\.php {
    if ($http_user_agent ~* (spider|bot|crawl|slurp|Sogou)) {
        return 444;
    }
    try_files $uri $uri/ /index.php?$query_string;
}
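As an aside, the repeated if blocks above are often consolidated with a map; a rough sketch under the same assumptions (same bot list, 444 to cut the connection):

NGINX:
# In the http {} block: evaluate the User-Agent once and set a flag.
# Note: a generic pattern like "bot" would also match Googlebot,
# so list only the crawlers you actually want gone.
map $http_user_agent $block_bot {
    default 0;
    ~*(Sogou|360Spider|Bytespider|YisouSpider) 1;
}

# In the server {} block: drop flagged clients before any location matching.
if ($block_bot) {
    return 444;
}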
 
I doubt that this will work within robots.txt in XF Cloud (which is the topic this thread is about).
Hmm... Sogou crawlers don't give a f*** about robots.txt. That's the problem.
My plan is to set up a defense at the Nginx level. This lets crawlers eat 444s, but it can't stop them from flooding Nginx itself.
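If the flooding itself is the worry, Nginx's built-in rate limiting can at least throttle it; a minimal sketch (the zone name and limits here are arbitrary assumptions):

NGINX:
# In the http {} block: track clients by IP, allowing ~10 requests/second each.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

# In a server {} or location {} block: apply the limit with a small burst,
# answering excess requests with 429 instead of queueing them.
limit_req zone=perip burst=20 nodelay;
limit_req_status 429;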
 
Hmm... Sogou crawlers don't give a f*** about robots.txt. That's the problem.
My plan is to set up a defense at the Nginx level. This lets crawlers eat 444s, but it can't stop them from flooding Nginx itself.
Robots.txt relies on cooperation, and today many bots and crawlers do not cooperate, so robots.txt is useless against them. .htaccess (and the Nginx equivalent) does not need cooperative bots and can do far more. So your idea is clearly the right direction.

But this thread is about robots.txt on XF Cloud, not about .htaccess, which is not configurable on XF Cloud. There are loads of threads on this forum about bots and how to deal with them, and probably a bunch about .htaccess or limiting access via Nginx as well. You should probably post your ideas in one of those; you will get more responses there than in a thread dedicated to XF Cloud, where the toolchain you want to use is not available.
 
If possible, put your website behind the Cloudflare proxy.

They are good at blocking "bad bots". They have a default rule for that and you can add your custom rules as well.
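For what it's worth, a custom rule for the bots discussed in this thread could look roughly like this in Cloudflare's rule expression language, with the action set to Block (treat this as a sketch, not a tested rule):

Code:
(http.user_agent contains "Bytespider") or (http.user_agent contains "Sogou")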
 
If possible, put your website behind the Cloudflare proxy.

They are good at blocking "bad bots". They have a default rule for that and you can add your custom rules as well.
Answering a two-year-old thread without even looking at the question that was asked

Anybody have experience in XF Cloud with safe ways to add to or modify the robots.txt?

is always the most constructive way of dealing with it. ;)
 
In the PAGE_CONTAINER template, modify this as needed.

Code:
<meta name="robots" content="noindex, nofollow, noarchive, noodp, nosnippet, notranslate, noimageindex">
<meta name="googlebot" content="noindex, nofollow">
<meta name="googlebot-news" content="nosnippet">
<meta name="googlebot-video" content="noindex">
<meta name="googlebot-image" content="noindex">
<meta name="bingbot" content="noindex, nofollow">
<meta name="bingpreview" content="noindex, nofollow">
<meta name="msnbot" content="noindex, nofollow">
<meta name="slurp" content="noindex, nofollow">
<meta name="teoma" content="noindex, nofollow">
<meta name="Yandex" content="noindex, nofollow">
<meta name="baidu" content="noindex, nofollow">
<meta name="Yeti" content="noindex, nofollow">
<meta name="ia_archiver" content="noindex, nofollow">
<meta name="facebook" content="noindex, nofollow">
<meta name="twitter" content="noindex, nofollow">
<meta name="rogerbot" content="noindex, nofollow">
<meta name="LinkedInBot" content="noindex, nofollow">
<meta name="embedly" content="noindex, nofollow">
<meta name="slackbot" content="noindex, nofollow">
<meta name="W3C_Validator" content="noindex, nofollow">
<meta name="redditbot" content="noindex, nofollow">
<meta name="discordbot" content="noindex, nofollow">
<meta name="applebot" content="noindex, nofollow">
<meta name="pinterest" content="noindex, nofollow">
<meta name="smtbot" content="noindex, nofollow">
<meta name="googlewebmaster" content="noindex, nofollow">
<meta name="twitterbot" content="noindex, nofollow">
<meta name="tumblr" content="noindex, nofollow">
<meta name="flipboard" content="noindex, nofollow">
<meta name="qualaroo" content="noindex, nofollow">
<meta name="opensearch" content="noindex, nofollow">
<meta name="sogou" content="noindex, nofollow">
<meta name="exabot" content="noindex, nofollow">
<meta name="duckduckbot" content="noindex, nofollow">
<meta name="taptu" content="noindex, nofollow">
<meta name="outbrain" content="noindex, nofollow">
<meta name="Bytespider" content="noindex, nofollow">
There are a couple of bots you should probably allow: Applebot, Googlebot, and Twitterbot, as they handle link unfurling on phones.
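One caveat worth adding: robots meta tags only apply to HTML pages, so they do nothing for direct image requests. On a self-hosted server (again, not available on XF Cloud), the equivalent for attachments would be an X-Robots-Tag response header; a sketch, assuming attachments are routed under /attachments/:

NGINX:
# Assumption: attachment URLs live under /attachments/.
location /attachments/ {
    add_header X-Robots-Tag "noindex, noimageindex, nofollow" always;
    try_files $uri $uri/ /index.php?$query_string;
}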
 
This reply might still be considered off-topic, but some sad news:

Sogou crawlers have started pretending to be an iPhone once they detect a User-Agent-level ban.
 