Bot Management using robots.txt in XF Cloud

It took about two to three weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They have not visited since (so far), and it's been that long since my last post in this thread.
The same happened on our site. A few days after adding Bytespider to our robots.txt file, they stopped visiting. Well, today they’re back again. Four pages of them. Now what?
 
It is concerning to me that ByteDance's Bytespider is ignoring robots.txt. We may look at a more robust solution for this that we can implement centrally for all customers.
Is there a follow-up on this, maybe? We’re getting tons of Bytespider bots at the moment and have no way to stop them. They are ignoring our robots.txt file.
 
Is there a follow-up on this, maybe? We’re getting tons of Bytespider bots at the moment and have no way to stop them. They are ignoring our robots.txt file.
Unfortunately this particular spider chooses to ignore the robots.txt file, so the only method that works is via .htaccess, but on the cloud you do not have access to that.

 
Is there a way to block an IP range? All of their IP addresses start with 47.128.

They seem most interested in our members’ images.

EDIT: I thought of a workaround. I put the whole damn 47.128 range into severe discouragement mode. That worked.
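For anyone reading along who self-hosts (this is not possible on XF Cloud), blocking that range at the server level would be a one-liner in Nginx. The /16 mask below is an assumption based on all the observed addresses starting with 47.128:

NGINX:
# Assumption: the unwanted traffic all comes from 47.128.x.x, so block the /16.
# Place inside the server {} block; matching clients get a 403.
deny 47.128.0.0/16;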
 
It took about two to three weeks to see Bytespider begin to comply with the suggested addition to robots.txt. They have not visited since (so far), and it's been that long since my last post in this thread.
I can report that the quote above still holds true: Bytespider is still complying and no longer sending bots.
 
I can report that the quote above still holds true: Bytespider is still complying and no longer sending bots.
Then you were lucky, I guess. I added Bytespider to our robots.txt file and also modified the PAGE_CONTAINER template on 6 June. A few days after that, they stopped visiting us. Until today, when they suddenly swarmed us again. No idea why.

But as I said, I put the IP range into severe discouragement mode and their numbers are now down. They still visit us, but there are fewer of them now, and they are all redirected to our homepage and no longer scraping images.
 
This is the default robots.txt:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

It's sufficient for most cases. If you want to add Bytespider, it changes to:

Code:
User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /misc/language
Disallow: /misc/style
Disallow: /posts/
Disallow: /login/
Disallow: /search/
Disallow: /whats-new/
Disallow: /admin.php
Allow: /

Sitemap: {sitemap_url}

These are far from efficient for defending against bots. At the very least, your list is missing the search-engine crawlers from Mainland China.

I'm sharing my plan here. This blocks the Sogou search-engine crawler.

NGINX:
# ========== [MUST BE THE FIRST] NEMESIS AGAINST SOGOU CRAWLER ==========
# 444: Cut TCP, GOING DARK, NO RESPONSE.
if ($http_user_agent ~* "Sogou") {
    return 444;
}

# 444 for other crawlers:
if ($http_user_agent ~* "(Sogou|360Spider|Bytespider|YisouSpider|Spider)") {
    return 444;
}

# Stop crawlers from touching these paths.
location ~* ^/(misc|data|error|files|install|internal_data|js|library|non_official_resources|src|styles|account|attachments|goto|posts|login|search|whats-new)/ {
    if ($http_user_agent ~* (spider|bot|crawl|slurp|Sogou)) {
        return 444;
    }
    try_files $uri $uri/ /index.php?$query_string;
}

# Stop crawlers from touching this PHP file.
location ~* ^/admin\.php {
    if ($http_user_agent ~* (spider|bot|crawl|slurp|Sogou)) {
        return 444;
    }
    try_files $uri $uri/ /index.php?$query_string;
}
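As an aside, the repeated if blocks above are often consolidated with a map; a rough sketch under the same assumptions (same bot list, 444 to cut the connection):

NGINX:
# In the http {} block: evaluate the User-Agent once and set a flag.
# Note: a generic pattern like "bot" would also match Googlebot,
# so list only the crawlers you actually want gone.
map $http_user_agent $block_bot {
    default 0;
    ~*(Sogou|360Spider|Bytespider|YisouSpider) 1;
}

# In the server {} block: drop flagged clients before any location matching.
if ($block_bot) {
    return 444;
}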
 
I doubt that this will work within robots.txt in XF Cloud (which is the topic this thread is about).
Hmm... Sogou crawlers don't give a f*** about robots.txt. That's the problem.
My plan is to set up a defense at the Nginx level. This lets crawlers eat 444s, but it can't stop them from flooding Nginx itself.
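If the flooding itself is the worry, Nginx's built-in rate limiting can at least throttle it; a minimal sketch (the zone name and limits here are arbitrary assumptions):

NGINX:
# In the http {} block: track clients by IP, allowing ~10 requests/second each.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

# In a server {} or location {} block: apply the limit with a small burst,
# answering excess requests with 429 instead of queueing them.
limit_req zone=perip burst=20 nodelay;
limit_req_status 429;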
 
Hmm... Sogou crawlers don't give a f*** about robots.txt. That's the problem.
My plan is to set up a defense at the Nginx level. This lets crawlers eat 444s, but it can't stop them from flooding Nginx itself.
Robots.txt relies on cooperation, and today many bots and crawlers do not cooperate, so robots.txt is useless against them. .htaccess (and the Nginx equivalent) does not need cooperative bots and can do far more. So your idea is clearly the right direction.

But this thread is about robots.txt on XF Cloud, not about .htaccess, which is not configurable on XF Cloud. There are loads of threads on this forum about bots and how to deal with them, and probably a bunch about .htaccess or limiting access via Nginx as well. You should probably post your ideas in one of those; you will get more responses there than in a thread dedicated to XF Cloud, where the toolchain you want to use is not available.
 
If possible, put your website behind the Cloudflare proxy.

They are good at blocking "bad bots". They have a default rule for that and you can add your custom rules as well.
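For what it's worth, a custom rule for the bots discussed in this thread could look roughly like this in Cloudflare's rule expression language, with the action set to Block (treat this as a sketch, not a tested rule):

Code:
(http.user_agent contains "Bytespider") or (http.user_agent contains "Sogou")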
 
If possible, put your website behind the Cloudflare proxy.

They are good at blocking "bad bots". They have a default rule for that and you can add your custom rules as well.
Answering a two-year-old thread without even looking at the question that was asked

Anybody have experience in XF Cloud with safe ways to add to or modify the robots.txt?

is always the most constructive way of dealing with it. ;)
 
In the PAGE_CONTAINER template, modify this as needed.

Code:
<meta name="robots" content="noindex, nofollow, noarchive, noodp, nosnippet, notranslate, noimageindex">
<meta name="googlebot" content="noindex, nofollow">
<meta name="googlebot-news" content="nosnippet">
<meta name="googlebot-video" content="noindex">
<meta name="googlebot-image" content="noindex">
<meta name="bingbot" content="noindex, nofollow">
<meta name="bingpreview" content="noindex, nofollow">
<meta name="msnbot" content="noindex, nofollow">
<meta name="slurp" content="noindex, nofollow">
<meta name="teoma" content="noindex, nofollow">
<meta name="Yandex" content="noindex, nofollow">
<meta name="baidu" content="noindex, nofollow">
<meta name="Yeti" content="noindex, nofollow">
<meta name="ia_archiver" content="noindex, nofollow">
<meta name="facebook" content="noindex, nofollow">
<meta name="twitter" content="noindex, nofollow">
<meta name="rogerbot" content="noindex, nofollow">
<meta name="LinkedInBot" content="noindex, nofollow">
<meta name="embedly" content="noindex, nofollow">
<meta name="slackbot" content="noindex, nofollow">
<meta name="W3C_Validator" content="noindex, nofollow">
<meta name="redditbot" content="noindex, nofollow">
<meta name="discordbot" content="noindex, nofollow">
<meta name="applebot" content="noindex, nofollow">
<meta name="pinterest" content="noindex, nofollow">
<meta name="smtbot" content="noindex, nofollow">
<meta name="googlewebmaster" content="noindex, nofollow">
<meta name="twitterbot" content="noindex, nofollow">
<meta name="tumblr" content="noindex, nofollow">
<meta name="flipboard" content="noindex, nofollow">
<meta name="qualaroo" content="noindex, nofollow">
<meta name="opensearch" content="noindex, nofollow">
<meta name="sogou" content="noindex, nofollow">
<meta name="exabot" content="noindex, nofollow">
<meta name="duckduckbot" content="noindex, nofollow">
<meta name="taptu" content="noindex, nofollow">
<meta name="outbrain" content="noindex, nofollow">
<meta name="Bytespider" content="noindex, nofollow">
There are a couple of bots you should probably allow: Applebot, Googlebot, and Twitterbot, as they handle link unfurling on phones.
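One caveat worth adding: robots meta tags only apply to HTML pages, so they do nothing for direct image requests. On a self-hosted server (again, not available on XF Cloud), the equivalent for attachments would be an X-Robots-Tag response header; a sketch, assuming attachments are routed under /attachments/:

NGINX:
# Assumption: attachment URLs live under /attachments/.
location /attachments/ {
    add_header X-Robots-Tag "noindex, noimageindex, nofollow" always;
    try_files $uri $uri/ /index.php?$query_string;
}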
 
This reply might still be considered off-topic, but some sad news:

Sogou crawlers have started pretending to be an iPhone once they detect a User-Agent-level ban.
 