Robots.txt and sitemap questions

Alvin63

Well-known member
I've never actually used a robots.txt file (I didn't get round to it and it hadn't really been an issue), and I've been reading around on here to find examples. I still find it a bit confusing, partly because the examples differ and partly because I don't quite understand some of it. I thought it was just to block robots from scanning the site, but it seems people are adding forum sections to it as well.

Last time I looked into this I thought you needed to add "Allow" at the end to definitively allow Google?

Some explanation on it all would be gratefully received, along with a simple example. I currently have these bots crawling (in addition to Google):

Ahrefs, Bing, Petal Search, Moz DotBot - all of which have been around for a long time, and I'm not aware of any issues related to them. But I've recently had some new ones:

Anthropic, ImageSift, Amazon and Bytedance. No idea where Anthropic and ImageSift came from, or why Amazon has suddenly popped up.

Also - is it actually essential to have a robots.txt? And is it essential to have your sitemap at the end? I'm still confused as to whether I have it as .php or .xml (I think it's .php, so can I actually put .xml in the robots.txt?)

Edit: Occasionally have had Facebook and Apple as well.
 
All robots.txt does is point "Good" bots in the right direction. The file includes instructions on which parts of your website they should crawl and hopefully index. The purpose is to stop "Good" bots from processing unnecessary, content-free files on your site; in general this should reduce some bot traffic and server load. "Bad" robots are just going to do what they want whether you have a robots.txt file or not.

There are enough "Good" bots out there that including a robots.txt file is important. The chief reason is that Google's bots are "Good" and Google currently represents over 90% of search engine traffic in the world. If your site is popular at all, Google's bots will be on it multiple times a day.

Your robots.txt file should be set up to disallow all areas of your site that do not have content worth indexing, or whose content is not unique (already found elsewhere on your site). This helps SEO and saves crawler traffic for the important stuff. You do NOT have to set "Allow" for anything unless you have first set a universal "Disallow" blocking all sections of your site. "Allow" is assumed by default, whether you have a robots.txt file or not. So to make the most compact robots.txt file, you can just list the specific sections of your site you want to disallow.

For a stock XenForo site an example robots.txt would be something like:
Code:
Sitemap: https://www.YourDomain.com/community/sitemap.xml

User-agent: *
Disallow: /community/admin.php
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/direct-messages/
Disallow: /community/goto/
Disallow: /community/login/
Disallow: /community/lost-password/
Disallow: /community/online/
Disallow: /community/posts/
Disallow: /community/register/
Disallow: /community/search/
Disallow: /community/whats-new/

If you don't want member pages indexed add:
Code:
Disallow: /community/members/

If you aren't using tags add:
Code:
Disallow: /community/tags/

If you're using Google AdSense, add this just below your sitemap reference:
Code:
User-agent: Mediapartners-Google
Disallow:

There are no additional changes if you are using Media Gallery or Resource Manager.
 
Thanks. So if I want to get rid of some bots altogether, can I just name them and disallow everything?
Assuming it is a "Good" bot and it will obey its instructions, then yes, you can set a disallow for the entire website specific to that bot. It would look like:

Code:
User-agent: SpecificBot
Disallow: /

Then repeat that section for each additional bot you want to block; a combined example follows the list below.

Common Bots to Consider Disallowing:

  1. AhrefsBot
    • Why: Used by Ahrefs for SEO analysis and backlink checking. It can heavily crawl your site, consuming bandwidth. Block if you don’t want your site’s data in their database.
    • User-agent: AhrefsBot
  2. SemrushBot
    • Why: Similar to AhrefsBot, used by Semrush for SEO and competitive analysis. It can crawl aggressively, impacting server resources.
    • User-agent: SemrushBot
  3. MJ12bot (Majestic)
    • Why: Crawls for backlink analysis. Known for heavy crawling, which can strain smaller servers.
    • User-agent: MJ12bot
  4. DotBot (Moz)
    • Why: Used by Moz for SEO metrics. Can be resource-intensive, especially for small sites.
    • User-agent: DotBot
  5. Baiduspider
    • Why: Baidu’s crawler (China’s search engine). If your site doesn’t target Chinese audiences, blocking it can reduce unnecessary traffic.
    • User-agent: Baiduspider
  6. YandexBot
    • Why: Yandex’s crawler (Russia’s search engine). Block if your site isn’t relevant to Russian users to save resources.
    • User-agent: YandexBot
  7. ia_archiver (Archive.org’s Wayback Machine)
    • Why: Archives your site for historical records. Block if you don’t want your content archived publicly.
    • User-agent: ia_archiver
  8. Common Crawl (CCBot)
    • Why: Used for open datasets and research. Can crawl heavily and may not benefit your site directly.
    • User-agent: CCBot
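If you decide to block several of them, a minimal sketch would look like this (the user-agents here are just examples taken from the list above; keep only the ones you actually want to block):
Code:
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

Any bot not named in its own section falls back to the general "User-agent: *" rules and can still crawl normally.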
 
The reason I'm asking about allow is that there's an example on a Google page which shows allowing one bot but disallowing all others. Is that an option? To just allow Google and disallow all others? (The example is for Googlebot-news, but I assume it could just say Google?) Also, am I correct that there is a space before the slash?

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
 
Although, confusingly, in another example it has "Allow" last instead of first:

Disallow crawling of an entire site, but allow Mediapartners-Google
This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
 
The reason I'm asking about allow is that there's an example on a Google page which shows allowing one bot but disallowing all others. Is that an option? To just allow Google and disallow all others? (The example is for Googlebot-news, but I assume it could just say Google?) Also, am I correct that there is a space before the slash?

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

It is an option, but I would say a completely wrong option.

Don't over-complicate this and hurt your site by setting a Disallow for all. You're going from one bad situation (no robots.txt at all) to a worse one by only allowing a single bot or a few bots.

There are many "Good" bots, and many of them come from Google alone. They don't change often, but they do change: Google adds new bots on occasion, and so do other good bot sources. To name just a few (a sketch of how this fits together in robots.txt follows the list):
  1. Googlebot
    • Purpose: Crawls and indexes websites for Google Search.
    • Why It’s Good: Critical for ranking in Google, driving organic traffic.
    • User-agent: Googlebot
  2. Bingbot
    • Purpose: Crawls and indexes websites for Microsoft’s Bing search engine.
    • Why It’s Good: Increases visibility on Bing, reaching Microsoft ecosystem users.
    • User-agent: Bingbot
  3. DuckDuckBot
    • Purpose: Crawls websites for DuckDuckGo, a privacy-focused search engine.
    • Why It’s Good: Appeals to privacy-conscious users, expanding reach without tracking.
    • User-agent: DuckDuckBot
  4. Slurp (Yahoo)
    • Purpose: Crawls sites for Yahoo Search (powered by Bing).
    • Why It’s Good: Ensures visibility on Yahoo’s niche audience.
    • User-agent: Slurp
  5. Twitterbot
    • Purpose: Crawls pages for Twitter Card previews on X.
    • Why It’s Good: Enhances link previews with images and summaries on X, boosting engagement.
    • User-agent: Twitterbot
  6. Facebot (Facebook)
    • Purpose: Crawls pages for link previews on Facebook.
    • Why It’s Good: Improves content appearance on Facebook, driving social engagement.
    • User-agent: Facebot
  7. Applebot
    • Purpose: Crawls websites for Siri and Spotlight Search.
    • Why It’s Good: Makes content discoverable on Apple devices, reaching a large ecosystem.
    • User-agent: Applebot
  8. LinkedInBot
    • Purpose: Crawls pages for link previews on LinkedIn.
    • Why It’s Good: Enhances professional content sharing, ideal for business sites.
    • User-agent: LinkedInBot
  9. Pinterestbot
    • Purpose: Crawls pages for Pinterest link previews and content discovery.
    • Why It’s Good: Boosts visibility for visual content, driving traffic from Pinterest.
    • User-agent: Pinterestbot
  10. Googlebot-Image
    • Purpose: Crawls images for Google Image Search.
    • Why It’s Good: Drives traffic through image search, ideal for visual content.
    • User-agent: Googlebot-Image
  11. Googlebot-Video
    • Purpose: Crawls videos for Google Video Search.
    • Why It’s Good: Ensures videos are indexed, increasing discoverability.
    • User-agent: Googlebot-Video
  12. Googlebot-News
    • Purpose: Crawls content for Google News.
    • Why It’s Good: Boosts visibility for news content, reaching timely information seekers.
    • User-agent: Googlebot-News
  13. BingPreview
    • Purpose: Captures snapshots for Bing’s search result previews.
    • Why It’s Good: Enhances rich snippets in Bing, improving click-through rates.
    • User-agent: BingPreview
  14. Discordbot
    • Purpose: Crawls pages for link previews in Discord chats.
    • Why It’s Good: Improves content sharing in Discord’s community platform.
    • User-agent: Discordbot
  15. Slackbot
    • Purpose: Crawls pages for link previews in Slack workspaces.
    • Why It’s Good: Ensures clean previews in Slack, useful for professional collaboration.
    • User-agent: Slackbot
  16. WhatsAppBot
    • Purpose: Crawls pages for link previews in WhatsApp messages.
    • Why It’s Good: Enhances content sharing on WhatsApp, a major messaging platform.
    • User-agent: WhatsApp
  17. AdsBot-Google
    • Purpose: Crawls pages to evaluate quality for Google Ads campaigns.
    • Why It’s Good: Ensures landing pages meet Google Ads standards, improving ad performance.
    • User-agent: AdsBot-Google
  18. Google-InspectionTool
    • Purpose: Crawls pages for Google Search Console to inspect URLs and diagnose indexing issues.
    • Why It’s Good: Helps webmasters troubleshoot indexing, ensuring optimal Google performance.
    • User-agent: Google-InspectionTool
  19. Google-Site-Verification
    • Purpose: Verifies site ownership for Google services like Search Console.
    • Why It’s Good: Enables access to Google’s webmaster tools, critical for SEO monitoring.
    • User-agent: Google-Site-Verification
  20. Redditbot
    • Purpose: Crawls pages for link previews on Reddit.
    • Why It’s Good: Enhances content appearance on Reddit, driving engagement from its communities.
    • User-agent: Redditbot
  21. Google-Extended
    • Purpose: Crawls pages for Google’s AI and extended services (e.g., Bard or other AI-driven features).
    • Why It’s Good: Ensures content is available for Google’s AI-powered features, increasing visibility in emerging search formats.
    • User-agent: Google-Extended
  22. TelegramBot
    • Purpose: Crawls pages for link previews in Telegram messages.
    • Why It’s Good: Improves content sharing on Telegram, a popular messaging platform with privacy-focused users.
    • User-agent: TelegramBot
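Rather than trying to enumerate all of these with "Allow" rules, the simpler pattern is to keep the default (everything allowed), disallow only the low-value paths for everyone, and add a full block only for the specific crawlers you don't want. A minimal sketch, assuming the /community/ install path from the earlier example and SemrushBot standing in for a bot you've chosen to block:
Code:
Sitemap: https://www.YourDomain.com/community/sitemap.xml

# Applies to every bot not named below: crawl everything except these paths
User-agent: *
Disallow: /community/whats-new/
Disallow: /community/search/

# Full block only for crawlers you have specifically decided to exclude
User-agent: SemrushBot
Disallow: /

A bot that matches a named group (like SemrushBot here) follows only that group, so every other good bot still gets the permissive default rules.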
 
Quick question about sitemap.xml - I vaguely remember my sitemap is .php, and presumably it needs to be one or the other? I tried adding both to the browser (.php and .xml) and both came up the same. Is there somewhere in the ACP where you set the sitemap extension?
 
Quick question about sitemap.xml - I vaguely remember my sitemap is .php, and presumably it needs to be one or the other? I tried adding both to the browser (.php and .xml) and both came up the same. Is there somewhere in the ACP where you set the sitemap extension?
By default XenForo creates both .php and .xml sitemaps - they include exactly the same content. The .xml sitemap format is the most versatile of the supported sitemap formats and is recognized not only by Google but by many other search engines and crawlers. Just go with it; it's been the standard for a long time and will be universally accepted.
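In other words, assuming the same /community/ install path and placeholder domain as the earlier example, both of these URLs return the same sitemap data in the browser:
Code:
https://www.YourDomain.com/community/sitemap.php
https://www.YourDomain.com/community/sitemap.xml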
 
Thanks. My sitemap in Google Search Console is .php. I thought I read that I shouldn't have both that and .xml, or it could cause issues with Google. So is it OK to have the .php sitemap in the robots.txt? Or alternatively change the .php version to .xml somewhere?

Or could I just leave the sitemap off the robots.txt, as it's already in Google Search Console?
 
Thanks. My sitemap in Google Search Console is .php. I thought I read that I shouldn't have both that and .xml, or it could cause issues with Google. So is it OK to have the .php sitemap in the robots.txt? Or alternatively change the .php version to .xml somewhere?

Or could I just leave the sitemap off the robots.txt, as it's already in Google Search Console?
I cannot think of any occasion where mixing sitemap extensions (by having a .php sitemap reference in Google Search Console) would cause an issue. That being said, I personally would delete the .php sitemap reference in Google Search Console and replace it with the .xml version. Like most site owners, coders, etc., I like to keep things as organized and uniform as possible across all platforms.

The sitemap reference should still be in robots.txt as well, and as mentioned earlier it should point to the .xml file.

The sitemap reference in robots.txt is not just used by Google; other user-agents will use it to find their way around your site. For this reason the universally recognized .xml sitemap format is the better option. Google and many other user-agents will handle a .php sitemap correctly, but there is no guarantee that all of them will. Make your installation as friendly as possible to all allowed bots by using the .xml extension in robots.txt.
 
Thank you. Just want to make sure I don't mess anything up, as there are various things about sitemaps in some threads and I don't fully understand some of it. E.g. this thread, where people were having issues with sitemaps and somebody switched back to .php:


It was in an earlier thread that I saw the comment not to use both sitemaps - from you :-) Unless I've misunderstood it!

"
Also, I have a sitemap.php, not a sitemap.xml file? How would one get the XML file?
They are the same, or I should say they have the same data in them. You can use them interchangeably - just don't use both.

What could be the problem?

It can show as duplicate submissions in Google Search Console, and can cause a recurring coverage issue alert to be created.


So I'm assuming I'd need to remove the .php sitemap from Google before submitting the .xml version?

I've been stuck with this for ages! So I just left it as it was - Google regularly reads the sitemap in Google Search Console, but it's the .php one - which is probably why I never got round to doing the robots.txt!
 
It was this mention of duplicate URLs that had me concerned about submitting a .xml sitemap when Google Search Console currently has a .php sitemap.

Apologies - as you can see I'm still confused about it :-)

I could do with a noddy guide - how to replace sitemap.php with sitemap.xml in Google Search Console!

So sitemap.php is not needed?
Correct. You can actually use either, but using both will just submit duplicate URLs.

 
OK, so I hope this has worked. I've deleted the .php sitemaps in Google Search Console and submitted the .xml one. Was it that simple?! :-)

So assuming that's done, I'm just doing my robots.txt file. The examples I've seen - like yours above - have "community" in them. I don't seem to have a /community link for the domain. Can I just leave community out of the sitemap address and the disallow list?

Eg instead of

Disallow: /community/whats-new/

Have

Disallow: /whats-new/

Sitemap: https://www.sitename.com/sitemap.xml

(Which is how my sitemap comes up in the browser).
 
Yes, that's it.

You deleted the three .php sitemaps, which were all duplicates of the same data, and added a single reference to the .xml sitemap, which should also show 1,959 discovered pages. As long as it does, you're good.
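For reference, the earlier example adapted to a forum installed in the web root (no /community/ prefix, placeholder domain) would look something like this:
Code:
Sitemap: https://www.YourDomain.com/sitemap.xml

User-agent: *
Disallow: /admin.php
Disallow: /account/
Disallow: /attachments/
Disallow: /direct-messages/
Disallow: /goto/
Disallow: /login/
Disallow: /lost-password/
Disallow: /online/
Disallow: /posts/
Disallow: /register/
Disallow: /search/
Disallow: /whats-new/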
 