Robots.txt and sitemap questions

Alvin63

Well-known member
I've never actually used a robots.txt file (I didn't get round to it and it hadn't really been an issue), and I've been reading around on here to find examples. I still find it a bit confusing, partly because the examples differ and partly because I don't quite understand some of it. I thought it was just to block robots from scanning the site, but it seems people are adding forum sections to it as well.

Last time I looked into this I thought you needed to add "Allow" at the end to definitively allow Google?

Some explanation on it all would be gratefully received, along with a simple example. I currently have these bots crawling (in addition to Google):

Ahrefs, Bing, Petal Search, Moz DotBot - all of which have been around for a long time, and I'm not aware of any issues related to them. But I've recently had some new ones:

Anthropic, ImageSift, Amazon and Bytedance. No idea where Anthropic and ImageSift came from, or why Amazon has suddenly popped up.

Also - is it actually essential to have a robots.txt? And is it essential to have your sitemap at the end? I'm still confused as to whether I have it as .php or .xml (I think it's .php, so can I actually put .xml in the robots.txt?)

Edit: Occasionally have had Facebook and Apple as well.
 
All robots.txt does is point "Good" bots in the right direction. The file includes instructions on which parts of your website they should crawl and hopefully index. The purpose is to stop "Good" bots from processing unnecessary, content-free files on your site; in general this should reduce some bot traffic and server load. "Bad" robots are just going to do what they want whether you have a robots.txt file or not.

There are enough "Good" bots out there that including a robots.txt file is important. The chief reason is that Google's bots are "Good" and Google currently represents over 90% of search engine traffic in the world. If your site is popular at all, Google's bots will be on it multiple times a day.

Your robots.txt file should be set up to disallow all areas of your site that do not have content worth indexing, or whose content is not unique (already found elsewhere on your site). This helps SEO and saves crawler traffic for the important stuff. You do NOT have to set "Allow" for anything unless you have first set a universal "Disallow" blocking all sections of your site. "Allow" is assumed by default, whether you have a robots.txt file or not. So to make the most compact robots.txt file, you can just list the specific sections of your site you want to disallow.

For a stock XenForo site an example robots.txt would be something like:
Code:
Sitemap: https://www.YourDomain.com/community/sitemap.xml

User-agent: *
Disallow: /community/admin.php
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/direct-messages/
Disallow: /community/goto/
Disallow: /community/login/
Disallow: /community/lost-password/
Disallow: /community/online/
Disallow: /community/posts/
Disallow: /community/register/
Disallow: /community/search/
Disallow: /community/whats-new/

If you don't want member pages indexed add:
Code:
Disallow: /community/members/

If you aren't using tags add:
Code:
Disallow: /community/tags/

If you're using Google AdSense, add this just below your sitemap reference:
Code:
User-agent: Mediapartners-Google
Disallow:

There are no additional changes if you are using Media Gallery or Resource Manager.
 
Thanks. So if I want to get rid of some bots altogether, can I just name them and disallow everything?
Assuming it is a "Good" bot and it will obey its instructions, then yes, you can set a disallow for the entire website specific to that bot. It would look like:

Code:
User-agent: SpecificBot
Disallow: /

Then repeat that section for each additional bot you want to block; a combined example follows the list below.

Common Bots to Consider Disallowing:

  1. AhrefsBot
    • Why: Used by Ahrefs for SEO analysis and backlink checking. It can heavily crawl your site, consuming bandwidth. Block if you don’t want your site’s data in their database.
    • User-agent: AhrefsBot
  2. SemrushBot
    • Why: Similar to AhrefsBot, used by Semrush for SEO and competitive analysis. It can crawl aggressively, impacting server resources.
    • User-agent: SemrushBot
  3. MJ12bot (Majestic)
    • Why: Crawls for backlink analysis. Known for heavy crawling, which can strain smaller servers.
    • User-agent: MJ12bot
  4. DotBot (Moz)
    • Why: Used by Moz for SEO metrics. Can be resource-intensive, especially for small sites.
    • User-agent: DotBot
  5. Baiduspider
    • Why: Baidu’s crawler (China’s search engine). If your site doesn’t target Chinese audiences, blocking it can reduce unnecessary traffic.
    • User-agent: Baiduspider
  6. YandexBot
    • Why: Yandex’s crawler (Russia’s search engine). Block if your site isn’t relevant to Russian users to save resources.
    • User-agent: YandexBot
  7. ia_archiver (Archive.org’s Wayback Machine)
    • Why: Archives your site for historical records. Block if you don’t want your content archived publicly.
    • User-agent: ia_archiver
  8. Common Crawl (CCBot)
    • Why: Used for open datasets and research. Can crawl heavily and may not benefit your site directly.
    • User-agent: CCBot
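If you decide to block several of them, a minimal sketch would look like this (the user-agents here are just examples taken from the list above; keep only the ones you actually want to block):
Code:
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /

Any bot not named in its own section falls back to the general "User-agent: *" rules and can still crawl normally.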
 
The reason I'm asking about allow is that there's an example on a Google page which shows allowing one bot but disallowing all others. Is that an option? To just allow Google and disallow all others? (The example is for Googlebot-news, but I assume it could just say Google?) Also, am I correct that there is a space before the slash?

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
 
Although, confusingly, in another example it has "Allow" last instead of first:

Disallow crawling of an entire site, but allow Mediapartners-Google
This implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors on your site.
User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /
 
The reason I'm asking about allow is that there's an example on a Google page which shows allowing one bot but disallowing all others. Is that an option? To just allow Google and disallow all others? (The example is for Googlebot-news, but I assume it could just say Google?) Also, am I correct that there is a space before the slash?

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

It is an option, but I would say a completely wrong option.

Don't over-complicate this and hurt your site by setting a Disallow for all. You're going from one bad situation (no robots.txt at all) to a worse one by only allowing a single bot or a few bots.

There are many "Good" bots, and many of them come from Google alone. They don't change often, but they do change: Google adds new bots on occasion, and so do other good bot sources. To name just a few (a sketch of how this fits together in robots.txt follows the list):
  1. Googlebot
    • Purpose: Crawls and indexes websites for Google Search.
    • Why It’s Good: Critical for ranking in Google, driving organic traffic.
    • User-agent: Googlebot
  2. Bingbot
    • Purpose: Crawls and indexes websites for Microsoft’s Bing search engine.
    • Why It’s Good: Increases visibility on Bing, reaching Microsoft ecosystem users.
    • User-agent: Bingbot
  3. DuckDuckBot
    • Purpose: Crawls websites for DuckDuckGo, a privacy-focused search engine.
    • Why It’s Good: Appeals to privacy-conscious users, expanding reach without tracking.
    • User-agent: DuckDuckBot
  4. Slurp (Yahoo)
    • Purpose: Crawls sites for Yahoo Search (powered by Bing).
    • Why It’s Good: Ensures visibility on Yahoo’s niche audience.
    • User-agent: Slurp
  5. Twitterbot
    • Purpose: Crawls pages for Twitter Card previews on X.
    • Why It’s Good: Enhances link previews with images and summaries on X, boosting engagement.
    • User-agent: Twitterbot
  6. Facebot (Facebook)
    • Purpose: Crawls pages for link previews on Facebook.
    • Why It’s Good: Improves content appearance on Facebook, driving social engagement.
    • User-agent: Facebot
  7. Applebot
    • Purpose: Crawls websites for Siri and Spotlight Search.
    • Why It’s Good: Makes content discoverable on Apple devices, reaching a large ecosystem.
    • User-agent: Applebot
  8. LinkedInBot
    • Purpose: Crawls pages for link previews on LinkedIn.
    • Why It’s Good: Enhances professional content sharing, ideal for business sites.
    • User-agent: LinkedInBot
  9. Pinterestbot
    • Purpose: Crawls pages for Pinterest link previews and content discovery.
    • Why It’s Good: Boosts visibility for visual content, driving traffic from Pinterest.
    • User-agent: Pinterestbot
  10. Googlebot-Image
    • Purpose: Crawls images for Google Image Search.
    • Why It’s Good: Drives traffic through image search, ideal for visual content.
    • User-agent: Googlebot-Image
  11. Googlebot-Video
    • Purpose: Crawls videos for Google Video Search.
    • Why It’s Good: Ensures videos are indexed, increasing discoverability.
    • User-agent: Googlebot-Video
  12. Googlebot-News
    • Purpose: Crawls content for Google News.
    • Why It’s Good: Boosts visibility for news content, reaching timely information seekers.
    • User-agent: Googlebot-News
  13. BingPreview
    • Purpose: Captures snapshots for Bing’s search result previews.
    • Why It’s Good: Enhances rich snippets in Bing, improving click-through rates.
    • User-agent: BingPreview
  14. Discordbot
    • Purpose: Crawls pages for link previews in Discord chats.
    • Why It’s Good: Improves content sharing in Discord’s community platform.
    • User-agent: Discordbot
  15. Slackbot
    • Purpose: Crawls pages for link previews in Slack workspaces.
    • Why It’s Good: Ensures clean previews in Slack, useful for professional collaboration.
    • User-agent: Slackbot
  16. WhatsAppBot
    • Purpose: Crawls pages for link previews in WhatsApp messages.
    • Why It’s Good: Enhances content sharing on WhatsApp, a major messaging platform.
    • User-agent: WhatsApp
  17. AdsBot-Google
    • Purpose: Crawls pages to evaluate quality for Google Ads campaigns.
    • Why It’s Good: Ensures landing pages meet Google Ads standards, improving ad performance.
    • User-agent: AdsBot-Google
  18. Google-InspectionTool
    • Purpose: Crawls pages for Google Search Console to inspect URLs and diagnose indexing issues.
    • Why It’s Good: Helps webmasters troubleshoot indexing, ensuring optimal Google performance.
    • User-agent: Google-InspectionTool
  19. Google-Site-Verification
    • Purpose: Verifies site ownership for Google services like Search Console.
    • Why It’s Good: Enables access to Google’s webmaster tools, critical for SEO monitoring.
    • User-agent: Google-Site-Verification
  20. Redditbot
    • Purpose: Crawls pages for link previews on Reddit.
    • Why It’s Good: Enhances content appearance on Reddit, driving engagement from its communities.
    • User-agent: Redditbot
  21. Google-Extended
    • Purpose: Crawls pages for Google’s AI and extended services (e.g., Bard or other AI-driven features).
    • Why It’s Good: Ensures content is available for Google’s AI-powered features, increasing visibility in emerging search formats.
    • User-agent: Google-Extended
  22. TelegramBot
    • Purpose: Crawls pages for link previews in Telegram messages.
    • Why It’s Good: Improves content sharing on Telegram, a popular messaging platform with privacy-focused users.
    • User-agent: TelegramBot
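Rather than trying to enumerate all of these with "Allow" rules, the simpler pattern is to keep the default (everything allowed), disallow only the low-value paths for everyone, and add a full block only for the specific crawlers you don't want. A minimal sketch, assuming the /community/ install path from the earlier example and SemrushBot standing in for a bot you've chosen to block:
Code:
Sitemap: https://www.YourDomain.com/community/sitemap.xml

# Applies to every bot not named below: crawl everything except these paths
User-agent: *
Disallow: /community/whats-new/
Disallow: /community/search/

# Full block only for crawlers you have specifically decided to exclude
User-agent: SemrushBot
Disallow: /

A bot that matches a named group (like SemrushBot here) follows only that group, so every other good bot still gets the permissive default rules.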
 
Quick question about sitemap.xml - I vaguely remember my sitemap is .php, and presumably it needs to be one or the other? I tried adding both to the browser (.php and .xml) and both came up the same. Is there somewhere in the ACP where you set the sitemap extension?
 
Quick question about sitemap.xml - I vaguely remember my sitemap is .php, and presumably it needs to be one or the other? I tried adding both to the browser (.php and .xml) and both came up the same. Is there somewhere in the ACP where you set the sitemap extension?
By default XenForo creates both .php and .xml sitemaps - they include exactly the same content. The .xml sitemap format is the most versatile of the supported sitemap formats and is recognized not only by Google but by many other search engines and crawlers. Just go with it; it's been the standard for a long time and will be universally accepted.
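In other words, assuming the same /community/ install path and placeholder domain as the earlier example, both of these URLs return the same sitemap data in the browser:
Code:
https://www.YourDomain.com/community/sitemap.php
https://www.YourDomain.com/community/sitemap.xml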
 
Thanks. My sitemap in Google Search Console is .php. I thought I read that I shouldn't have both that and .xml, or it could cause issues with Google. So is it OK to have the .php sitemap in the robots.txt? Or alternatively change the .php version to .xml somewhere?

Or could I just leave the sitemap off the robots.txt, as it's already in Google Search Console?
 
Thanks. My sitemap in Google Search Console is .php. I thought I read that I shouldn't have both that and .xml, or it could cause issues with Google. So is it OK to have the .php sitemap in the robots.txt? Or alternatively change the .php version to .xml somewhere?

Or could I just leave the sitemap off the robots.txt, as it's already in Google Search Console?
I cannot think of any occasion where mixing sitemap extensions (by having a .php sitemap reference in Google Search Console) would cause an issue. That being said, I personally would delete the .php sitemap reference in Google Search Console and replace it with the .xml version. Like most site owners, coders, etc., I like to keep things as organized and uniform as possible across all platforms.

The sitemap reference should still be in robots.txt as well, and as mentioned earlier it should point to the .xml file.

The sitemap reference in robots.txt is not just used by Google; other user-agents will use it to find their way around your site. For this reason the universally recognized .xml sitemap format is the better option. Google and many other user-agents will handle a .php sitemap correctly, but there is no guarantee that all of them will. Make your installation as friendly as possible to all allowed bots by using the .xml extension in robots.txt.
 
Thank you. Just want to make sure I don't mess anything up, as there are various things about sitemaps in some threads and I don't fully understand some of it. E.g. this thread, where people were having issues with sitemaps and somebody switched back to .php:


It was in an earlier thread that I saw the comment not to use both sitemaps - from you :-) Unless I've misunderstood it!

"
Also, I have a sitemap.php, not a sitemap.xml file? How would one get the XML file?
They are the same, or I should say they have the same data in them. You can use them interchangeably - just don't use both.

What could be the problem?

It can show as duplicate submissions in Google Search Console, and can cause a recurring coverage issue alert to be created.


So I'm assuming I'd need to remove the .php sitemap from Google before submitting the .xml version?

I've been stuck with this for ages! So I just left it as it was - Google regularly reads the sitemap in Google Search Console, but it's the .php one - which is probably why I never got round to doing the robots.txt!
 
It was this mention of duplicate URLs that had me concerned about submitting a .xml sitemap when Google Search Console currently has a .php sitemap.

Apologies - as you can see I'm still confused about it :-)

I could do with a noddy guide - how to replace sitemap.php with sitemap.xml in Google Search Console!

So sitemap.php is not needed?
Correct. You can actually use either, but using both will just submit duplicate URLs.

 
OK, so I hope this has worked. I've deleted the .php sitemaps in Google Search Console and submitted the .xml one. Was it that simple?! :-)

So assuming that's done, I'm just doing my robots.txt file. The examples I've seen - like yours above - have "community" in them. I don't seem to have a /community link for the domain. Can I just leave community out of the sitemap address and the disallow list?

Eg instead of

Disallow: /community/whats-new/

Have

Disallow: /whats-new/

Sitemap: https://www.sitename.com/sitemap.xml

(Which is how my sitemap comes up in the browser).
 
Yes, that's it.

You deleted the three .php sitemaps, which were all duplicates of the same data, and added a single reference to the .xml sitemap, which should also show 1,959 discovered pages. As long as it does, you're good.
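For reference, the earlier example adapted to a forum installed in the web root (no /community/ prefix, placeholder domain) would look something like this:
Code:
Sitemap: https://www.YourDomain.com/sitemap.xml

User-agent: *
Disallow: /admin.php
Disallow: /account/
Disallow: /attachments/
Disallow: /direct-messages/
Disallow: /goto/
Disallow: /login/
Disallow: /lost-password/
Disallow: /online/
Disallow: /posts/
Disallow: /register/
Disallow: /search/
Disallow: /whats-new/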
 