Robots.txt and sitemap questions

Thanks. Can I leave the /community out of robots.txt as well? I tried putting mysitename.com/community/sitemap.xml in the browser and just got an "oops" error. The sitemap came up if I just put mysitename.com/sitemap.xml.

Just wondering why others have community/ in them?
Because that’s part of the URL if their forums are in that directory/folder. Like this one is (xenForo.com/community)
 
Yes, it still shows 1959 discovered pages. However it says 2.09k pages indexed (and 47.4k not indexed). I'm aware it's going to take time for them to be indexed again after closing the site down and re-opening. But how can it index 2.09k pages if the sitemap only shows 1959 discovered pages? 🤔
The indexed pages will never exactly match your sitemap's discovered pages, especially since you only just established a robots.txt.

Prior to the robots.txt being in place, the only thing Google could go by on what to index and not index was on-page meta tags. Likely you had some pages already indexed that you are now telling Google not to index. This is all fine, and I'm surprised your numbers are as close as they are.

You should only be concerned if the indexed page count drops below the discovered page count. It can happen, though; Google will only index content it feels is valuable.
 
Thanks. I haven't actually done the robots.txt yet. Does this look ok?

User-agent: PetalBot
User-agent: AspiegelBot
User-agent: AhrefsBot
User-agent: SemrushBot
User-agent: DotBot
User-agent: MauiBot
User-agent: MJ12bot
User-agent: ByteDanceBot
User-agent: AnthropicBot
User-agent: ImageSift
Disallow: /

User-agent: Amazonbot
Disallow: /threads/*/reply

User-agent: *
Disallow: /admin.php
Disallow: /whats-new/
Disallow: /account/
Disallow: /attachments/
Disallow: /login/
Disallow: /search/
Disallow: /help/
Disallow: /members/
Disallow: /register/
Disallow: /goto/

Sitemap: https://www.thehamsterforum.com/sitemap.xml
 
Not sure if the syntax is right for Anthropic. Also, why do some examples disallow /posts? Surely you want Google etc. indexing threads and posts? Also not sure whether to include PetalBot; it doesn't seem to crop up that much or do any harm on my site.
 
Surely you want Google etc. indexing threads and posts?
Think of a thread like a whole page. A post is just one part of the page, and there is no point in indexing just a post. At least I think that is the reason; however, I'm not sure it's necessary.
 
Your listing of multiple user-agents consecutively before a single Disallow directive is recognized by Google, but not by all crawlers. To be safe, you should list each user-agent separately with its own Disallow directive beneath it.

My example to follow for XenForo-specific directives was already posted. Disallowing /posts is correct - if indexed, those URLs would be duplicate content of the /threads pages.
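For what it's worth, you can sanity-check the rules yourself. A minimal sketch using Python's standard-library robots.txt parser (which, like Google, treats consecutive User-agent lines as one group); the rules are trimmed from the file above, and the test paths are made-up examples:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: AhrefsBot
User-agent: SemrushBot
Disallow: /

User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Both named bots fall under the grouped Disallow: / rule.
print(parser.can_fetch("AhrefsBot", "/threads/example-thread.123/"))   # False
print(parser.can_fetch("SemrushBot", "/"))                             # False
# Everyone else is only kept out of the listed paths.
print(parser.can_fetch("Googlebot", "/threads/example-thread.123/"))   # True
print(parser.can_fetch("Googlebot", "/members/example-user.45/"))      # False

A check like this only shows how a spec-compliant parser reads the file, not what a sloppy crawler will do, which is exactly why splitting the groups is the safer option.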
 
Right, OK. I ran it through AI as well 🤣 It also said you don't need an Allow: / at the end. However, it also said that the wildcard in the Amazonbot entry is unlikely to be read/recognised.
 
I just looked up Amazonbot; it appears to gather data for Alexa. If you're not interested in being an Alexa answer, it is likely fine to just disallow it.
 
Thank you. I just took it out but I could add it in again. So you mean more like this?

User-agent: PetalBot
Disallow: /

User-agent: AspiegelBot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: MauiBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: ByteDanceBot
Disallow: /

User-agent: AnthropicBot
Disallow: /

User-agent: ImageSift
Disallow: /

User-agent: *
Disallow: /admin.php
Disallow: /whats-new/
Disallow: /account/
Disallow: /attachments/
Disallow: /login/
Disallow: /search/
Disallow: /help/
Disallow: /members/
Disallow: /register/
Disallow: /goto/

Sitemap: https://www.thehamsterforum.com/sitemap.xml
 
The position of the sitemap does not matter, top or bottom. The Sitemap directive is a standalone line, independent of the user-agent groups, so crawlers will pick it up wherever it appears in the file. You can also list multiple Sitemap lines if you have more than one sitemap.
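If you want to verify that, Python's standard-library parser (3.8+ for site_maps()) reports every Sitemap line it finds, wherever it sits in the file; a minimal sketch:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /account/

Sitemap: https://www.thehamsterforum.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Lists all Sitemap URLs found, regardless of where they appear.
print(parser.site_maps())  # ['https://www.thehamsterforum.com/sitemap.xml']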

Correct, I currently don't disallow/block any bots. I run my own servers and they all run at a very low load, so I don't worry about any site performance issues. At the same time, if I get a few more legitimate users from some obscure bot scraping my site for info, all the better.
 
Thank you. So it's OK where it is at the bottom then, after everything? I don't think I have any server overload issues, but Ahrefs is annoying as there are so many instances of it. Also, I thought ByteDance was supposed to be a bit iffy. ImageSift is quite recent, but I'm not keen on the idea of it displaying site images elsewhere unless they're watermarked. Which they aren't.
 
Yeah, unfortunately the worst offenders are usually the "bad" bots that won't follow your robots.txt directives anyway. It gets to be a losing battle quickly. You would need to block them through other means.
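For illustration only: "other means" usually means matching the User-Agent header at the web server or firewall. Below is a hypothetical app-level sketch using Flask; the bot names are examples rather than a recommended blocklist, and bots that forge their User-Agent will slip past it:

from flask import Flask, abort, request

app = Flask(__name__)

# Example substrings only - tune to what actually shows up in your logs.
BLOCKED_AGENTS = ("mj12bot", "dotbot", "ahrefsbot")

@app.before_request
def block_bad_bots():
    ua = request.headers.get("User-Agent", "").lower()
    if any(bot in ua for bot in BLOCKED_AGENTS):
        abort(403)  # rejected outright; robots.txt never comes into it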
 
Still a bit confused about the /community/ part, which presumably is the equivalent of /forums/?

My sitemap only comes up at mydomain.com/sitemap.xml.

If I put mydomain.com/forums/sitemap.xml, nothing comes up.

However, mydomain.com on its own is an articles page, which I use as a home page. So would any of those disallowed paths like /posts/ and /members/ still work without the actual forum part in the URL?
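If it helps, you can probe the usual locations directly. A throwaway sketch, with mydomain.com standing in for the real domain:

import urllib.error
import urllib.request

# XenForo serves the sitemap from wherever the forum root lives
# (the site root here, /community/ on xenforo.com).
for path in ("/sitemap.xml", "/community/sitemap.xml", "/forums/sitemap.xml"):
    url = "https://www.mydomain.com" + path
    try:
        with urllib.request.urlopen(url) as resp:
            print(url, "->", resp.status)
    except urllib.error.HTTPError as exc:
        print(url, "->", exc.code)  # e.g. 404 where nothing is served

Whichever of these returns XML tells you the prefix, if any, that the rest of your robots.txt paths would need to sit under.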
 