Optimizing Google Crawl Budget

Alfuzzy

Active member
I'm trying to optimize my site's Google crawl budget. My understanding is...items can be added to a site's robots.txt file to prevent Googlebot from crawling/indexing unnecessary items...and thus make better use of the Google crawl budget for the important stuff.

Can some of the experts out there please list what you feel:

  • Should be included in the robots.txt file to prevent Googlebot from wasting crawl budget.
  • Should NOT be included in the robots.txt file with regard to crawl budget...to make sure Googlebot DOES crawl/index it.

If there are other considerations for optimizing Google crawl budget (other than the robots.txt file)...would really appreciate those insights as well.

Thanks:)
 

djbaxter

Well-known member
robots.txt

Code:
User-agent: *
Disallow: /whats-new/
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /members/
Disallow: /admin.php
Disallow: /tags/
Allow: /
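
If you want to sanity-check rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser. The host example.com and the sample paths are placeholders I made up, not real XenForo URLs from anyone's site. Note also that Python's parser applies the first matching rule, while Googlebot is documented to pick the most specific one; for this particular file both approaches should agree.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, verbatim.
ROBOTS_TXT = """\
User-agent: *
Disallow: /whats-new/
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /members/
Disallow: /admin.php
Disallow: /tags/
Allow: /
"""

SITE = "https://example.com"  # placeholder host (assumption)

# Hypothetical paths, chosen only to illustrate the rules.
SAMPLE_PATHS = [
    "/threads/some-thread.123/",
    "/forums/general.2/",
    "/posts/456/",
    "/tags/seo/",
    "/members/someone.1/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Map each sample path to True (crawlable) or False (blocked).
results = {path: parser.can_fetch("Googlebot", SITE + path) for path in SAMPLE_PATHS}

for path, ok in results.items():
    print(f"{path:30} {'crawl' if ok else 'blocked'}")
```

Thread and forum URLs stay crawlable, while the per-post, tag, and member URLs are blocked.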
 

Alfuzzy

Active member
Thanks Much DJ.

Looking at what you listed.../tags/...this was one I've been wondering about (and my robots.txt file does not have). I will add it.

I'm curious about disallowing "posts". Is this because "threads" is already crawled...and crawling/indexing "posts" is a double-count of the exact same content?

Or even more than double...since an individual thread could have 10 posts or more...and each post would have a crawl count of one (10 for the whole thread).

Thanks:)
 

Chromaniac

Well-known member
Tags would depend on how well they are used on your forum. If they are a free-for-all, it makes no sense to clutter search results with them. I recently disabled tags for members and also made them noindex for the same reason. If you have a limited number of tags that are used judiciously, there is no harm in letting them be indexed, I suppose.
 

djbaxter

Well-known member
Alfuzzy said:
I'm curious about disallowing "posts". Is this because "threads" is already crawled...and crawling/indexing "posts" is a double-count of the exact same content?

Or even more than double...since an individual thread could have 10 posts or more...and each post would have a crawl count of one (10 for the whole thread).

Yes, exactly.
 

Alfuzzy

Active member
robots.txt

Code:
User-agent: *
Disallow: /whats-new/
Disallow: /account/
Disallow: /attachments/
Disallow: /goto/
Disallow: /posts/
Disallow: /login/
Disallow: /members/
Disallow: /admin.php
Disallow: /tags/
Allow: /

One more question please. What does the Allow: / mean?

Does this mean everything else (anything not Disallowed)...is ok to crawl/index?
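
As far as I understand the robots.txt rules, anything that is not matched by a Disallow line is crawlable by default, so Allow: / mostly just states that explicitly. A small sketch to illustrate, again with Python's standard urllib.robotparser; example.com and the sample paths are placeholders, and the two-rule files are cut down for brevity:

```python
from urllib.robotparser import RobotFileParser

WITH_ALLOW = """\
User-agent: *
Disallow: /posts/
Allow: /
"""

WITHOUT_ALLOW = """\
User-agent: *
Disallow: /posts/
"""

def allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # example.com is a placeholder host (assumption).
    return parser.can_fetch(agent, "https://example.com" + path)

# Hypothetical paths for illustration.
paths = ["/threads/example.1/", "/posts/2/", "/"]

with_allow = [allowed(WITH_ALLOW, p) for p in paths]
without_allow = [allowed(WITHOUT_ALLOW, p) for p in paths]

# Both files give the same answers for every path.
print(with_allow == without_allow)
```

So the Allow: / line does no harm, but leaving it out should not change what gets crawled.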
 

Alfuzzy

Active member
I modified my site's robots.txt file to add the entries mentioned above that were not already in it.

One more question. Should each of the items in the list be a separate sub-directory on my server for my XF install?

For example (if the XF install resides in the public_html directory on my server)...I'm not seeing a public_html/tags/ or public_html/posts/ sub-directory.

I'm familiar with how my old vBulletin 4 server file structure worked...maybe the XF file structure is different.

Just want to be sure the Googlebot is able to find these things properly.

Thanks
 

Alfuzzy

Active member
Good deal thanks. I think I did hear something about this somewhere else...but wanted to be 100% sure.

Is this more like a "virtual" directory system (not actual directories on the server)?
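
For what it's worth, robots.txt rules are compared against the path part of the URL as a plain prefix, not against folders on disk, so no matching sub-directory has to exist on the server. A simplified sketch of that idea (it ignores wildcards and Allow rules; the prefix list, host, and sample URLs are made up for illustration):

```python
from urllib.parse import urlsplit

# A few of the Disallow prefixes discussed in this thread.
DISALLOWED_PREFIXES = ["/posts/", "/tags/", "/members/"]

def blocked_by_prefix(url: str, prefixes=DISALLOWED_PREFIXES) -> bool:
    """Simplified check: is the URL path under any disallowed prefix?"""
    path = urlsplit(url).path or "/"
    return any(path.startswith(prefix) for prefix in prefixes)

# Hypothetical URLs on a placeholder host.
root_tag = blocked_by_prefix("https://example.com/tags/seo/")
thread = blocked_by_prefix("https://example.com/threads/tags-question.5/")
subdir_tag = blocked_by_prefix("https://example.com/community/tags/seo/")

print(root_tag, thread, subdir_tag)
```

The last case is worth noting: if the forum were installed in a sub-directory such as /community/, a rule written as /tags/ would not match, and the prefix would need to be /community/tags/ instead.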

Also...if I wanted to add the sitemap location to my robots.txt file...what would be the proper "sitemap" statement?

Thanks again.:)
 

Alfuzzy

Active member

djbaxter

Well-known member
Do I use


or


The reason why I ask is...there's a difference between what DJBaxter mentioned in post #2...and what member "Mouth" linked in post #10 above.

DJ's robots.txt does not have the "community" part...and the link "Mouth" posted does have the "Community" part.

Thanks
The robots.txt line for the sitemap if your forum is in the root directory is this:

Sitemap: https://{YOURSITE}.com/sitemap.xml

If the forum is in a subdirectory, it would look like this:

Sitemap: https://{YOURSITE}.com/{FORUMDIRECTORY}/sitemap.xml

Using sitemap.php will also work:

Sitemap: https://{YOURSITE}.com/sitemap.php

Sitemap: https://{YOURSITE}.com/{FORUMDIRECTORY}/sitemap.php

The line posted above by @Mouth is the one for Xenforo, where the actual forum is in a subdirectory called "community".

And the line posted by @Chromaniac in post #14 above lacks the Sitemap: prefix that is needed in robots.txt.
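
To show why the Sitemap: prefix matters, here is a rough sketch using Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later). The example.com URL is a placeholder, and this only shows how one parser reads the file, not how Google itself processes it:

```python
from urllib.robotparser import RobotFileParser

# With the "Sitemap:" prefix, the parser recognizes the sitemap location.
WITH_PREFIX = """\
User-agent: *
Disallow: /posts/

Sitemap: https://example.com/sitemap.xml
"""

# A bare URL without the prefix is not a valid robots.txt directive.
WITHOUT_PREFIX = """\
User-agent: *
Disallow: /posts/

https://example.com/sitemap.xml
"""

def find_sitemaps(robots_txt: str):
    """Return the list of sitemap URLs declared in the text, or None if there are none."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps()

declared = find_sitemaps(WITH_PREFIX)
missing = find_sitemaps(WITHOUT_PREFIX)

print(declared)
print(missing)
```

The first file yields the sitemap URL, while the bare URL in the second file is simply ignored.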
 

Alfuzzy

Active member
Awesome. Thanks so much for explaining in detail DJ. Also thanks for clarifying what was linked in post #10. I thought that was the actual robots.txt for xenforo.com (here)...and figured the "Community" part was the way Xenforo had things structured/organized (but now I know 100%).

Yes my XF install is in the root directory...and I will use the first statement you mentioned above.

Thanks!:)
 

Ozzy47

Well-known member
I believe that's because you can use sitemap.xml rather than sitemap.php if you have friendly URLs enabled.
 