Optimizing Google Crawl Budget

frm

Well-known member
Didn't have any idea about a crawl budget. Just updated from /login/ and /register/ to the suggested... maybe it'll crawl less and rank higher in the coming weeks to months?
 

Alfuzzy

Active member
Was working on my robots.txt today...specifically the sitemap statement. XF is installed in the server root directory. There's a sitemap.php file in the server root directory...but it's dated from about 3 months ago. I'm guessing this is not the correct sitemap file...maybe an artifact of some sort when the site was migrated from vBulletin to XF a couple months ago.

I then did a server search for "sitemap"...and I see the most recent sitemaps seem to be in the directory:

public_html/internal_data/sitemaps/

In this directory there are 6 files:

sitemap-2000345678-1.xml.gz
sitemap-2000345678-2.xml.gz
sitemap-2000345678-3.xml.gz
sitemap-2000345678-4.xml.gz
sitemap-2000345678-5.xml.gz
sitemap-2000345678-6.xml.gz

The dates of these files are from yesterday (9-16-20)...which corresponds correctly with what I'm seeing in the AdminCP.

Based on info mentioned earlier in this thread...I thought my robots.txt file sitemap statement would be:

Sitemap: https://www.example.com/sitemap.xml

Based on the info above...is this still true...or should the robots.txt sitemap statement be something different?

Thanks
 

djbaxter

Well-known member
sitemap.xml is fine and it's what I think @Chris D recommended recently in another thread.

What you're seeing is just the gzipped file. That's the .gz at the end of the filenames.
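
For anyone who wants to peek inside one of those .gz files, here is a minimal Python sketch. It assumes the files follow the standard sitemaps.org `urlset` format; the sample XML and thread URLs are made up for illustration and stand in for a real `sitemap-...-N.xml.gz` file.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def count_urls(gz_bytes):
    """Decompress a gzipped sitemap and count its <url> entries."""
    xml_bytes = gzip.decompress(gz_bytes)
    root = ET.fromstring(xml_bytes)
    return len(root.findall(f"{SITEMAP_NS}url"))

# Hypothetical stand-in for the contents of one .xml.gz sitemap file
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/threads/one.1/</loc></url>
  <url><loc>https://www.example.com/threads/two.2/</loc></url>
</urlset>"""
sample_gz = gzip.compress(sample_xml)

print(count_urls(sample_gz))
```

To try it on a real file, read the bytes with `open(path, "rb").read()` and pass them to `count_urls`.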
 

Alfuzzy

Active member
Are you guys saying that although the actual sitemap files are in the directory:

public_html/internal_data/sitemaps/

...that the robots.txt sitemap statement of:

Sitemap: https://www.example.com/sitemap.xml

Will get the job done?

Thanks

p.s. Obviously "example.com" will be replaced with the actual URL.:)
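
A quick way to double-check that the Sitemap line is being picked up is to run the robots.txt text through Python's standard-library parser. This is only a sketch: the Disallow lines are illustrative assumptions, and `site_maps()` needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; only the Sitemap line comes from the thread
robots_txt = """\
User-agent: *
Disallow: /login/
Disallow: /register/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns the list of Sitemap URLs, or None if there are none
sitemaps = parser.site_maps()
print(sitemaps)
```

If the list comes back empty or `None`, the Sitemap line is probably misspelled or missing the colon.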
 

Alfuzzy

Active member
Awesome...thanks again. That's a great double-check!

Would this sitemap URL also be the exact same URL I would submit within the Google Search Console?

p.s. Just tested the sitemap URL...and it works great!:)
 

Alfuzzy

Active member
Excellent. One more quick question regarding this sitemap stuff. Here's a screenshot of my XF AdminCP sitemap settings.

Any suggestions regarding what could be added...changed...deleted? Main goal is to optimize Google crawl budget...not have anything there that's wasted effort/wasted resources. Thanks

2nd question...Does it affect/hurt the Google crawl budget if there is more than one submitted sitemap?


sitemap settings.png
 

djbaxter

Well-known member
I don't think it hurts anything.

However, uncheck Tag and User in that list. All you need are Nodes and Threads. The rest is just noise.
 

djbaxter

Well-known member
All you need is

Code:
https://www.google.com/webmasters/tools/ping?sitemap={$url}
https://www.bing.com/ping?sitemap={$url}
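
To show what the `{$url}` placeholder ends up as, here is a rough Python sketch that fills the two templates with a percent-encoded sitemap URL. Whether XF encodes the URL exactly this way is an assumption, and the helper name `build_ping_urls` is made up for the example. Note that Google has since announced the retirement of its sitemap ping endpoint, so treat this as an illustration of the substitution rather than something to rely on.

```python
from urllib.parse import quote

# The two ping templates from the post, with {$url} written as {url}
PING_TEMPLATES = [
    "https://www.google.com/webmasters/tools/ping?sitemap={url}",
    "https://www.bing.com/ping?sitemap={url}",
]

def build_ping_urls(sitemap_url):
    """Return each ping template with the sitemap URL percent-encoded."""
    encoded = quote(sitemap_url, safe="")
    return [template.format(url=encoded) for template in PING_TEMPLATES]

for ping in build_ping_urls("https://www.example.com/sitemap.xml"):
    print(ping)
```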
 

Alfuzzy

Active member
Excellent...thanks DJ.:)

I thought based on what you recommended earlier in the thread...that unchecking tag & user might be the way to go.
 

Mr Lucky

Well-known member
I'm trying to optimize my site's Google crawl budget. My understanding is...items can be added to a site's robots.txt file to prevent Googlebot from crawling/indexing unnecessary items...and thus better utilize the Google crawl budget for the important stuff.

What exactly is the crawl budget?

NB: my understanding is that robots.txt, while it stops crawling by Google, doesn't stop it indexing.
 

djbaxter

Well-known member
Google can't index pages it doesn't know about. Crawling is how Googlebot finds your pages. Since crawling is limited, you don't want Googlebot wasting time crawling or trying to crawl pages you don't want to be indexed.
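
As a rough illustration of the crawling side of this, the Python sketch below checks which paths a set of robots.txt rules would block for Googlebot. The rules and the thread URL are assumptions made for the example. It only tests crawl permission; as noted above, a Disallow rule does not by itself keep a URL out of the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules blocking the login and register pages
rules = """\
User-agent: *
Disallow: /login/
Disallow: /register/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

def is_crawlable(path, agent="Googlebot"):
    """Return True if the rules allow the agent to fetch the path."""
    return parser.can_fetch(agent, "https://www.example.com" + path)

for path in ("/login/", "/register/", "/threads/example.1/"):
    print(path, "crawlable" if is_crawlable(path) else "blocked")
```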



Google doesn’t always spider every page on a site instantly. In fact, sometimes, it can take weeks. This might get in the way of your SEO efforts. Your newly optimized landing page might not get indexed. At that point, it’s time to optimize your crawl budget. We’ll discuss what a ‘crawl budget’ is and what you can do to optimize it in this article.

Crawl budget is the number of pages Google will crawl on your site on any given day. This number varies slightly from day to day, but overall, it’s relatively stable. Google might crawl 6 pages on your site each day, it might crawl 5,000 pages, it might even crawl 4,000,000 pages every single day. The number of pages Google crawls, your ‘budget’, is generally determined by the size of your site, the ‘health’ of your site (how many errors Google encounters) and the number of links to your site.
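
To put those numbers in perspective, here is a back-of-the-envelope Python sketch of how long one full pass over a site would take at different daily crawl rates. The 100,000-page site size is a made-up figure, and the estimate assumes no page is crawled twice, which real crawling does not guarantee.

```python
import math

def days_to_crawl(total_pages, pages_per_day):
    """Days needed for one full pass, assuming no page is crawled twice."""
    return math.ceil(total_pages / pages_per_day)

site_pages = 100_000  # hypothetical site size

# Daily crawl rates taken from the examples in the quoted article
for daily in (6, 5_000, 4_000_000):
    print(daily, "pages/day:", days_to_crawl(site_pages, daily), "days")
```

At 6 pages a day a full pass would take decades, which is why trimming low-value URLs matters more on sites with a small budget.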
 