Google Indexing /goto/ links

cmeinck

Well-known member
I noticed a jump in my indexation last week. Despite having /goto/ blocked in robots.txt, Google indexed 20K URLs. I looked here at XenForo and the same has happened.

Searching Google with site:xenforo.com inurl:/goto/ reveals over 6K URLs.

You can remove these using directory removal, but your indexation numbers will be incorrect.

Thoughts?
 
I noticed a jump in my indexation last week. Despite having /goto/ blocked in robots.txt, Google indexed 20K URLs. I looked here at XenForo and the same has happened.

Searching Google with site:xenforo.com inurl:/goto/ reveals over 6K URLs.

Is it of practical relevance though? I don't think it matters in terms of SEO. Notice those /goto/ links were omitted by default in Google search, and if I click to see the omitted results, it says under each goto link that Google was blocked by robots.txt. So Google knows that there are links pointing at /goto/, but it doesn't index them.
 
This XF site has only one indexed, despite minimal SEO effort:
site:gotvirtual.net inurl:/goto/
I think you all overthink Google.
 
Is it of practical relevance though? I don't think it matters in terms of SEO. Notice those /goto/ links were omitted by default in Google search, and if I click to see the omitted results, it says under each goto link that Google was blocked by robots.txt. So Google knows that there are links pointing at /goto/, but it doesn't index them.

If you monitor your indexation, the numbers reported in WMT can be clouded by these URLs. In theory, Google shouldn't index them, but they are indexing them. Does it affect your site from an SEO perspective? Probably not, but it certainly impacts your ability to correctly assess your indexation numbers. I'd prefer to have WMT reports be as close as possible to my actual indexation.
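
One rough way to see how far the /goto/ URLs skew a reported count is to split a list of indexed URLs by path. This is only a sketch: the `split_indexed_urls` helper and the `/community/goto/` prefix are assumptions for illustration (the prefix matches the xenforo.com URLs in this thread, so adjust it for your own install), and the URL list would have to come from whatever export you have available.

```python
from urllib.parse import urlsplit


def split_indexed_urls(urls, blocked_prefix="/community/goto/"):
    """Separate URLs under the blocked prefix from real content URLs.

    `blocked_prefix` is an assumed path; change it to match your forum.
    """
    blocked, content = [], []
    for url in urls:
        path = urlsplit(url).path
        if path.startswith(blocked_prefix):
            blocked.append(url)
        else:
            content.append(url)
    return blocked, content


# Sample input built from URLs quoted in this thread.
indexed = [
    "https://xenforo.com/community/goto/post?id=825496",
    "https://xenforo.com/community/threads/google-indexing-goto-links.85161/post-948960",
]

blocked, content = split_indexed_urls(indexed)
print(f"reported: {len(indexed)}, under /goto/: {len(blocked)}, adjusted: {len(content)}")
```

The adjusted figure is just the reported total minus the /goto/ entries, which is the number I'd actually want to track.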
 
In theory, Google shouldn't index them, but they are indexing them.

How so? At least when I do the search you pointed to above, under each omitted Goto link in the Google results, it states, e.g.:

https://xenforo.com/community/goto/post?id=825496
A description for this result is not available because of this site's robots.txt – learn more.

From the Google robots.txt help page (emphasis mine):

While Google won't crawl or index the content blocked by robots.txt, we might still find and index information about disallowed URLs from other places on the web. As a result, the URL address and, potentially, other publicly available information such as anchor text in links to the site can still appear in Google search results. You can stop your URL from appearing in Search results completely by using your robots.txt in combination with other URL blocking methods, such as password-protecting the files on your server, or inserting meta tags into your HTML.
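
To make the crawl-versus-index distinction concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content is an assumed example (the actual xenforo.com file is not shown in this thread), and `is_crawl_allowed` is a hypothetical helper. The parser uses simple prefix matching, so it is not a replica of Googlebot's own matcher. It only answers "may this URL be fetched?", which says nothing about whether the bare URL can still be listed in results.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt rules for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /community/goto/
"""


def is_crawl_allowed(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


goto_url = "https://xenforo.com/community/goto/post?id=825496"
thread_url = "https://xenforo.com/community/threads/google-indexing-goto-links.85161/post-948960"

print("goto crawlable:", is_crawl_allowed(ROBOTS_TXT, goto_url))
print("thread crawlable:", is_crawl_allowed(ROBOTS_TXT, thread_url))
```

A URL that fails this check can still show up as a URL-only result with the "description is not available" note, which matches what the help page describes.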
 
My emphasis:

While Google won't crawl or index the content blocked by robots.txt, we might still find and index information about disallowed URLs from other places on the web.

Point being, we control our sites and we should be able to control what's being indexed. They aren't finding these links from other places on the web. If someone were to blog and put a link to /goto/, then I could see it finding its way into the index. Blocking a directory from Googlebot should prevent widespread indexation.

I agree, these likely have little effect on your site's SEO. I'm just in favor of having an indexation number in WMT that is a true representation of your content.
 
The goto/ handler will just 301 redirect to the correct location (or return an appropriate error code if necessary). If you take it out of robots.txt, then you can let Google figure out what to do with them. It doesn't mean they'll disappear, though; Google doesn't seem to instantly follow 301s or deindex pages with error statuses.
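
If you want to see what a goto URL returns for yourself, a sketch along these lines should work with only the standard library. The helper names `head_without_redirects` and `describe_response` are made up for this example, and it assumes the server answers HEAD requests the same way it answers GET, which is not guaranteed. `http.client` does not follow redirects, so the status and `Location` header come back untouched.

```python
import http.client
from typing import Optional, Tuple
from urllib.parse import urlsplit


def head_without_redirects(url: str, timeout: float = 10.0) -> Tuple[int, Optional[str]]:
    """Send one HEAD request and return (status, Location header) without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe_response(status: int, location: Optional[str]) -> str:
    """Summarize how a goto URL responded."""
    if status == 301 and location:
        return f"permanent redirect to {location}"
    if 300 <= status < 400:
        return f"other redirect ({status})"
    if status >= 400:
        return f"error status {status}"
    return f"served directly ({status})"


# Requires network access, so it is left commented out:
# status, location = head_without_redirects("https://xenforo.com/community/goto/post?id=825496")
# print(describe_response(status, location))
```

Note this only shows what the server sends back; it tells you nothing about when Google will act on it.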

I just noticed those yesterday. I'm unclear as to where XF uses goto links in its URLs--where are they coming from?
See the arrow on this quote.
 
Wouldn't it make sense to not block these in robots.txt? Technically, Googlebot should follow the 301 redirect to the correct URL (appended with #post). The canonical should prevent indexation of those URLs, correct?
 
It'd be your call. XenForo doesn't ship with a robots.txt and I've never claimed that what we have in ours is "correct" or "ideal". We may run experiments at times as well.

If you don't block it, then it could lead to Google requesting the links unnecessarily, as you know it's never going to lead to actual content. (Note that even when a page 301s, it may still appear in Google temporarily before it follows it through; I've seen it happen with redirects. Thus, the URLs may still appear to be indexed anyway.)
 
If you don't block it, then it could lead to Google requesting the links unnecessarily, as you know it's never going to lead to actual content. (Note that even when a page 301s, it may still appear in Google temporarily before it follows it through; I've seen it happen with redirects. Thus, the URLs may still appear to be indexed anyway.)
Would this apply to links like this:

https://xenforo.com/community/threads/google-indexing-goto-links.85161/post-948960

Links with the "/post-xxxx" at the end of them?

Also with the "/latest" at the end of them?

I'm seeing a lot of these original 301 redirected pages stay in the Google index for years. It's like the redirect never actually canonicalizes in the index.

Thanks.
 
Does anyone know if there's a way to bulk remove indexed goto URLs? Currently, they are blocked in our robots.txt, but Google Search Console is throwing an "Indexed, though blocked by robots.txt" error.
 
Unfortunately, there is no way to bulk remove anything. You can bulk remove the URLs from appearing in the search results, but you can't get them out of the index manually. If they're already blocked in your robots.txt file, they should slowly fall out of the index naturally. How long have they been blocked?
 