XF 2.1 Google reports "soft 404" for attachment URLs

dethfire

Well-known member
Anyone else seeing this? All 84K of my attachments are marked as "soft 404". Going to the attachment URL shows the attachment just fine.
 
I think it's a Google bug. I received notifications about "soft 404s" on two threads for one of the forums I manage. Both of the URLs loaded quickly and correctly. Neither was in a private forum; both were visible to guests.


What is a soft 404?
A soft 404 is a URL that returns a page telling the user that the page does not exist and also a 200-level (success) code. In some cases, it might be a page with little or no content--for example, a sparsely populated or empty page.
Why does it matter?
Returning a success code, rather than 404/410 (not found) or 301 (moved), is a bad practice. A success code tells search engines that there’s a real page at that URL. As a result, the page may be listed in search results, and search engines will continue trying to crawl that non-existent URL instead of spending time crawling your real pages.
What should I do?
  • If your page is no longer available, and has no clear replacement, it should return a 404 (not found) or 410 (Gone) response code. Either code clearly tells both browsers and search engines that the page doesn’t exist. You can also display a custom 404 page to the user, if appropriate: for example, a page containing a list of your most popular pages, or a link to your home page.
  • If your page has moved or has a clear replacement, return a 301 (permanent redirect) to redirect the user as appropriate.
  • If you think that your page is incorrectly flagged as a soft 404, use the URL Inspection tool to examine the rendered content and the returned HTTP code. If the rendered page is blank, or nearly blank, it could be that your page references many resources that can't be loaded (images, scripts, and other non-textual elements), which can be interpreted as a soft 404. Reasons that resources can't be loaded include blocked resources (blocked by robots.txt), having too many resources on a page, or slow loading/very large resources. The URL Inspection tool should list which resources could not be loaded, and also show you the rendered live page.
Use the URL Inspection tool to verify whether your URL is actually returning the correct code.
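Google's definition above boils down to: a URL that answers with a 200 (success) code but whose body is empty or reads like an error page. As a rough sketch of that idea (the length threshold and phrases below are my own illustrative assumptions, not Google's actual algorithm), the check looks like this:

```python
# Hypothetical soft-404 heuristic: a response that claims success (200) but
# whose body is empty or reads like a "not found" page. The threshold and
# phrase list are illustrative assumptions, not Google's real detector.

NOT_FOUND_PHRASES = ("page not found", "no longer exists", "nothing was found")

def looks_like_soft_404(status: int, body: str, min_length: int = 100) -> bool:
    if status != 200:
        return False  # a real 404/410 is a hard error, not a soft one
    text = body.strip().lower()
    if len(text) < min_length:
        return True   # nearly empty page served with a success code
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)

print(looks_like_soft_404(200, ""))                # True: empty success page
print(looks_like_soft_404(404, "Page not found"))  # False: correctly reported
```

The point of the sketch is the first branch: a genuine 404/410 can never be a soft 404, which is why the URL Inspection tool's reported HTTP code is the first thing to verify.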
None of this applied in my case. I suspect either that something timed out for Google (unlikely: the forum is on a fast dedicated server and there have been no recent server outages) or, more likely, that this is one of the numerous bugs Google has been experiencing in recent months.
 
Hmmm, well, this Google bug means no attachment images are being indexed. The only thing I see is that the header status for image attachments is 304. I am using a LiteSpeed server.
 
Google has been experiencing a series of cascading errors in the past couple of months. At one time, they weren't able to index any new content at all for a few days.

But who knows? You can try to "fix" that 304 header response.
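For what it's worth, a 304 just means the crawler sent a conditional request (If-None-Match / If-Modified-Since) and the file hasn't changed, so it is normal behavior rather than an error. If you want to rule it out anyway, one option on Apache-compatible servers (LiteSpeed reads .htaccess too) is to disable ETags, so If-None-Match revalidation can no longer produce a 304. This is only a sketch of the idea, not a recommendation; the trade-off is more bandwidth:

```apache
# Sketch for an Apache/LiteSpeed .htaccess: stop emitting ETags so
# If-None-Match conditional requests get a full 200 with the body
# instead of 304 Not Modified. If-Modified-Since can still cause 304s.
<IfModule mod_headers.c>
    Header unset ETag
</IfModule>
FileETag None
```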

 
That's a very bad idea from an SEO standpoint. If a page doesn't exist, the server should correctly report it as 404. Among other things, search engines need that information to drop that URL out of their indices. It's also important for human visitors to know they have an invalid URL: redirecting to the home page would just be confusing ("how did I get here when I was looking for another page?").
 
The 404 URLs were removed from the website months ago; I mean, the URLs that show 404 errors in Search Console don't exist anymore. That's why I wanted to redirect the 404 error pages to the homepage.
 
"If a page doesn't exist, the server should correctly report it as 404."

What exactly should I do to report these pages so that Search Console drops the URLs out of the index?
 
You can return a 410 for those pages from an .htaccess file. Note that Apache's Redirect directive takes the URL path (not the full URL), and with a 410 there is no redirect target:

Code:
Redirect 410 /yourpage.html
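If many of the removed URLs share a common pattern, RedirectMatch (the regex form of the same directive, on Apache-compatible servers) covers them all in one rule. The /old-gallery/ path below is a made-up example, not something from this thread; substitute your own pattern:

```apache
# Sketch: answer 410 Gone for every URL under a removed directory.
# "/old-gallery/" is a hypothetical path -- replace with your own.
RedirectMatch 410 ^/old-gallery/.*$
```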


What Is a 410 Error?
4xx status codes mean that there was probably an error in the request, which prevented the server from being able to process it. Specifically, a 410 error means “gone.”

In Google’s terms, “the server returns this response when the requested resource has been permanently removed. It is similar to a 404 (Not found) code, but is sometimes used in the place of a 404 for resources that used to exist but no longer do.”

410 vs. 404
410 errors aren't quite the same as 404 errors, which only indicate the page is "not found." In some cases a 410 is better than a 404 because it carries more information: it tells search engine robots definitively that the resource was removed on purpose and that the old link should be dropped from their crawl index, which can prevent unnecessary crawl traffic.
 
Google is still reporting 85K of my image URLs as soft 404s.

I've experienced the same thing in the past, including on one of my WordPress blogs where each post image linked to its own page containing only that image. I believe each image posted within a XenForo thread likewise links to its own "page" with nothing but an image on it. It doesn't seem like a page, because to the site visitor the image opens in a lightbox, but Google considers it a page in its own right and marks it as empty, hence the "Soft 404" marker in Google Search Console.

When this began happening, I suspected it wasn't a good thing, so I went ahead and blocked the /attachments/ directory in my robots.txt file. I currently have about 12,000 attachments and I'm now watching them pile up in the Google Search Console. I have no idea whether having blocked URLs is any better than having soft 404s, but that's what I'm currently testing. In the past I've found that once a URL is marked as blocked by robots, it sits there and then falls out of the index after 90 days. That's what I'm hoping for here. My images are still being indexed because they actually reside in a different directory.

For example, take a look at this structure:

Code:
<a href="/forum/attachments/image-1-jpeg.23992/" target="_blank" class="js-lbImage">
<img src="/forum/data/attachments/23/image-1.jpg" alt="IMAGE-1.jpeg" />
</a>

As you'll see, the image "page" resides here:

/forum/attachments/image-1-jpeg.23992/

This is treated as a page by Google and is its own URL.

The image itself resides here:

/forum/data/attachments/23/image-1.jpg

This is what's being indexed by Google Images.

I believe the only way to stop Google from crawling these empty page URLs and marking them as soft 404s is either to block them in your robots.txt file or to set your forum permissions so unregistered guests can't view them. I tried the permissions route in the early stages and found it suboptimal, because Googlebot spent so much time trying to access those URLs that it wasn't crawling more important ones. Once I blocked the /attachments/ directory instead, Google left those pages alone and crawled the other pages on the site.

Please let me know if this helps.

Jay
 
Be careful. Don't use Notepad to edit .htaccess. Use File Manager from your cPanel or Notepad++.

Excellent advice. Regular Notepad adds some hidden characters to the first line of the .htaccess file and, when the file is re-uploaded to the server, that will take your entire site down. I learned that the hard way. Now I edit the .htaccess file directly in cPanel.
 
This robots.txt file content below has been shared here before and is used by many XenForo admins. Adjust as necessary for your installation.
Code:
Sitemap: https://www.yoursite.com/community/sitemap.xml

User-agent: *
Disallow: /community/admin.php
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/login/
Disallow: /community/members/
Disallow: /community/posts/
Disallow: /community/register/
Disallow: /community/search/
Disallow: /community/whats-new/
 
I believe that essentially each image posted within a thread in Xenforo is linking to its own "page" with only an image on it. It doesn't seem like a page because to the site visitor, the image opens in a lightbox. Google is considering this a page in and of itself and marking it as empty, hence the "Soft 404" marker in the Google Search Console.
Good insight and that sounds like a reasonable explanation. XF should not use a lightbox for a direct image resource.
 
I don't think the lightbox explains it. When I use the URL Inspection tool on that direct URL, it comes back with a robots-block error for me, even though those URLs aren't blocked by robots.
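If Search Console claims a robots block you don't believe, you can sanity-check the rules locally with Python's standard-library robots.txt parser before blaming Google. The rules below copy the /community/attachments/ disallow from the robots.txt shared earlier in this thread; adjust the paths and the user agent to match your own file:

```python
# Check locally whether a URL path is actually disallowed by robots.txt
# rules, using only the standard library (no network access needed).
from urllib.robotparser import RobotFileParser

# Mirrors two lines of the robots.txt posted earlier in this thread.
RULES = """\
User-agent: *
Disallow: /community/attachments/
Disallow: /community/goto/
""".splitlines()

def is_blocked(path: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules disallow `agent` from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return not parser.can_fetch(agent, path)

print(is_blocked("/community/attachments/image-1-jpeg.23992/"))   # True
print(is_blocked("/community/data/attachments/23/image-1.jpg"))   # False
```

If this says a URL is not blocked but the URL Inspection tool disagrees, the discrepancy is on Google's side (or the robots.txt Google fetched differs from the one on disk), which is worth knowing before changing anything.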
 