Adding Lines to Robots.txt to Block Specific URLs

Alfuzzy

Well-known member
Hello all.

When my site was migrated from vB to XF...some custom member badge images came along for the ride. When I scan the site for errors...these badge image file URLs are showing up quite a bit in the scans. I would like to block them from the Google crawler. There are a couple of other misc URLs I would like to block as well.

Here are the URLs:

https://www.example.com/forums/images/red_badge.gif
https://www.example.com/images/badges/Banned.png
https://www.example.com/forums/images/badges/SeniorMember.png
https://www.example.com/forums/search.php?do=getdaily
https://www.example.com/forums/register.php

I did some research on how to block things in robots.txt. I'm kind of new to this...some of it was confusing. I want to get it right...so I don't end up blocking lots of possibly important stuff.

What lines would I need to add to the site's robots.txt to block each of the 5 URLs above? I just want to block these URLs. If each one requires a separate line in the robots.txt...that would be ok.

Thanks for the help:)
 
Adding entries to robots.txt doesn't prevent the URLs from being accessed.
Some crawlers abide by robots.txt, others don't.

If you want to prevent the URLs from being accessed at all, use a rewrite rule in .htaccess (or the equivalent).
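For what it's worth, a rule along those lines might look something like the sketch below, assuming an Apache server with mod_rewrite and an .htaccess in the web root (the paths are the example.com placeholders from post #1, so they would need to match the real paths):

Code:
RewriteEngine On
# Return 403 Forbidden for these paths (note: this denies everyone, not just crawlers)
RewriteRule ^forums/images/red_badge\.gif$ - [F,NC]
RewriteRule ^images/badges/Banned\.png$ - [F,NC]
RewriteRule ^forums/images/badges/SeniorMember\.png$ - [F,NC]

The [F] flag makes the server answer with 403 Forbidden, so anything requesting those URLs gets nothing back to index.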
 
Yes, I heard some crawlers don't obey robots.txt (I read the Google crawler is supposed to follow robots.txt). Since Google is the "big one"...I figured at least blocking Google for these URLs would be a good place to start.

I think working with rewrite rules is a little bit beyond my skills at this point. Robots.txt I understand a bit better...and figured this was at least a good place to start (rather than not doing anything at all). Lol

Thanks:)
 
I figured at least blocking Google for these URLs would be a good place to start.
Just keep in mind robots.txt doesn't block anything. Think of it as just a note saying "please don't index the following." Also, anything you include in robots.txt is public.
 
Just keep in mind robots.txt doesn't block anything. Think of it as just a note saying "please don't index the following." Also, anything you include in robots.txt is public.
Ohh yes, I understand all of that (including the public part). :) My understanding is that Google is supposed to obey robots.txt...and as we all know (at least at the moment)...Google pretty much runs the internet. Lol
 
Here is something to get you started with for .htaccess
Hello Muddy Boots. Thanks very much for the help!:)

  • Do I follow the same format for the other 2 lines I mentioned I wanted to block?
  • Do these lines need to go somewhere special in the .htaccess?
  • I know my .htaccess already has some rewrite rules in it...how do I incorporate these lines without messing anything up (I know the "RewriteEngine On" part is already definitely there)? See the sketch just after this list.
  • I guess maybe I get "rewrite rules" and "redirects" mixed up. How do these rewrite rules "block or hide" things from crawlers?
  • Is there a way I can test how rewrite rules work...to make sure I'm doing things correctly?
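As a rough sketch only, and assuming the usual XenForo rewrite rules are already in the file, the new lines could sit just after the existing RewriteEngine On directive, something like:

Code:
# Existing directive stays as it is
RewriteEngine On

# New lines (sketch): return 403 Forbidden for these paths,
# placed before the existing XenForo rewrite rules
RewriteRule ^forums/images/red_badge\.gif$ - [F,NC]
RewriteRule ^images/badges/Banned\.png$ - [F,NC]

# ...the existing XenForo rewrite rules continue below, unchanged...

A 403 response means the server refuses the request outright; that's how a rewrite rule "blocks" a URL rather than redirecting it. A simple check after adding the rules is to open one of the URLs in a browser and confirm it now returns a 403 error page.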

Apologies for all the questions...trying to learn a little bit at a time.

Thanks much!:)
 
Ohh yes, I understand all of that (including the public part). :) My understanding is that Google is supposed to obey robots.txt...and as we all know (at least at the moment)...Google pretty much runs the internet. Lol
It's considered one of the "noble" crawlers that can adjust according to the robots.txt file. It's not a "supposed to" thing, it's more of a "supposedly does" thing. It doesn't always happen. Anyhow, use of the word "block" regarding robots.txt is misleading for the uninitiated.
 
It's considered one of the "noble" crawlers that can adjust according to the robots.txt file. It's not a "supposed to" thing, it's more of a "supposedly does" thing. It doesn't always happen. Anyhow, use of the word "block" regarding robots.txt is misleading for the uninitiated.
Appreciate the opinions & clarity of statements for the less IT experienced.:)

Any chance I can get some help with my request in post #1 above please?

Thanks
 
Appreciate the opinions & clarity of statements for the less IT experienced.:)

Any chance I can get some help with my request in post #1 above please?

Thanks
Post numbers 2 and 5 are your huckleberry. Google can help too. Working with .htaccess or its equivalent isn't difficult to self-teach. And if the results are wrong or objectionable, it's not permanent. You can't do irreparable harm. You can erase and start over.
 
Can someone help with my request in post #1 please?

Replying via Conversation/PM is perfectly cool too.

Thanks much:)
 
So, to me it is not clear what you want to do. If you don't want those URLs indexed in search engines, posting them here sort of defeats that purpose. You can add them to your robots.txt file, but you're still going to find them indexed - maybe not by Google itself. It's a dice roll.
 
So, to me it is not clear what you want to do. If you don't want those URLs indexed in search engines, posting them here sort of defeats that purpose. You can add them to your robots.txt file, but you're still going to find them indexed - maybe not by Google itself. It's a dice roll.
I wanted to add disallow lines to the robots.txt to prevent Google from crawling & indexing the URLs mentioned in post #1.

My apologies if this was unclear. Thanks
 
I wanted to add disallow lines to the robots.txt to prevent Google from crawling & indexing the URLs mentioned in post #1.
I agree with max taxable. Google will crawl and index them via this site unless you can get XenForo to add them to their robots.txt! Or redirect or delete/edit the post.

I would remove the links from the site, and remove the images from the server
 
I agree with max taxable. Google will crawl and index them via this site unless you can get XenForo to add them to their robots.txt!
Please look at post #1 carefully...the URLs use www.example.com. My website is NOT example.com! That's just a placeholder for the actual URLs I'm working with. The format of the examples in post #1 is correct for what I'm working with (example.com is used instead of the actual domain).
I would remove the links from the site, and remove the images from the server
The images are in use on the website (I don't want to delete them). I just don't want Google to crawl them.

Please guys...I appreciate the opinions & extra information...but I'm really not looking for a protracted conversation...I just want a solution to my request in post #1.

I'm looking for the proper format to add "disallow" lines to my robots.txt for the URLs listed in post #1.

Thanks:)
 
When you said:

When my site was migrated from vB to XF...some custom member badge images came along for the ride.

That implied you didn't want them, hence I suggested deleting them.

I'm looking for the proper format to add "disallow" lines to my robots.txt for the URLs listed in post #1.
How about:

Code:
User-agent: *
Disallow: /forums/images/

(each disallow on a new line)

or

Code:
User-agent: *
Disallow: /forums/images/red_badge.gif
 
Hello Mr Lucky. Yes...that's pretty much what I was looking for.:)

Pretty easy I guess. Wanted to be 100% certain I got the syntax correct. I did some internet research on this before posting the thread...and some sites seemed to be making it overly complex to add disallow lines to robots.txt for each of the URLs mentioned in post #1.

If I follow the same format you gave...would the following be correct for all 5 URLs mentioned in post #1?

User-agent: *
Disallow: /forums/images/red_badge.gif
Disallow: /images/badges/Banned.png
Disallow: /forums/images/badges/SeniorMember.png
Disallow: /forums/search.php?do=getdaily
Disallow: /forums/register.php


Also just to be sure...do I only need to add the User-agent: * line once for all 5 disallow lines?

Thanks very much for the help sir!:)

p.s. Here are the URLs from post #1 for easy reference:

https://www.example.com/forums/images/red_badge.gif
https://www.example.com/images/badges/Banned.png
https://www.example.com/forums/images/badges/SeniorMember.png
https://www.example.com/forums/search.php?do=getdaily
https://www.example.com/forums/register.php
 