robots.txt question

Ryan Kent

Well-known member
I have an RSS feed which automatically posts in a forum. This forum is used exclusively for this feed. I want to block this forum from being crawled. What is the best method?

I tried adding the path to the node to my robots.txt file and that did not help. I realize the path to all threads is /threads. Can I use a wildcard inline in robots.txt?

What I mean is all the threads begin as:
www.mysite.com/threads/tweet-from-

Can I add www.mysite.com/threads/tweet-from-* to the robots.txt file?
 
I had that code already. To the best of my knowledge that would block the actual main forum page, but none of the threads contained within the forums. I like the XF URL structure, but this is one of the drawbacks.

The threads live at /threads/anything...., which is not contained within /forums/twitter
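
To illustrate the drawback described above: robots.txt rules match on the start of the URL path, so a rule for the forum node does not reach thread URLs under /threads/. Below is a minimal sketch using Python's standard urllib.robotparser. The thread slug and ID are made up for illustration, and real crawlers may differ in details from this parser.

```python
from urllib.robotparser import RobotFileParser

# Rule that covers only the forum node (path taken from the post above).
rules = """\
User-agent: *
Disallow: /forums/twitter/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The forum index page matches the rule's path prefix, so it is blocked.
forum_blocked = not parser.can_fetch("*", "http://www.mysite.com/forums/twitter/")

# A thread posted in that forum has a URL starting with /threads/,
# so the rule does not apply. The slug "tweet-from-example.123" is hypothetical.
thread_blocked = not parser.can_fetch(
    "*", "http://www.mysite.com/threads/tweet-from-example.123/"
)

print("forum blocked:", forum_blocked)
print("thread blocked:", thread_blocked)
```

So the node rule alone keeps crawlers off the forum listing, but the threads themselves stay crawlable.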
 
That should ONLY block what's under "/forums/twitter/". This would block the forums completely: Disallow: /forums/

For reference, check out the robots.txt here on the XF site.

http://xenforo.com/robots.txt

Code:
User-agent: *
Disallow: /community/find-new/
Disallow: /community/forums/-/
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/posts/
Disallow: /community/login/
Disallow: /community/admin.php
Allow: /
 
@Jake, I have taken your suggestion. Outside of the admin forums, our site is open to all. I don't wish to restrict public access to any sections if reasonably possible. So is there no way to use a wildcard in the thread title, like www.mysite.com/threads/tweet-from-*?

@anyone, if I use the below, then all of my attachments would presumably be blocked. At times I do Google Image searches and see files there. I presume those images would then all be blocked for my site?

Disallow: /community/attachments/
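
On the wildcard question: because Disallow rules match on the start of the path, a trailing * should not be needed; "Disallow: /threads/tweet-from-" already covers every URL that begins that way. Google and Bing also document support for * inside a path, but Python's standard parser used below does plain prefix matching only, so this sketch sticks to the prefix form. The thread and attachment URLs are hypothetical, and the paths assume the forum is installed at the site root.

```python
from urllib.robotparser import RobotFileParser

# Prefix rule for the feed threads; no trailing wildcard.
rules = """\
User-agent: *
Disallow: /threads/tweet-from-
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_blocked(path: str) -> bool:
    """Return True if a rule-abiding crawler should skip this path."""
    return not parser.can_fetch("*", "http://www.mysite.com" + path)


# Hypothetical URLs, for illustration only.
tweet_thread = is_blocked("/threads/tweet-from-example.123/")
other_thread = is_blocked("/threads/welcome.1/")
attachment = is_blocked("/attachments/photo-jpg.5/")

print("tweet thread blocked:", tweet_thread)
print("other thread blocked:", other_thread)
print("attachment blocked:", attachment)
```

With only this rule in place, ordinary threads and attachments remain crawlable; attachments would only be affected if a line like "Disallow: /attachments/" were added as well.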
 
Including a directory in robots.txt doesn't mean the content is blocked as such; it just prevents crawlers from accessing it.

However, only crawlers which abide by the convention will heed the robots.txt file.
A lot of them don't.
 
That's fine. I really only care about Google + Bing. Together they account for most of the traffic. I presume (perhaps wrongly so) any good smaller search engine would follow the same standards. If a search engine decides to go rogue, they probably don't have all that much traffic anyway.
 
Robots.txt isn't a method of stopping crawlers from accessing pages, but a way of telling them not to index them.
 