
robots.txt question

Ryan Kent

Well-known member
#1
I have an RSS feed which automatically posts in a forum. This forum is used exclusively for this feed. I want to block this forum from being crawled. What is the best method?

I tried adding the path to the node to my robots.txt file and that did not help. I realize the path to all threads is /threads. Can I use a wildcard in robots.txt?

What I mean is all the threads begin as:
www.mysite.com/threads/tweet-from-

Can I add www.mysite.com/threads/tweet-from-* to the robots.txt file?
 

Steve F

Well-known member
#2
Have you tried this?

Disallow: /threads/tweet-from-

Edit: Looking at your site, I see what you're talking about. Try

Disallow: /forums/twitter/
 

Ryan Kent

Well-known member
#3
I had that code already. To the best of my knowledge, that would block the main forum page itself but none of the threads contained within the forum. I like the XF URL structure, but this is one of the drawbacks.

/threads/anything... is not contained within /forums/twitter
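
For anyone who wants to check this without waiting for a recrawl, here is a minimal sketch using Python's standard-library urllib.robotparser. It treats each Disallow line as a plain path prefix, which is the baseline behaviour in the robots.txt convention, so no trailing * is needed. The thread URLs below (tweet-from-example.123, welcome.1) are made-up examples, not real pages on the site.

```python
from urllib.robotparser import RobotFileParser


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


# Rule from post #2 (edit): only covers URLs under /forums/twitter/.
forum_rule = build_parser("User-agent: *\nDisallow: /forums/twitter/\n")
# Rule from post #2 (first suggestion): covers every path starting with it.
thread_rule = build_parser("User-agent: *\nDisallow: /threads/tweet-from-\n")

# Hypothetical URLs, for illustration only.
forum_url = "http://www.mysite.com/forums/twitter/"
tweet_thread = "http://www.mysite.com/threads/tweet-from-example.123/"
other_thread = "http://www.mysite.com/threads/welcome.1/"

# The forum rule blocks the forum page but not the threads listed in it.
forum_blocks_forum = not forum_rule.can_fetch("Googlebot", forum_url)
forum_blocks_thread = not forum_rule.can_fetch("Googlebot", tweet_thread)

# The thread rule blocks the feed threads and leaves other threads alone.
thread_blocks_tweet = not thread_rule.can_fetch("Googlebot", tweet_thread)
thread_blocks_other = not thread_rule.can_fetch("Googlebot", other_thread)

print(forum_blocks_forum, forum_blocks_thread)
print(thread_blocks_tweet, thread_blocks_other)
```

Under this parser, the prefix rule on /threads/tweet-from- is what catches the feed threads. Real crawlers may differ in detail, so a search engine's own robots.txt testing tool is still the final word.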
 

Steve F

Well-known member
#4
That should ONLY block what's under "/forums/twitter/". This would block the forums completely: Disallow: /forums/

For reference check out the robots.txt here on the XF site.

http://xenforo.com/robots.txt

Code:
User-agent: *
Disallow: /community/find-new/
Disallow: /community/forums/-/
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/posts/
Disallow: /community/login/
Disallow: /community/admin.php
Allow: /
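
As a rough way to see what the file above does and does not cover, the sketch below feeds the same rules to Python's urllib.robotparser and tests a few paths. The specific paths (example.1 and so on) are invented for illustration, and Python's parser applies the first matching rule, which may not match every search engine's precedence logic exactly.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted above, copied verbatim.
XF_ROBOTS = """\
User-agent: *
Disallow: /community/find-new/
Disallow: /community/forums/-/
Disallow: /community/account/
Disallow: /community/attachments/
Disallow: /community/goto/
Disallow: /community/posts/
Disallow: /community/login/
Disallow: /community/admin.php
Allow: /
"""

parser = RobotFileParser()
parser.parse(XF_ROBOTS.splitlines())


def allowed(path: str) -> bool:
    """Return True if a compliant crawler may fetch this path."""
    return parser.can_fetch("Googlebot", "http://xenforo.com" + path)


# Hypothetical paths, chosen only to exercise the rules.
checks = {
    "/community/threads/example.1/": allowed("/community/threads/example.1/"),
    "/community/attachments/example.1/": allowed("/community/attachments/example.1/"),
    "/community/admin.php": allowed("/community/admin.php"),
    "/community/forums/example.1/": allowed("/community/forums/example.1/"),
}

for path, ok in checks.items():
    print(path, "allowed" if ok else "disallowed")
```

Threads and ordinary forum pages fall through to Allow: /, while attachment URLs are disallowed, so a compliant crawler would not fetch attached images.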
 

Ryan Kent

Well-known member
#8
@Jake, I have taken your suggestion. Outside of the admin forums, our site is open to all. I don't wish to restrict public access to any sections if reasonably possible. So there isn't a means to use a wild card in the thread title like www.mysite.com/threads/tweet-from-*?

@anyone, if I use the below, then all of my attachments would presumably be blocked. At times I do Google Image searches and see files there. I presume those images would then all be blocked for my site?

Disallow: /community/attachments/
 

Brogan

XenForo moderator
Staff member
#9
Including a directory in robots.txt doesn't mean the content is blocked as such, it just prevents crawlers from accessing it.

However, only crawlers which abide by the convention will heed the robots.txt file.
A lot of them don't.
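
To make the "abide by the convention" point concrete, here is a small sketch of where the check actually lives: inside the crawler. Nothing on the server enforces it. The names polite_fetch, fake_fetch and ExampleBot, and the attachment URL, are hypothetical stand-ins, and fake_fetch takes the place of a real HTTP request.

```python
from typing import Callable, Optional
from urllib.robotparser import RobotFileParser


def polite_fetch(robots: RobotFileParser, agent: str, url: str,
                 fetch: Callable[[str], str]) -> Optional[str]:
    """Fetch url only if robots.txt allows it for this user agent."""
    if not robots.can_fetch(agent, url):
        return None  # a well-behaved crawler skips the URL
    return fetch(url)


def fake_fetch(url: str) -> str:
    """Stand-in for an HTTP request; the server would answer either way."""
    return f"<contents of {url}>"


robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /community/attachments/"])

blocked_url = "http://www.mysite.com/community/attachments/photo.jpg"

# A compliant crawler consults robots.txt first and gets nothing.
polite_result = polite_fetch(robots, "ExampleBot", blocked_url, fake_fetch)
# A crawler that ignores robots.txt just requests the URL directly.
rude_result = fake_fetch(blocked_url)

print(polite_result)
print(rude_result)
```

If content truly must not be reachable, it needs permissions or authentication on the server side rather than a robots.txt entry.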
 

Ryan Kent

Well-known member
#10
That's fine. I really only care about Google + Bing. Together they account for most of the traffic. I presume (perhaps wrongly so) any good smaller search engine would follow the same standards. If a search engine decides to go rogue, they probably don't have all that much traffic anyway.
 

Forsaken

Well-known member
#12
Brogan said:
Including a directory in robots.txt doesn't mean the content is blocked as such, it just prevents crawlers from accessing it.

However, only crawlers which abide by the convention will heed the robots.txt file.
A lot of them don't.
Robots.txt isn't a method of stopping them from accessing pages, but a way of telling them not to index them.
 

Ryan Kent

Well-known member
#15
Any reason not to block /misc/?

I notice the /misc/quick-navigation-menu? URLs, but I am not sure if there are other, more useful URLs under /misc/.