
Robots.txt

Discussion in 'Tips and Guides' started by Ryan Kent, Jun 6, 2011.

  1. Ryan Kent

    Ryan Kent Well-Known Member

    If you value SEO (search engine rankings), then your XF site should probably have solid robots.txt settings. I have spent a considerable amount of time learning about SEO and examining detailed crawl logs of my site. Based on what I learned, I have updated my robots.txt file as shown below.

    For those who do not know, a robots.txt file tells search engines what parts of the site they can and cannot explore. These are recommendations to the search engine crawlers which can be ignored, but Google and Bing follow them for the most part.
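As a rough illustration of how a well-behaved crawler consults these rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The rule set is a small hypothetical subset in the style of the file in this post, and `example.com` and the `is_allowed` helper are placeholders, not anything XenForo provides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a small subset in the style of the file in this post.
RULES = """\
User-agent: *
Disallow: /account/
Disallow: /admin.php
Allow: /
"""


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Blocked by "Disallow: /account/"
    print(is_allowed("http://example.com/account/"))
    # Falls through to "Allow: /"
    print(is_allowed("http://example.com/threads/robots-txt.16735/"))
```

Note that this parser does simple prefix matching, so it is only a stand-in for what a real search engine does.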

    There are other ways to keep content out of search results, such as the "noindex" tag, but the robots.txt file is the fastest and easiest way. Before examining my robots.txt file there are a few additional notes:

    - I have XenPorta and all supporting add-ons installed

    - Your robots.txt file is publicly viewable. You should never try to hide private data simply by making a robots.txt entry.

    - A primary reason to use the robots.txt file is to prevent unwanted pages from appearing in search results. If you have a forum discussing Chevy Corvettes then you may wish to block your "off topic" section and other irrelevant pages.

    - A secondary reason to block areas of your site is to keep "junk" out of the search engines. A search engine will usually not crawl your whole site. By eliminating the junk, you help crawlers locate your quality content faster so it gets indexed.

    User-agent: *
    Disallow: /test/
    Disallow: /account/
    Disallow: /admin.php
    Disallow: /ajax/
    Disallow: /conversations/
    Disallow: /events/birthdays/
    Disallow: /events/monthly
    Disallow: /events/weekly
    Disallow: /find-new/
    Disallow: /forums/-/
    Disallow: /forums/tweets/
    Disallow: /goto/
    Disallow: /help/
    Disallow: /login/
    Disallow: /lost-password/
    Disallow: /media/category/
    Disallow: /media/keyword/
    Disallow: /media/user/
    Disallow: /media/service/
    Disallow: /media/submit/
    Disallow: /misc/style?*
    Disallow: /misc/quick-navigation-menu?*
    Disallow: /online/
    Disallow: /pages/conduct/
    Disallow: /pages/privacy/
    Disallow: /posts/
    Disallow: /threads/tera-tweet-from-*
    Disallow: /wiki/special/
    Allow: /

    The above entries explained.

    /test is my backup and testing site. It should not be crawled.
    /-/ is a page for marking all forums as read
    /tweets/ is a forum I use to automatically post all tweets related to my site
    /goto/ is used to go to a specified post
    /media/keyword /user /service /submit are all XenMedio support pages
    /threads/tera-tweet-from-* is an RSS feed that auto-creates a thread for each tweet
    /wiki/special are the wiki support pages
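Some of the entries above rely on the `*` wildcard. As a sketch of how such a pattern is matched, the helper below assumes Google-style rules: `*` matches any run of characters, a trailing `$` anchors the end of the URL, and everything else is a prefix match. The function name `rule_matches` and the sample paths are made up for illustration; other crawlers may interpret wildcards differently or not at all.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumes Google-style matching: '*' matches any run of characters,
    a trailing '$' anchors the end, and otherwise it is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only anchors at the start, which gives prefix semantics.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    # Hypothetical thread path created from a tweet.
    print(rule_matches("/threads/tera-tweet-from-*", "/threads/tera-tweet-from-someone.123/"))
    print(rule_matches("/forums/-/", "/forums/general.2/"))
```

Because matching is by prefix, a trailing `*` as in `/misc/style?*` does not change which URLs are matched.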

    NOTE: Ideally a robots.txt file is blank. You should let the crawlers access all of your site and control which pages shouldn't be indexed with the "noindex" tag. This method is used because XF doesn't offer the flexibility to easily add the noindex tag to pages on an individual basis.
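For comparison, the "noindex" approach mentioned in the note lives in the page itself as a `<meta name="robots">` tag. A minimal sketch of checking a page for it with Python's standard `html.parser` is below; the class and function names and the sample HTML are hypothetical, and this does not cover the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma separated, e.g. "noindex, follow".
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page asks search engines not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```

Keep in mind that a crawler can only see this tag if robots.txt lets it fetch the page in the first place.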
     
  2. Cool

    Cool Active Member

    And what do you do about search engines which totally ignore the robots.txt file?
     
  3. Ryan Kent

    Ryan Kent Well-Known Member

    Stats for US search engine traffic are below. Yahoo search is powered by Bing, so there are really only two search engines: Google and Bing. Together they account for over 98% of searches. The others simply do not matter...at all.



    United States:
    Google 84.58%
    Yahoo 8.13%
    Bing 5.38%
    Ask.com 0.79%
    Other 1.12%
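A quick check of the "over 98%" figure, using only the shares listed above (the variable names are just for illustration):

```python
# Shares copied from the list above, in percent.
shares = {
    "Google": 84.58,
    "Yahoo": 8.13,
    "Bing": 5.38,
    "Ask.com": 0.79,
    "Other": 1.12,
}

# Google plus the two Bing-powered engines.
top_three = round(shares["Google"] + shares["Yahoo"] + shares["Bing"], 2)
total = round(sum(shares.values()), 2)

if __name__ == "__main__":
    print(top_three)  # 98.09
    print(total)      # 100.0
```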
     
  4. DerTobi75

    DerTobi75 Active Member

    For what reason do you disallow Events?
     
  5. Ryan Kent

    Ryan Kent Well-Known Member

    A few reasons:

    - You want your actual site content to be indexed: your threads, etc. If we create an event, there is usually an event thread created. We want that thread appearing in the search results. If we have both the event and the thread, we would have an issue with duplicate content.

    - If you have events/weekly indexed, your crawl report will be full of errors. Basically the page titles are exactly the same except for the date. The content is often the same or similar. It is just not the type of page you want to return in search results.

    - /monthly and /birthdays are the same idea. Tons of duplicate content, and not helpful as search results.
     
  6. el canadiano

    el canadiano Active Member

    Sitemap: http://www.mk3dsforum.com/sitemap

    User-agent: *
    Disallow: /attachments/
    Disallow: /misc/
    Disallow: /help/
    Disallow: /search/
    Disallow: /members/
    Disallow: /register/
    Disallow: /login/
    Disallow: /online/
    Disallow: /lost-password/
    Disallow: /recent-activity/
    Disallow: /account/
    Disallow: /admin.php
    Disallow: /conversations/
    Disallow: /events/birthdays/
    Disallow: /events/monthly
    Disallow: /events/weekly
    Disallow: /find-new/
    Disallow: /forums/-/
    Disallow: /forums/tweets/
    Disallow: /goto/
    Disallow: /help/
    Disallow: /login/
    Disallow: /media/keyword/
    Disallow: /media/user/
    Disallow: /media/service/
    Disallow: /media/submit/
    Disallow: /misc/style?*
    Disallow: /misc/quick-navigation-menu?*
    Disallow: /online/
    Disallow: /pages/conduct/
    Disallow: /pages/privacy/
    Disallow: /posts/
    Disallow: /threads/tera-tweet-from-*
    Disallow: /wiki/special/
    Allow: /

    Here's mine. I took a lot of your suggestions but I also took into account that I have a sitemap.
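As a sketch of how a `Sitemap:` line is picked up, Python's standard `urllib.robotparser` (3.8 or later) exposes it through `site_maps()`. The sitemap URL, the rule set and the `read_sitemaps` helper below are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the Sitemap line stands apart from the User-agent group.
ROBOTS_TXT = """\
Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /search/
Allow: /
"""


def read_sitemaps(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(read_sitemaps(ROBOTS_TXT))
```

The sitemap declaration is not tied to any user agent, so it can sit anywhere in the file as long as it does not split a `User-agent` line from its rules.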
     
  7. bogus

    bogus Guest

    I have read through this and quite a lot of other threads on XenForo and other XenForo-based sites, but I can't get the robots.txt sorted out for my board.
    In Analytics I see quite a lot of pages that are not allowed for bots, probably because of a wrong setup of my robots.txt.
    I have XenPorta installed, as well as a wiki and a page called idlerpg. xenUtiles is also installed.
    The forum is redirected to brainlag.eu/forum because of XenPorta.

    This is my robots.txt

    And that's my .htaccess
    Maybe someone can help me put together a suitable robots.txt for my site. Many thanks.
     
  8. Peggy

    Peggy Well-Known Member

    This is mine. I can't help but feel that there's stuff missing.

    User-agent: *
    Disallow: /test/
    Disallow: /account/
    Disallow: /admin.php
    Disallow: /attachments/
    Disallow: /conversations/
    Disallow: /events/birthdays/
    Disallow: /events/monthly
    Disallow: /events/weekly
    Disallow: /find-new/
    Disallow: /goto/
    Disallow: /help/
    Disallow: /login/
    Disallow: /lost-password/
    Disallow: /members/
    Disallow: /misc/style?*
    Disallow: /misc/quick-navigation-menu?*
    Disallow: /online/
    Disallow: /posts/
    Disallow: /recent-activity/
    Disallow: /register/
    Disallow: /search/
    Allow: /

    Sitemap: http://mahoningvalleytalk.com/sitemap/sitemap.xml.gz
     
  9. Ryan Kent

    Ryan Kent Well-Known Member

    I would recommend using the file I shared above. You would want to remove the following entries:

    /test
    /tweets/
    /threads/tera-tweet-from-*

    Those are customizations for my site. You can leave them in and they won't do any harm, but it would be best to remove them to avoid any confusion.
     
  10. Ryan Kent

    Ryan Kent Well-Known Member

    Based on inspection of my crawl reports, I have added some additional entries to the original post:

    Disallow: /media/category/
    Disallow: /lost-password/
    Disallow: /ajax/
     
  11. bogus

    bogus Guest

    Well. That's what I have now
     
  12. el canadiano

    el canadiano Active Member

    Added to mine, but why do you want /posts/ disallowed? Doesn't it just redirect to whatever corresponding post and then you're fine with rel="canonical"?
     
  13. Ryan Kent

    Ryan Kent Well-Known Member

    When Google encounters a hash mark in a URL, it stops there. The following URL: xenforo.com/community/threads/robots-txt.16735/#post-222029 is seen by Google as: xenforo.com/community/threads/robots-txt.16735/

    Therefore the canonical never even becomes a factor.

    The path that is being blocked is /posts/ which is not used for anything that would need to be indexed. For example, you can access this post as http://xenforo.com/posts/222029 but that is never really done. It just adds extra pages for Google and other search engines to crawl. Somehow they see links like that when crawling XF sites.
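The fragment handling described above can be sketched with Python's standard `urllib.parse.urldefrag`, which splits a URL at the hash mark. The `strip_fragment` helper is just an illustrative name, and this only mimics the truncation, not anything Google-specific.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without the '#...' fragment."""
    page, _fragment = urldefrag(url)
    return page


if __name__ == "__main__":
    url = "http://xenforo.com/community/threads/robots-txt.16735/#post-222029"
    # Prints the thread URL with the #post-222029 part removed.
    print(strip_fragment(url))
```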
     
  14. Kaiser

    Kaiser Well-Known Member

    Here is mine:
    http://adminbb.org/robots.txt
    Code:
    Sitemap: http://adminbb.org/sitemap/

    User-agent: *
    Disallow: /attachments/
    Disallow: /misc/
    Disallow: /help/
    Disallow: /xentrade/
    Disallow: /unanswered/
    Disallow: /ReadPC/
    Disallow: /search/
    Disallow: /members/
    Disallow: /register/
    Disallow: /online/
    Disallow: /lost-password/
    Disallow: /recent-activity/
    Disallow: /account/
    Disallow: /admin.php
    Disallow: /conversations/
    Disallow: /events/birthdays/
    Disallow: /events/monthly
    Disallow: /events/weekly
    Disallow: /find-new/
    Disallow: /forums/-/
    Disallow: /goto/
    Disallow: /help/
    Disallow: /login/
    Disallow: /misc/style?*
    Disallow: /misc/quick-navigation-menu?*
    Disallow: /online/
    Disallow: /posts/
    Disallow: /ajax/
    Allow: /
     
  15. el canadiano

    el canadiano Active Member

    Oh, odd. I always assumed Canonical would have taken care of it.
     
  16. Ryan Kent

    Ryan Kent Well-Known Member

    Canonical would take care of it, but it is not necessary.

    Canonical is designed to help search engines determine which variation of a URL is the primary page you wish to be indexed. The fragment (the part of the URL after the hash mark) is truncated and not considered part of the URL, so it is a separate discussion.
     
  17. bogus

    bogus Guest

    This is probably a robots thing, which is why I ask it here. How can I force Google to show a specific text above the URL of the page?
    At the moment the PayPal donation text (2nd link) is displayed, but it would be better to have something like...."Welcome to .... Community .....
     
  18. James

    James Well-Known Member

    That would be the thread_list page. The timestamp under the last poster uses the /posts URL. Just about to post a suggestion.

    Also in the forum_list, the latest post thread title uses the /posts/ URL format.
     
  19. Ryan Kent

    Ryan Kent Well-Known Member

    robots.txt is purely about blocking crawler access to your site. If you see any results in Google, then the page is most likely not blocked in robots.txt. It is still possible to have pages listed due to Google following links from other sites to your content.

    Looking at the link you offered, the 2nd result is normal. I don't see any reference to paypal in the top 3 links.
     
  20. Rich

    Rich Active Member

    Just one question: do you add /forum/ as shown below?

    Disallow: /forum/attachments/
    or is it
    Disallow: /attachments/

    if the forum is in /forum?
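Rules in robots.txt are matched against the URL path starting from the site root, so the two variants behave differently for a forum installed under /forum. A minimal sketch with Python's standard `urllib.robotparser` shows the difference; `example.com`, the attachment path and the `can_fetch` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

WITHOUT_PREFIX = "User-agent: *\nDisallow: /attachments/\n"
WITH_PREFIX = "User-agent: *\nDisallow: /forum/attachments/\n"

# Hypothetical attachment URL on a forum installed under /forum.
URL = "http://example.com/forum/attachments/photo.123/"


def can_fetch(rules: str, url: str) -> bool:
    """Return True if the given rules allow any crawler to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch("*", url)


if __name__ == "__main__":
    # The path starts with /forum/, so "/attachments/" does not match it.
    print(can_fetch(WITHOUT_PREFIX, URL))
    # With the prefix the rule matches and the URL is blocked.
    print(can_fetch(WITH_PREFIX, URL))
```

In other words, under these assumptions the rule needs the /forum/ prefix to have any effect.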
     
