Robots.txt

Ryan Kent · Jun 6, 2011

If you value SEO (search engine rankings), then your XF site should probably have solid robots.txt settings. I have spent a considerable amount of time learning about SEO and examining detailed crawl logs of my site. Based on what I learned, I have updated my robots.txt file as shown below.

For those who do not know, a robots.txt file tells search engines which parts of the site they can and cannot explore. These are only recommendations, which crawlers are free to ignore, but Google and Bing follow them for the most part.

There are other ways to keep content out of search results, such as the "noindex" tag, but the robots.txt file is the fastest and easiest way. Before examining my robots.txt file there are a few additional notes:

- I have XenPorta and all supporting add-ons installed

- Your robots.txt file is publicly viewable. You should never try to hide private data simply by making a robots.txt entry.

- A primary reason to use the robots.txt file is to prevent unwanted pages from appearing in search results. If you have a forum discussing Chevy Corvettes then you may wish to block your "off topic" section and other irrelevant pages.

- A secondary reason to block areas of your site is to keep "junk" out of the search engines. A search engine will usually not crawl your whole site. By eliminating the junk, you help crawlers locate your quality content faster so it gets indexed.

User-agent: *
Disallow: /test/
Disallow: /account/
Disallow: /admin.php
Disallow: /ajax/
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /forums/tweets/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /lost-password/
Disallow: /media/category/
Disallow: /media/keyword/
Disallow: /media/user/
Disallow: /media/service/
Disallow: /media/submit/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /pages/conduct/
Disallow: /pages/privacy/
Disallow: /posts/
Disallow: /threads/tera-tweet-from-*
Disallow: /wiki/special/
~~Allow: /~~

The above entries explained.

/test is my backup and testing site. It should not be crawled.
/-/ is a page for marking all forums as read
/tweets/ is a forum I use to automatically post all tweets related to my site
/goto/ is used to go to a specified post
/media/keyword /user /service /submit are all XenMedio support pages
/threads/tera-tweet-from-* is an RSS feed that auto-creates a thread for each tweet
/wiki/special are the wiki support pages
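
If you want to sanity-check rules like these before uploading them, here is a minimal sketch using Python's standard urllib.robotparser. It only uses a few of the plain prefix entries from the file above; as far as I know this parser does simple prefix matching, so the wildcard entries are left out. The example.com host and the test URLs are assumptions for illustration, not part of my site.

```python
from urllib.robotparser import RobotFileParser

# A few literal-prefix rules copied from the file above.
RULES = """\
User-agent: *
Disallow: /account/
Disallow: /events/weekly
Disallow: /forums/-/
Disallow: /posts/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# example.com and these paths are hypothetical stand-ins for a real XF site.
checks = [
    "http://example.com/threads/robots-txt.16735/",
    "http://example.com/posts/222029",
    "http://example.com/events/weekly",
    "http://example.com/account/",
]

# Map each URL to True (crawlable) or False (blocked) for a generic crawler.
results = {url: parser.can_fetch("*", url) for url in checks}

for url, allowed in results.items():
    print("allowed" if allowed else "blocked", url)
```

Real crawlers may interpret a file differently from this parser, so treat the result as a rough check and confirm with the testing tools the search engines provide.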

NOTE: Ideally a robots.txt file is blank. You should let the crawlers access all of your site and control which pages shouldn't be indexed with the "noindex" tag. I use robots.txt instead because XF doesn't offer the flexibility to easily add the noindex tag to pages on an individual basis.

Cool · Jun 6, 2011

And what do you do with search engines which totally ignore the robots.txt file?

Ryan Kent · Jun 6, 2011

Stats for US search engine traffic are below. Bing is also Yahoo, so there are really only 2 search engines, Google & Bing. They account for over 98% of searches. The others simply do not matter...at all.

United States:
Google 84.58%
Yahoo 8.13%
Bing 5.38%
Ask.com 0.79%
Other 1.12%

DerTobi75 · Jun 6, 2011

For what reason do you disallow Events?

Ryan Kent · Jun 6, 2011

A few reasons:

- You want your actual site content to be indexed: your threads, etc. If we create an event, there is usually an event thread created. We want that thread appearing in the search results. If we have both the event and the thread indexed, we would have an issue with duplicate content.

- If you have events/weekly indexed, your crawl report will be full of errors. Basically the page titles are exactly the same except for the date. The content is often the same or similar, and it is just not the type of page you want returned in search results.

- /monthly and /birthdays are the same idea. Tons of duplicate content, and not helpful as search results.

el canadiano · Jun 9, 2011

Sitemap: http://www.mk3dsforum.com/sitemap

User-agent: *
Disallow: /attachments/
Disallow: /misc/
Disallow: /help/
Disallow: /search/
Disallow: /members/
Disallow: /register/
Disallow: /login/
Disallow: /online/
Disallow: /lost-password/
Disallow: /recent-activity/
Disallow: /account/
Disallow: /admin.php
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /forums/tweets/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /media/keyword/
Disallow: /media/user/
Disallow: /media/service/
Disallow: /media/submit/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /pages/conduct/
Disallow: /pages/privacy/
Disallow: /posts/
Disallow: /threads/tera-tweet-from-*
Disallow: /wiki/special/
Allow: /

Here's mine. I took a lot of your suggestions but I also took into account that I have a sitemap.

bogus · Jun 9, 2011

I have read through this and quite a lot of other threads on XenForo and other XenForo-based sites, but I can't get the robots.txt sorted out for my board.
In analytics I see quite a lot of pages that are not allowed to bots, probably because of a wrong setup of my robots.txt.
I have XenPorta installed, plus a wiki and a page called idlerpg. xenUtiles is also installed.
The forum is redirected to brainlag.eu/forum because of XenPorta.

This is my robots.txt

And that's my .htaccess

# Mod_security can interfere with uploading of content such as attachments. If you
# cannot attach files, remove the "#" from the lines below.
<IfModule mod_security.c>
SecFilterEngine Off
SecFilterScanPOST Off
</IfModule>

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{HTTP_HOST} !www\.brainlag\.eu
RewriteRule (.*) http://www.brainlag.eu/$1 [R=301,L]

# If you are having problems with the rewrite rules, remove the "#" from the
# line that begins "RewriteBase" below. You will also have to change the path
# of the rewrite to reflect the path to your XenForo installation.
#RewriteBase /xenforo

RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -l [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^.*$ - [NC,L]
RewriteRule ^(data|js|styles|install) - [NC,L]
RewriteRule ^.*$ index.php [NC,L]
</IfModule>

Maybe someone can help me put together a suitable robots.txt for my site. Many thanks.

Peggy · Jun 9, 2011

This is mine. I can't help but feel that there's stuff missing.

User-agent: *
Disallow: /test/
Disallow: /account/
Disallow: /admin.php
Disallow: /attachments/
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /lost-password/
Disallow: /members/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /posts/
Disallow: /recent-activity/
Disallow: /register/
Disallow: /search/
Allow: /

Sitemap: http://mahoningvalleytalk.com/sitemap/sitemap.xml.gz
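
As a rough check that a Sitemap line sits happily alongside the rules, here is a sketch using Python's standard urllib.robotparser on a shortened version of the file above. The shortened rule list and the test paths are assumptions for illustration; site_maps() needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the file above: two rules plus the Sitemap line.
ROBOTS = """\
User-agent: *
Disallow: /members/
Disallow: /search/
Allow: /

Sitemap: http://mahoningvalleytalk.com/sitemap/sitemap.xml.gz
"""

parser = RobotFileParser()
parser.parse(ROBOTS.splitlines())

# The Sitemap directive is read independently of the User-agent group.
sitemaps = parser.site_maps()

# The Disallow rules still apply to everything in the group above it.
members_ok = parser.can_fetch("*", "http://mahoningvalleytalk.com/members/")
threads_ok = parser.can_fetch("*", "http://mahoningvalleytalk.com/threads/")

print(sitemaps)
print("members crawlable:", members_ok)
print("threads crawlable:", threads_ok)
```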

Ryan Kent · Jun 9, 2011

bogus said:
Maybe someone can help me put together a suitable robots.txt for my site. Many thanks.

I would recommend using the file I shared above. You would want to remove the following entries:

/test
/tweets/
/threads/tera-tweet-from-*

Those are customizations for my site. You can leave them in and they won't do any harm, but it would be best to remove them to avoid any confusion.

Ryan Kent · Jun 9, 2011

Based on inspection of my crawl reports, I have added some additional entries to the original post:

Disallow: /media/category/
Disallow: /lost-password/
Disallow: /ajax/

bogus · Jun 10, 2011

Well, that's what I have now:

http://www.brainlag.eu/sitemap

User-agent: *
Disallow: /account/
Disallow: /admin.php
Disallow: /admindav.php
Disallow: /attachments/
Disallow: /admin.php
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /lost-password/
Disallow: /media/category/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /pages/conduct/
Disallow: /pages/privacy/
Disallow: /posts/
Disallow: /recent-activity/
Disallow: /register/
Disallow: /wiki/special/
Disallow: /ajax/
Disallow: /data/
Disallow: /internal_data/
Disallow: /js/
Disallow: /library/
Disallow: /styles/
Allow: /

el canadiano · Jun 10, 2011

Ryan Kent said:
Based on inspection of my crawl reports, I have added some additional entries to the original post:

Disallow: /media/category/
Disallow: /lost-password/
Disallow: /ajax/

Added to mine, but why do you want /posts/ disallowed? Doesn't it just redirect to whatever corresponding post and then you're fine with rel="canonical"?

Ryan Kent · Jun 10, 2011

When Google encounters a hash mark in a URL, it stops reading there. The following URL: xenforo.com/community/threads/robots-txt.16735/#post-222029 is seen by Google as: xenforo.com/community/threads/robots-txt.16735/

Therefore the canonical never even becomes a factor.

The path that is being blocked is /posts/ which is not used for anything that would need to be indexed. For example, you can access this post as http://xenforo.com/posts/222029 but that is never really done. It just adds extra pages for Google and other search engines to crawl. Somehow they see links like that when crawling XF sites.
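
To illustrate the difference, here is a small sketch using Python's standard urllib.parse. It is only an analogy for how a crawler treats the two URL forms, not a description of Google's internals, and the http:// scheme added to the thread URL is an assumption so the URL parses cleanly.

```python
from urllib.parse import urldefrag, urlsplit

# Thread URL with a post anchor (scheme added for parsing).
anchored = "http://xenforo.com/community/threads/robots-txt.16735/#post-222029"

# Everything after the hash mark is a fragment, not part of the page address.
page_url, fragment = urldefrag(anchored)

# The /posts/ form has no fragment; it is a separate path that must be fetched.
post_path = urlsplit("http://xenforo.com/posts/222029").path

print(page_url)   # the thread URL without the anchor
print(fragment)   # the dropped part
print(post_path)  # a distinct path that the /posts/ rule matches
```

The anchored link collapses into the thread URL, while the /posts/ link is a distinct page to crawl, which is why only the latter needs a Disallow entry.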

Kaiser · Jun 11, 2011

Here is mine:
http://adminbb.org/robots.txt

Code:

Sitemap: http://adminbb.org/sitemap/

User-agent: *
Disallow: /attachments/
Disallow: /misc/
Disallow: /help/
Disallow: /xentrade/
Disallow: /unanswered/
Disallow: /ReadPC/
Disallow: /search/
Disallow: /members/
Disallow: /register/
Disallow: /online/
Disallow: /lost-password/
Disallow: /recent-activity/
Disallow: /account/
Disallow: /admin.php
Disallow: /conversations/
Disallow: /events/birthdays/
Disallow: /events/monthly
Disallow: /events/weekly
Disallow: /find-new/
Disallow: /forums/-/
Disallow: /goto/
Disallow: /help/
Disallow: /login/
Disallow: /misc/style?*
Disallow: /misc/quick-navigation-menu?*
Disallow: /online/
Disallow: /posts/
Disallow: /ajax/
Allow: /

el canadiano · Jun 11, 2011

Ryan Kent said:
When Google encounters a hash mark in a URL, it stops reading there. The following URL: xenforo.com/community/threads/robots-txt.16735/#post-222029 is seen by Google as: xenforo.com/community/threads/robots-txt.16735/

Therefore the canonical never even becomes a factor.

The path that is being blocked is /posts/ which is not used for anything that would need to be indexed. For example, you can access this post as http://xenforo.com/posts/222029 but that is never really done. It just adds extra pages for Google and other search engines to crawl. Somehow they see links like that when crawling XF sites.

Oh, odd. I always assumed Canonical would have taken care of it.

Ryan Kent · Jun 11, 2011

el canadiano said:
Oh, odd. I always assumed Canonical would have taken care of it.

Canonical would take care of it, but it is not necessary.

Canonical is designed to help search engines determine which variation of a URL is the primary page you wish to be indexed. Hash fragments are truncated and not considered part of the URL, so that is a separate discussion.

bogus · Jun 14, 2011

This is probably a robots thing, which is why I am asking here. How can I force Google to show specific text above the URL of the page?
At the moment the PayPal donation text (2nd link) is displayed, but it would be better to have something like "Welcome to ... Community ...".

James · Jun 14, 2011

Ryan Kent said:
Somehow they see links like that when crawling XF sites.

That would be the thread_list page. The timestamp under the last poster uses the /posts URL. Just about to post a suggestion.

Also in the forum_list, the latest post thread title uses the /posts/ URL format.

Ryan Kent · Jun 15, 2011

bogus said:
This is probably a robots thing, which is why I am asking here. How can I force Google to show specific text above the URL of the page?
At the moment the PayPal donation text (2nd link) is displayed, but it would be better to have something like "Welcome to ... Community ...".

robots.txt is purely about blocking crawler access to your site. If you see any results in Google, then the page is most likely not blocked in robots.txt. It is still possible to have pages listed due to Google following links from other sites to your content.

Looking at the link you offered, the 2nd result is normal. I don't see any reference to PayPal in the top 3 links.

Rich · Jul 27, 2011

Just one question: do you add /forum/ like below?

Disallow: /forum/attachments/
or is it
Disallow: /attachments/

If the forum is in /forum
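
One way to see how the prefix matters is a quick sketch with Python's standard urllib.robotparser. Rules in robots.txt are matched against the URL path starting from the site root, so with the forum installed in /forum the attachment URLs begin with /forum/attachments/. The example.com host and the attachment URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(rules, url):
    """Return True if a generic crawler may not fetch url under rules."""
    parser = RobotFileParser()
    parser.parse(rules)
    return not parser.can_fetch("*", url)

# The same rule written without and with the /forum prefix.
RULES_ROOT = ["User-agent: *", "Disallow: /attachments/"]
RULES_FORUM = ["User-agent: *", "Disallow: /forum/attachments/"]

# Hypothetical attachment URL on a site whose forum lives in /forum.
url = "http://example.com/forum/attachments/1"

blocked_by_root_rule = is_blocked(RULES_ROOT, url)
blocked_by_forum_rule = is_blocked(RULES_FORUM, url)

print("Disallow: /attachments/ blocks it:", blocked_by_root_rule)
print("Disallow: /forum/attachments/ blocks it:", blocked_by_forum_rule)
```

Under this parser only the rule that includes /forum matches, and the robots.txt file itself still has to live at the root of the domain.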
