
View Full Version : A Beginners Guide to Robots.txt



admans
08-07-2005, 11:25 AM
Search engines use robots to crawl (or "spider") pages on the web. These robots, also called crawlers, are simply programs that read web page content: text, links, graphics, headings and so on. Before crawling a site, they look for a special file known as the robots.txt file. For example, if a search robot visits http://www.seopages.com, it first requests http://www.seopages.com/robots.txt. If the file is found, the robot follows the instructions in it about how to index the site: which pages to read and which to skip. In other words, the robots.txt file tells the search robot which parts of a website to index and which to leave alone. The specification, developed in 1994 and known as 'The Robots Exclusion Standard', is still the standard way of directing robots, and almost all search engines follow it. Further in this article you can learn how to write a robots.txt file and where to place it.
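The lookup described above is easy to reproduce: whatever page a robot is about to fetch, the robots.txt file always lives at the root of that host. A minimal Python sketch (the helper name robots_url is my own, not part of any library):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a robot would request before crawling page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.seopages.com/products/list.html"))
# -> http://www.seopages.com/robots.txt
```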

As the file extension implies, robots.txt is just a plain text file without any scripting or programming code in it. It can be created with a simple text editor such as Notepad and consists of simple text directives. Word processors should not be used, because their formatting can corrupt the file and, in the worst case, get the site removed from an index. Almost every website has privileged pages containing sensitive or confidential information not intended for general users; those pages can be disallowed for reading by search engines with a robots.txt file. The file can also be customized to allow only specific search robots to spider the site, or to disallow specific directories or files. Let us create a simple robots.txt file here. Open a plain text editor such as Notepad, write the following lines and save the file as robots.txt:

#this is a typical example of robots file
#comments are placed after hash.

User-agent: *
Disallow: /cgi-bin/

This is a typical example of a robots.txt file. The User-agent directive names the robot or spider the record applies to; for example, "User-agent: googlebot" names Google's robot, and the instructions that follow are for that robot. A "User-agent: *" value means all robots on the web. Next comes the Disallow directive, which specifies a file or folder that the named robot must not read. The Disallow field can also be left blank, which means all pages are allowed to be spidered. One point needs care here: each file to be disallowed must be declared on its own line. In other words, multiple files must not be written against a single Disallow directive. For example, to disallow multiple files we would define robots.txt as:

User-agent: Googlebot
Disallow: /information.html
Disallow: /private.html
Disallow: /shipping.html

User-agent: Architext
Disallow: /

In this example Googlebot is disallowed from crawling three pages, and Architext, the spider of Excite, is disallowed from all pages of the site. Any spider can be instructed this way if you know its name; otherwise use '*'. If a file to be protected resides in a folder other than the root folder (/), specify its complete path. Now, where should robots.txt be placed on a website? In the root directory (/), where the index file is placed. Remember that there should always be just one robots.txt file on a website. Website addresses (URLs) are case-sensitive, and the filename must be exactly "robots.txt", all in lower case. Blank lines are not permitted within a single record, and each record must begin with at least one User-agent field. If the robots file is placed in the wrong folder it loses its functionality: spiders never find it, making it useless.
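Python's standard library ships a parser for exactly these rules, urllib.robotparser, which is a handy way to sanity-check a file before uploading it. A small sketch, feeding it a record like the example above (note that each Disallow path begins with /, as the standard expects):

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to the example above, supplied as a list of lines
# instead of being fetched over the network.
rules = [
    "User-agent: Googlebot",
    "Disallow: /information.html",
    "Disallow: /private.html",
    "Disallow: /shipping.html",
    "",
    "User-agent: Architext",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(rules)

print(parser.can_fetch("Googlebot", "/index.html"))    # True: not disallowed
print(parser.can_fetch("Googlebot", "/private.html"))  # False: listed above
print(parser.can_fetch("Architext", "/index.html"))    # False: whole site blocked
```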

Advantages of having a Robots.txt

It helps to keep sensitive and confidential pages out of search results by disallowing spiders from indexing them (note that this only keeps well-behaved robots away; the file itself is publicly readable, so it is no substitute for real access control).

It helps in search engine specific optimization of a website (making web pages for particular search engines).

This file should be written very carefully, exactly in the format specified, before uploading it to a website, because a simple mistake can result in a complete website being removed from a search engine's index. Don't indulge in making copies of web pages optimized for every search engine in existence; be reasonable with the number and target the five or so major engines. So now you know what a robots.txt file is, how to define it, how to use it, and where to place it.
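To see how small that "simple mistake" can be, compare these two records — a single character separates blocking the whole site from allowing all of it:

```
# Blocks the ENTIRE site for every robot:
User-agent: *
Disallow: /

# Allows the entire site (an empty Disallow matches nothing):
User-agent: *
Disallow:
```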

Enjoy!

admans
08-07-2005, 11:28 AM
Here's an example of a typical robots.txt file I may use:




User-agent: Mediapartners-Google*
Disallow:
User-agent: Googlebot
Disallow: /*.doc$
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.jpg$
Disallow: /*.png$
Disallow: /*.gif$
Disallow: /*.exe$
Disallow: /*.mp3$
Disallow: /*.mid$
Disallow: /*.wav$
User-Agent: msnbot
Disallow: *.doc$
Disallow: *.PDF$
Disallow: *.jpeg$
Disallow: *.jpg$
Disallow: *.png$
Disallow: *.gif$
Disallow: *.exe$
Disallow: *.mp3$
Disallow: *.mid$
Disallow: *.wav$
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /guardian/
Disallow: /axs/
Disallow: /admin/
User-agent: Slurp
Crawl-delay: 60
User-Agent: msnbot
Crawl-delay: 60


This robots.txt file tells Google and MSN not to index certain files (images, documents, executables and so on) and limits how often the spiders Slurp and msnbot hit the site (otherwise they can eat up bandwidth). Note that the * and $ wildcards and the Crawl-delay directive are extensions beyond the original 1994 standard; Google and MSN support the wildcards, but not every robot does.
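A polite crawler reads the Crawl-delay value and sleeps that long between requests. Python's urllib.robotparser exposes it via crawl_delay() (Python 3.6+); a sketch using just the Slurp record from the file above:

```python
from urllib.robotparser import RobotFileParser

# A minimal record like the one above -- just the Slurp entry.
rules = [
    "User-agent: Slurp",
    "Crawl-delay: 60",
]

parser = RobotFileParser()
parser.parse(rules)

# Seconds a polite crawler should wait between requests.
print(parser.crawl_delay("Slurp"))   # 60
print(parser.crawl_delay("msnbot"))  # None -- no record applies to msnbot here
```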

This is just an example. Each website is different.

Here's one robots.txt file validator you can use: http://www.searchengineworld.com/cgi-bin/robotcheck.cgi

Positive
08-08-2005, 00:08 AM
Looks cool..

james0131
02-01-2014, 07:37 AM
It was a great post, thanks for sharing with us.