A Beginners Guide to Robots.txt


  • A Beginners Guide to Robots.txt

    Search engines use robots to crawl, or spider, pages on the web. These robots (also called crawlers) are simply programs written to read web page content: text, links, graphics, headings and so on. Well-behaved crawlers follow a special instruction file known as robots.txt. For example, if a search robot visits the site http://www.seopages.com, it first looks for the file at http://www.seopages.com/robots.txt. If the file is found, the robot follows its instructions about which pages of the site to read and which to skip. In short, robots.txt tells a search robot which parts of a website to index and which to leave alone. The specification was developed in 1994, came to be known as ‘The Robots Exclusion Standard’, and is still the standard way of directing robots, with almost all search engines following it. The rest of this article shows how to write a robots file and where to place it.
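
    As a rough illustration of the lookup described above, here is a small Python sketch that works out where a robot would look for robots.txt, given any page URL on a site. The function name robots_url is made up for this example; only the standard urllib.parse module is used.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.seopages.com"))
print(robots_url("http://www.seopages.com/articles/page.html?id=3"))
```

    Both calls print http://www.seopages.com/robots.txt, because the path and query of the page are dropped and only the site root matters.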

    Basically robots.txt, as the file extension implies, is a plain text file with no scripting or programming code in it. It can be created with a simple text editor such as Notepad and consists of simple text directives. Do not use a word processor: the formatting it adds can make the file unreadable to robots and, in the worst case, lead to the site being removed from the index. Almost every website has some pages that are not intended for general visitors; the robots file can ask search engines not to read them. A robots.txt file can also be customized to allow only specific search robots to spider the site, and to disallow specific directories or files. Let us create a simple robots.txt file. Open a plain text editor such as Notepad, write the following lines and save the file as robots.txt:

    #this is a typical example of robots file
    #comments are placed after hash.

    User-agent: *
    Disallow: /cgi-bin/

    This is a typical robots.txt file. The User-agent line names the robot or spider that the record applies to. For example, “User-agent: googlebot” names Google's robot, and the instructions that follow apply to that robot. The value “User-agent: *” means all robots. Next comes the Disallow directive, which gives the file or folder that the named robot must not read. The Disallow field can also be left blank, which means all pages may be spidered. Take care that each file to be disallowed is declared on its own line; multiple files must not be listed in a single Disallow directive. For example, to disallow several files we define robots.txt as:

    User-agent: Googlebot
    Disallow: /information.html
    Disallow: /private.html
    Disallow: /shipping.html

    User-agent: Architext
    Disallow: /

    In this example Googlebot is disallowed from crawling three pages, and Architext, the spider of Excite, is disallowed from every page of the site. Any spider can be instructed this way if you know its name; otherwise use ‘*’. A path is written from the site root and should start with ‘/’, so a file that lives in a subfolder is given with its complete path. Where should robots.txt be placed on a website? In the root directory ( / ), where the index file is placed. There should be only one robots.txt file per website. URLs are case-sensitive, and the file name "robots.txt" must be all lower-case and spelled exactly that way. Blank lines are not permitted within a single record, and each record must contain at least one “User-agent” field. If the robots file is placed in the wrong folder, spiders will not find it and it has no effect.
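
    To see how a robot would read the Googlebot/Architext rules above, here is a short Python sketch using the standard urllib.robotparser module. The rules are fed in as a string, with each path written from the site root (starting with /). The helper name allowed and the bot name SomeOtherBot are invented for this example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /information.html
Disallow: /private.html
Disallow: /shipping.html

User-agent: Architext
Disallow: /
"""

parser = RobotFileParser()
# parse() takes the file as a list of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, "http://www.seopages.com" + path)


print(allowed("Googlebot", "/private.html"))     # False: listed for Googlebot
print(allowed("Googlebot", "/index.html"))       # True: not listed
print(allowed("Architext", "/index.html"))       # False: whole site disallowed
print(allowed("SomeOtherBot", "/private.html"))  # True: no record applies to it
```

    The last line shows that a robot with no matching record (and no ‘*’ record) is not restricted at all.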

    Advantages of having a Robots.txt

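    It helps keep sensitive or confidential pages out of search results by asking spiders not to index them. The file itself is publicly readable and only well-behaved robots obey it, so it is not a substitute for password protection.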

    It helps in search engine specific optimization of a website (making web pages for particular search engines).

    Write this file carefully, following the format described above, before uploading it to a website: a simple mistake can result in the complete website being removed from search engine indexes. Don't make too many copies of web pages optimized for every search engine out there; be reasonable with the number and target the five or seven major engines. So now you know what a robots.txt file is, how to define it, how to use it and where to place it.

    Enjoy!



  • #2
    Example of Robots.txt

    Here's an example of a typical robots.txt file I may use:


    # records are separated by blank lines; each robot gets a single record

    # AdSense crawler: nothing disallowed
    User-agent: Mediapartners-Google*
    Disallow:

    # path matching is case-sensitive, so /*.PDF$ matches only upper-case .PDF
    User-agent: Googlebot
    Disallow: /*.doc$
    Disallow: /*.PDF$
    Disallow: /*.jpeg$
    Disallow: /*.jpg$
    Disallow: /*.png$
    Disallow: /*.gif$
    Disallow: /*.exe$
    Disallow: /*.mp3$
    Disallow: /*.mid$
    Disallow: /*.wav$

    # msnbot: same file types, plus a crawl delay (the two msnbot records are merged)
    User-agent: msnbot
    Crawl-delay: 60
    Disallow: /*.doc$
    Disallow: /*.PDF$
    Disallow: /*.jpeg$
    Disallow: /*.jpg$
    Disallow: /*.png$
    Disallow: /*.gif$
    Disallow: /*.exe$
    Disallow: /*.mp3$
    Disallow: /*.mid$
    Disallow: /*.wav$

    User-agent: Slurp
    Crawl-delay: 60

    # every other robot
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /guardian/
    Disallow: /axs/
    Disallow: /admin/

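    This robots.txt file tells Google and MSN not to index certain file types (documents, images, audio and so on) and limits how often the spiders Slurp and msnbot hit the site (otherwise they can eat up bandwidth). The * and $ wildcards and the Crawl-delay directive are extensions that are not part of the original standard, so not every robot honours them.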
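
    For anyone wondering how patterns like /*.doc$ are matched, here is a simplified Python sketch. It assumes the common convention that * stands for any run of characters and a trailing $ anchors the pattern to the end of the URL; it ignores Allow lines and rule precedence, so treat it as an illustration rather than how any particular search engine does it. The function names pattern_to_regex and is_blocked are made up for this example.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


rules = ["/*.doc$", "/*.jpg$", "/*.gif$"]
print(is_blocked("/files/report.doc", rules))     # True
print(is_blocked("/files/report.docx", rules))    # False: $ requires ".doc" at the end
print(is_blocked("/images/logo.gif?v=2", rules))  # False: the query string follows ".gif"
```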

    This is just an example. Each website is different.

    Here's one robots.txt file validator you can use: http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
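
    If you just want a quick local check before uploading, a few of the rules from this thread can be tested with a short Python script. This is only a rough sketch and nowhere near a full validator: it checks that Disallow lines come after a User-agent line, that paths start with / (or a wildcard), and that there is only one path per Disallow line. The function name lint_robots and the sample text are invented for this example.

```python
def lint_robots(text):
    """Return a list of (line_number, message) for common robots.txt mistakes."""
    problems = []
    in_record = False  # True once a User-agent line has opened a record
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            if not raw.strip():  # a truly blank line ends the current record
                in_record = False
            continue
        if ":" not in line:
            problems.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            in_record = True
        elif field == "disallow":
            if not in_record:
                problems.append((number, "Disallow before any User-agent"))
            if value and not value.startswith(("/", "*")):
                problems.append((number, "path should start with '/'"))
            if len(value.split()) > 1:
                problems.append((number, "one path per Disallow line"))
    return problems


sample = """\
Disallow: /tmp/
User-agent: *
Disallow: private.html
Disallow: /a.html /b.html
"""

for number, message in lint_robots(sample):
    print(f"line {number}: {message}")
```

    On the sample above it reports problems on lines 1, 3 and 4.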



    • #3
      Re: A Beginners Guide to Robots.txt

      Looks cool..



      • #4
        It was great post thanks for sharing with us
        http://www.ideastackhosting.com
        http://www.ideastackhosting.com/vps.html
