Thursday 12 April 2012

robots.txt file of web sites

Have you ever wondered how search engines such as google.com and bing.com retrieve web sites when we provide search queries?

Every search engine has a web robot (spider bot), which crawls over the internet and indexes various web sites based on complex logic. From this indexed set of pages, the search engine retrieves the pages appropriate to our search query.

These spider bots retrieve a web site, traverse every hyperlink the website has, and index the site content. For example, the spider bot of google.com is Googlebot and the spider bot of bing.com is Bingbot.

These spider bots read the robots.txt file of the website to determine which pages are not to be indexed.

The robots.txt file contains information about the various pages of the web site and specifies the following details:

1. The robots (user agents) that the web site owner does or does not want to index his/her website
2. The web pages that the owner does not want the spider bots to index
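A minimal robots.txt illustrating both kinds of rules might look as below (the bot name and paths here are just placeholders for illustration):

```
# Block one specific bot from the entire site
User-agent: BadBot
Disallow: /

# Keep all other bots out of the private area only
User-agent: *
Disallow: /private/
```

Each `User-agent` line names a bot (or `*` for all bots), and the `Disallow` lines below it list the paths that bot should not crawl.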

You can access the robots.txt file of various leading websites as below: http://www.example.com/robots.txt


For example:
http://www.espncricinfo.com/robots.txt
http://www.facebook.com/robots.txt
http://en.wikipedia.org/robots.txt
http://www.youtube.com/robots.txt

If you go through the above robots.txt files, you can observe that these websites specifically block certain spider bots and allow only certain others, such as advertisement-related bots.
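The way a well-behaved bot applies these rules can be sketched with Python's built-in robots.txt parser. The rules and URLs below are made-up examples; the snippet parses them directly rather than downloading a real file, so it runs offline:

```python
# Check whether a given bot may fetch a URL, according to robots.txt rules,
# using Python's standard-library parser.
from urllib.robotparser import RobotFileParser

# Example rules (placeholders, not from any real site): all bots are
# allowed everywhere except the /private/ area.
rules = """User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A bot consults can_fetch() before requesting each page.
print(rp.can_fetch("googlebot", "http://www.example.com/index.html"))  # True
print(rp.can_fetch("googlebot", "http://www.example.com/private/a"))   # False
```

In real use, you would point the parser at a live file with `rp.set_url("http://www.example.com/robots.txt")` followed by `rp.read()`.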

For more information, you can refer to the below-mentioned website: http://www.robotstxt.org/robotstxt.html

