Newsletter Archive

SUBJECT: robots.txt file

The timing of our trip to Boston could not have been better. The weather was perfect and the city was stunning. My thanks to Greg and the people at Volt for inviting us to the show. It was well worth the trip.

This week I will cover the robots.txt file. Several of you e-mailed me questions about this file and how it works.

When you submit your address to most search engines, they send a spider to your server to index the pages in your site. When indexing a document, the spider inserts information into its database. Which information the spider inserts depends on the search engine. Some engines index the title, some the first few paragraphs, others index the entire page with all of its words, and some index only the META tags.

The first thing the majority of spiders do is look for the file robots.txt. This file tells them which pages or files they may not index. If you have any files or folders you don't want people to find in public search engines, this is where you tell the spiders not to place them in the database. Always remember that any file on your web server that is not otherwise secured can be read by anybody who can find it. The robots.txt file only keeps it from showing up in the search engines.

The spider looks for the robots.txt file at the top level of your site, where a site is defined as an HTTP server running on a particular host. For example, at the address http://xyz.com the spider will look for the file at http://xyz.com/robots.txt. Also please note that URLs are case sensitive, and the file name "robots.txt" must be all lower-case.

Here is a sample from our robots.txt file:

    # robots.txt for http://www.inter800.com/
    User-agent: *
    Disallow: /5551212/
    Disallow: /800data/
    Disallow: /cgi-bin/

In this example, three directories are excluded. You need a separate "Disallow" line for every URL prefix you want to exclude. You can't list more than one path on a line.

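If you would like to check how a spider might read the sample file above, here is a minimal sketch in Python using the standard library's urllib.robotparser module. It is only an illustration, not how any particular search engine works. The spider name "ExampleSpider", the helper names robots_url and build_parser, and the test paths such as /800data/list.html are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# The sample rules from this newsletter, kept in a string so nothing is fetched.
SAMPLE_RULES = """\
# robots.txt for http://www.inter800.com/
User-agent: *
Disallow: /5551212/
Disallow: /800data/
Disallow: /cgi-bin/
"""


def robots_url(page_url):
    """Return the robots.txt address for the server that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def build_parser(rules_text):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(SAMPLE_RULES)
base = "http://www.inter800.com"

# Hypothetical paths: one outside the excluded directories, two inside them.
for path in ("/index.html", "/800data/list.html", "/cgi-bin/search"):
    allowed = parser.can_fetch("ExampleSpider", base + path)
    print(path, "allowed" if allowed else "disallowed")
```

Each Disallow line is treated as a prefix, so anything whose path starts with /800data/ is excluded while /index.html remains fair game.
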
The '*' in the User-agent field is a special value meaning "any robot". It is not a general-purpose wildcard. Specifically, you cannot have lines like "Disallow: /800data/*" or "Disallow: *.jpg".

If you wanted to exclude all spiders from the entire server, your file would look like this:

    # robots.txt for http://www.inter800.com/
    User-agent: *
    Disallow: /

What you want to exclude depends on your needs. Just remember that everything not explicitly disallowed is considered okay to index.

All Contents Copyright ©1995-2001 The Internet 800 Directory