Dumpster Diving for robots.txt Files

3/18/2006 @ 10:09pm
You might be surprised by what you can find these days hiding in obscure web files such as robots.txt. As a brief intro, webmasters use the robots.txt file to tell search engine crawlers which pages on their site should be ignored. Like most plain-text configuration files, robots.txt can include comments. The geek in me found it interesting to pull the robots.txt file from a few popular sites, just to see what's there. Check this out:

Alexa.com
They block all of their search engine colleagues from indexing their own search results, which I think is a little ironic. Their list of robots is somewhat dated, though.

Webmasterworld.com
An entire blog hidden within the robots.txt file? It's like looking at an ezine from the dial-up days. Even more amazing, it appears to be updated daily. There is even an advertisement banner! We're talking about a robots.txt file here.

Google.com
A long list of URLs, some more interesting than others. At least you can tell what they consider important enough to keep out of the search engines. This one stuck out, though:

/microsoft

What could that be? Last time I checked, those two were sworn corporate enemies. Curious, I navigated over to the link. I am somewhat confused by the resulting page, and even more confused by the title bar:

Microsoft - Google Search

Isn't that a copyright violation?
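
If you want to do this kind of digging yourself, here is a minimal Python sketch of the idea. It pulls the comments and Disallow paths out of a robots.txt file and asks the standard library's urllib.robotparser whether a URL is blocked. The sample file contents, the "ExampleBot" user agent, the example.com URLs, and the helper names are all made up for illustration; none of them come from the sites mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example only.
SAMPLE_ROBOTS_TXT = """\
# Hello, curious human. Nothing to see here.
User-agent: *
Disallow: /search   # keep result pages out of the index
Disallow: /private/
"""


def extract_comments(robots_txt: str) -> list[str]:
    """Return the text of every '#' comment, whole-line or trailing."""
    comments = []
    for line in robots_txt.splitlines():
        _, sep, comment = line.partition("#")
        if sep and comment.strip():
            comments.append(comment.strip())
    return comments


def disallowed_paths(robots_txt: str) -> list[str]:
    """Return every non-empty Disallow value, ignoring comments."""
    paths = []
    for line in robots_txt.splitlines():
        rule = line.split("#", 1)[0].strip()  # drop any trailing comment
        field, sep, value = rule.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(extract_comments(SAMPLE_ROBOTS_TXT))
    print(disallowed_paths(SAMPLE_ROBOTS_TXT))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", "http://example.com/search"))
```

To point it at a live site, you could read the text with urllib.request.urlopen and pass it to the same functions. Keep in mind that this only covers the basic comment and Disallow syntax, so treat it as a starting point rather than a complete parser.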