10.27.2003

No White House webcrawling for "Iraq"

UPDATE 10/28: Jim Liedeka writes in to explain:
A robots.txt file is a set of rules that web crawlers are expected to follow when indexing a site. When Google or AltaVista or others crawl a web site, they are supposed to read the robots.txt file and do what it says. The file describes resources that should not be indexed by the crawler.

There is nothing compelling anyone to do that, but it's bad manners to ignore it. Mostly, everyone plays by the rules (except spammers). If I wanted to point a web crawler at the whitehouse.gov site and ignore the robots.txt file, there's nothing preventing me from doing so. It's just a convention that everyone follows.

Usually, these files are used to exclude dynamic content that it wouldn't make sense to index.
(Thanks, Jim.)
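
To make that concrete, here's a tiny sketch of the crawler's side of the bargain, using Python's standard urllib.robotparser module. The rules are made up for illustration (a hypothetical site keeping crawlers out of its dynamic /cgi-bin/ directory, the sort of thing Jim mentions):

    import urllib.robotparser

    # Hypothetical rules for illustration: keep all crawlers ("*") out of
    # the dynamic /cgi-bin/ directory; everything else is fair game.
    rules = [
        "User-agent: *",
        "Disallow: /cgi-bin/",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    # A well-behaved crawler asks before fetching each URL.
    print(parser.can_fetch("*", "http://example.com/cgi-bin/search"))  # False
    print(parser.can_fetch("*", "http://example.com/about.html"))      # True

Note that there's no enforcement anywhere in that code: can_fetch() just reports what the file asks for, and a rude crawler can skip the check entirely.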


Apparently the White House has monkeyed with its official web site to prevent directories containing the word "iraq" from being indexed by search engines. This somewhat tech-heavy (for a tech lightweight like me) page outlines how they did it. Can anyone tell me more about this? (Via Boing Boing.)
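
From what I can gather, the mechanism is nothing fancier than the robots.txt file Jim describes, sitting at the root of whitehouse.gov. An excerpt in the spirit of what's being reported might look something like this (the specific paths here are my own guesses for illustration, not quoted from the actual file):

    User-agent: *
    Disallow: /infocus/iraq
    Disallow: /news/releases/iraq

Each Disallow line asks every crawler ("User-agent: *") to stay out of the named directory, so pages under those paths never show up in search results.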
