We actually have a surprising number of customers come to us at 80legs wanting to crawl google search results. I don't think most of them are trying to reverse engineer google or anything like that. Most probably just want a fast way to find relevant topics to crawl.
They are disappointed when they learn we obey robots.txt, so we have them manually do searches to pull out seed lists for their 80legs crawls. It's a pain, but there's not really a way around it within the rules.
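For anyone wondering what "obeying robots.txt" means mechanically, here's a minimal Python sketch using the stdlib robot parser (the user-agent string is just a placeholder, not a real 80legs agent):

    # Check whether a generic crawler may fetch Google search results,
    # according to Google's robots.txt. Stdlib only, no error handling.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://www.google.com/robots.txt")
    rp.read()

    # "MyCrawler" is a placeholder user-agent string.
    allowed = rp.can_fetch("MyCrawler", "https://www.google.com/search?q=web+crawling")
    print(allowed)  # False, as long as /search stays disallowed for generic agents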
For the most part "Google" is a condensed version of the web. In theory, you could "spider" a single site (google.com) and build your own search database without having to go out and crawl the web at large.
It's an odd paradox, but I think one search site crawling another search site is not a good idea. And there is probably an infinite spider loop hiding in that process somewhere.
Next time you're looking at a PC infected with malware, have a look at the network traffic with a sniffer; chances are pretty good you'll see searches to Google for the weirdest of terms. Apparently this is to get around the limitation you mention. I'm assuming the results of such searches get 'mailed home' through some kind of dead drop.
Sure, but responding to the parent: that's why people don't crawl Google, not robots.txt.
And incidentally, some of those searches are to find forms the malware can stuff links into. My sites are constantly getting hit by botnets searching for Drupal comment forms. Luckily Drupal uses a quirky URL format that's easy for mod_security to block.
I smell a business opportunity here! I could develop a Digital Robots.txt Management platform that will be built into all server operating systems and would enforce that no spiders are allowed to disregard robots.txt. I'm a genius! Oh, wait...
Though you could end up with a load of URLs, titles, and summary text, crawling Google won't really get you the important data for building a search engine.
No follow-on links, only a subset of the text, no referrals, etc.
Love the 'paradox' of the infinite spider loop, hadn't considered that before.
Actually, a lot of search sites regularly pull down large search result sets from each other to do comparisons against their own, though of course this isn't technically 'crawling' in the follow-every-link sense.
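That kind of comparison is mechanically trivial once you have the two result lists; a toy sketch (the URLs are made up, just to show the overlap math):

    # Toy comparison of two engines' top-N results for the same query,
    # using Jaccard overlap of the returned URLs. URLs are illustrative only.
    def jaccard(results_a, results_b):
        a, b = set(results_a), set(results_b)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    engine_a = ["http://example.com/1", "http://example.com/2", "http://example.org/3"]
    engine_b = ["http://example.com/2", "http://example.org/3", "http://example.net/4"]

    print(jaccard(engine_a, engine_b))  # 0.5 for these lists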
I think the goal is to prevent other robots from accidentally crawling the Google corpus. At the end of the day, robots.txt is a stab at comity. There are many well-known services which use crawlers that fail to obey it (or identify themselves). The file merely announces, "This is all recycled content." -- notice that none of the Google corporate pages are disallowed.
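If you want to check that yourself, it's a few lines to dump the Disallow rules (stdlib only, no error handling):

    # Print the Disallow lines from Google's robots.txt; per the comment
    # above, the search results path shows up but the corporate pages don't.
    import urllib.request

    with urllib.request.urlopen("https://www.google.com/robots.txt") as resp:
        for line in resp.read().decode("utf-8").splitlines():
            if line.lower().startswith("disallow"):
                print(line)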
Aside: let's say you've whipped up a spiffy new ranking algorithm, and you just need an index to launch your search engine. What's faster: crawling the web, or crawling Google? I don't think such an entrepreneur would pass on a big speed up just because of a text-file.
Since nearly the very beginning of Google proper (i.e., not BackRub), they've had a separate search for government sites and for Linux. And also universities... I vaguely remember both a University of Michigan and a Stanford search. (Larry Page was an undergrad at UMich.)
I'd guess over 90% of Google users (and maybe 99%) don't know about "site:" (e.g., site:umich.edu admissions only returns pages from umich.edu). Hell, most people can't even come up with a reasonable definition for "search engine".