Here is the robots.txt of Google (google.com)
40 points by coliveira on July 8, 2009 | 33 comments


We actually have a surprising number of customers come to us at 80legs wanting to crawl google search results. I don't think most of them are trying to reverse engineer google or anything like that. Most probably just want a fast way to find relevant topics to crawl.

They are disappointed when they learn we obey robots.txt, so we have them manually do searches to pull out seed lists for their 80legs crawls. It's a pain, but there's not really a way around it within the rules.
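(A minimal sketch of what obeying robots.txt looks like in practice, using Python's standard urllib.robotparser; the user-agent string and search URL here are made-up examples, not anything 80legs actually uses.)

    import urllib.robotparser

    # Fetch and parse the site's robots.txt once, then consult it per URL.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.google.com/robots.txt")
    rp.read()

    # A compliant crawler skips any URL the file disallows for its agent.
    url = "http://www.google.com/search?q=seed+topics"
    if rp.can_fetch("ExampleCrawler", url):
        print("allowed:", url)
    else:
        print("disallowed by robots.txt:", url)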


I always figured that the company that spiders everybody else's content should have a more relaxed policy towards being spidered itself.

After all, Google is data-mining the web on an ongoing basis; it should willingly consent to being mined in return.


Yes, but...

For the most part "Google" is a condensed version of the web. In theory, you could "spider" a single site (google.com) and build your own search database without having to go out and crawl the web at large.

It's an odd paradox, but I think one search site crawling another search site is not a good idea. And there is probably an infinite spider loop hiding in that process somewhere.


Not like anyone writing a spider has to obey the robots.txt file.


Sure, and their IP will be blocked by Google some time around the 10th request.


Next time you're looking at a PC infected with malware, have a look at the network traffic using a sniffer; chances are pretty good that you'll see searches to Google for the weirdest of terms. Apparently this is to get around the limitation that you mention. I'm assuming the results of such searches are 'mailed home' through some kind of dead drop.


Sure, but responding to the parent, that's why people don't crawl Google, not the Robots.txt.

And incidentally, some of those searches are to find forms that the botnet can stuff links into. My sites are constantly getting hit by botnets searching for Drupal comment forms. Luckily Drupal uses a quirky URL format that's easy for mod_security to block.
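(For what it's worth, a hypothetical ModSecurity rule along those lines; the /comment/reply/ path is an assumption about Drupal's comment-form URLs, not the commenter's actual configuration.)

    # Deny requests probing Drupal-style comment-reply URLs (path assumed)
    SecRule REQUEST_URI "@beginsWith /comment/reply/" \
        "id:100001,phase:1,deny,status:403,msg:'comment-form spam probe'"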


I smell a business opportunity here! I could develop a Digital Robots.txt Management platform that will be built into all server operating systems and would enforce that no spiders are allowed to disregard robots.txt. I'm a genius! Oh, wait...


And there is probably an infinite spider loop hiding in that process somewhere.

And then, this strange loop attains self-awareness... (GEB reference)


Though you could end up with a load of URLs, titles, and summary text, crawling Google won't really get you the important data for building a search engine: no follow-on links, only a subset of the text, no referrals, etc.

Love the 'paradox' of the infinite spider loop, hadn't considered that before.


Most spiders implement a maximum depth in their search though. So you'd really just get an inefficient crawl, not an infinite one.
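(A sketch of that depth cap in Python; fetch_links and MAX_DEPTH are illustrative names, with the actual HTTP fetching and link extraction left to the caller.)

    from collections import deque

    MAX_DEPTH = 3  # illustrative cap; real crawlers tune this

    def crawl(seed_url, fetch_links):
        """Breadth-first crawl that stops expanding at MAX_DEPTH, so a
        self-referential site (or two search engines indexing each other)
        produces an inefficient crawl rather than an infinite one."""
        seen = {seed_url}
        queue = deque([(seed_url, 0)])
        while queue:
            url, depth = queue.popleft()
            if depth >= MAX_DEPTH:
                continue
            for link in fetch_links(url):  # caller supplies fetching + parsing
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return seen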


Of course: use Google as a seed list rather than crawling only Google. Not sure why I didn't think of that.


Crawling Groups would be interesting. What other archives of Usenet are there?


Also, Google had no problem crawling Groups when it was still called DejaNews.


Actually, a lot of search sites regularly pull down large search result sets from each other to compare against their own, though of course this isn't technically 'crawling' in the follow-every-link sense.


I think the goal is to prevent other robots from accidentally crawling the Google corpus. At the end of the day, robots.txt is a stab at comity. There are many well-known services which use crawlers that fail to obey it (or identify themselves). The file merely announces, "This is all recycled content." -- notice that none of the Google corporate pages are disallowed.

Aside: let's say you've whipped up a spiffy new ranking algorithm, and you just need an index to launch your search engine. What's faster: crawling the web, or crawling Google? I don't think such an entrepreneur would pass on a big speed-up just because of a text file.


Does Yahoo! not have a robots.txt file?

http://yahoo.com/robots.txt returns "Sorry, the page you requested was not found."


If that page doesn't exist, then according to the specification they don't have one, and you can crawl any page.
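(That's also how Python's urllib.robotparser behaves: a 404 for robots.txt is treated as "no restrictions". A quick sketch, using the yahoo.com URL from the thread.)

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://yahoo.com/robots.txt")
    rp.read()  # a 404 here is treated as "allow everything"

    # With no robots.txt at all, any path is considered fetchable.
    print(rp.can_fetch("*", "http://yahoo.com/anything"))  # True if the file is absent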


Update: Some Yahoo! subdomains have 'em.

Yes:

    http://search.yahoo.com/robots.txt
    http://groups.yahoo.com/robots.txt
    http://realestate.yahoo.com/robots.txt
No:

    http://maps.yahoo.com/robots.txt
    http://omg.yahoo.com/robots.txt


What is http://google.com/unclesam ?

Edit: must have just mistyped it, works fine.


Since nearly the very beginning of Google proper (i.e., not BackRub), they've had a separate search for government sites and for Linux. And also universities... I vaguely remember both a University of Michigan and a Stanford search. (Larry Page was an undergrad at UMich.)



It searches the .gov domain, because Uncle Sam's employees can't remember to put site:.gov in front of their searches?

Think of it as a shortcut.


I'd guess over 90% of Google users (and maybe 99%) don't know about "site:" Hell, most people can't even come up with a reasonable definition for "search engine".


This is even weirder: http://www.google.com/microsoft


I would guess it's the complement of:

http://www.google.com/linux


Maybe it's Google's way of saying "we're not even gonna go there"


It's not weird--it's very useful for looking up Windows support info or IE dev documentation, for example.


Looks like the default XP background, no?



What's so interesting about the workforce training fund?


The only thing I noticed that might have been funny was a Robot Chicken advertisement for videos :)


Not that this was difficult or anything, but here's a list of all of the links as links. Easier to investigate ;)

http://dpaste.org/TqnU/
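(Roughly what that list amounts to; a sketch that turns the Disallow/Allow paths in google.com/robots.txt into absolute URLs for poking around. Wildcard patterns like /*? aren't real pages, so they're skipped.)

    import urllib.request

    BASE = "http://www.google.com"
    text = urllib.request.urlopen(BASE + "/robots.txt").read().decode("utf-8")

    # Turn each Disallow/Allow path into an absolute URL for easy clicking.
    for line in text.splitlines():
        line = line.split("#")[0].strip()           # drop comments
        if line.lower().startswith(("disallow:", "allow:")):
            path = line.split(":", 1)[1].strip()
            if path and "*" not in path:            # skip wildcard patterns
                print(BASE + path)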



