Here is the robots.txt of Google (google.com)
40 points by coliveira on July 8, 2009 | 33 comments


We actually have a surprising number of customers come to us at 80legs wanting to crawl google search results. I don't think most of them are trying to reverse engineer google or anything like that. Most probably just want a fast way to find relevant topics to crawl.

They are disappointed when they learn we obey robots.txt, so we have them manually do searches to pull out seed lists for their 80legs crawls. It's a pain, but there's not really a way around it within the rules.
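(A minimal sketch of what obeying robots.txt looks like in practice, using Python's standard urllib.robotparser; the user-agent string and search URL here are made-up examples, not anything 80legs actually uses.)

    import urllib.robotparser

    # Fetch and parse the site's robots.txt once, then consult it per URL.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://www.google.com/robots.txt")
    rp.read()

    # A compliant crawler skips any URL the file disallows for its agent.
    url = "http://www.google.com/search?q=seed+topics"
    if rp.can_fetch("ExampleCrawler", url):
        print("allowed:", url)
    else:
        print("disallowed by robots.txt:", url)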


I always figured that the company that spiders everybody else's content should have a more relaxed policy towards being spidered itself.

After all, Google is data-mining the web on an ongoing basis; it should willingly consent to being mined in return.


Yes, but...

For the most part "Google" is a condensed version of the web. In theory, you could "spider" a single site (google.com) and build your own search database without having to go out and crawl the web at large.

It's an odd paradox, but I think one search site crawling another search site is not a good idea. And there is probably an infinite spider loop hiding in that process somewhere.


Not like anyone writing a spider has to obey the robots.txt file.


Sure, and their IP will be blocked by Google some time around the 10th request.


Next time you're looking at a PC infected with malware, have a look at the network traffic using a sniffer; chances are pretty good that you'll see searches to Google for the weirdest of terms. Apparently this is to get around the limitation that you mention. I'm assuming the results of such searches are 'mailed home' through some kind of dead drop.


Sure, but responding to the parent, that's why people don't crawl Google, not the Robots.txt.

And incidentally, some of those searches are to find forms that the botnet can stuff links into. My sites are constantly getting hit by botnets searching for Drupal comment forms. Luckily Drupal uses a quirky URL format that's easy for mod_security to block.
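(For what it's worth, a hypothetical ModSecurity rule along those lines; the /comment/reply/ path is an assumption about Drupal's comment-form URLs, not the commenter's actual configuration.)

    # Deny requests probing Drupal-style comment-reply URLs (path assumed)
    SecRule REQUEST_URI "@beginsWith /comment/reply/" \
        "id:100001,phase:1,deny,status:403,msg:'comment-form spam probe'"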


I smell a business opportunity here! I could develop a Digital Robots.txt Management platform that will be built into all server operating systems and would enforce that no spiders are allowed to disregard robots.txt. I'm a genius! Oh, wait...


And there is probably an infinite spider loop hiding in that process somewhere.

And then, this strange loop attains self-awareness... (GEB reference)


Though you could end up with a load of URLs, titles, and summary text, crawling Google won't really get you the important data for building a search engine: no follow-on links, only a subset of the text, no referrals, etc.

Love the 'paradox' of the infinite spider loop, hadn't considered that before.


Most spiders implement a maximum depth in their search though. So you'd really just get an inefficient crawl, not an infinite one.
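(A sketch of that depth cap in Python; fetch_links and MAX_DEPTH are illustrative names, with the actual HTTP fetching and link extraction left to the caller.)

    from collections import deque

    MAX_DEPTH = 3  # illustrative cap; real crawlers tune this

    def crawl(seed_url, fetch_links):
        """Breadth-first crawl that stops expanding at MAX_DEPTH, so a
        self-referential site (or two search engines indexing each other)
        produces an inefficient crawl rather than an infinite one."""
        seen = {seed_url}
        queue = deque([(seed_url, 0)])
        while queue:
            url, depth = queue.popleft()
            if depth >= MAX_DEPTH:
                continue
            for link in fetch_links(url):  # caller supplies fetching + parsing
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return seen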


Of course: use Google as a seed list rather than crawling only Google. Not sure why I didn't think of that.


Crawling Groups would be interesting. What other archives of Usenet are there?


Also, Google had no problem crawling Groups when it was still called DejaNews.


Actually, a lot of search sites regularly pull down large search result sets from each other to compare against their own, though of course this isn't technically 'crawling' in the follow-every-link sense.


I think the goal is to prevent other robots from accidentally crawling the Google corpus. At the end of the day, robots.txt is a stab at comity. There are many well-known services which use crawlers that fail to obey it (or identify themselves). The file merely announces, "This is all recycled content." -- notice that none of the Google corporate pages are disallowed.

Aside: let's say you've whipped up a spiffy new ranking algorithm, and you just need an index to launch your search engine. What's faster: crawling the web, or crawling Google? I don't think such an entrepreneur would pass on a big speed-up just because of a text file.


Does Yahoo! not have a robots.txt file?

http://yahoo.com/robots.txt returns "Sorry, the page you requested was not found."


If that page doesn't exist, then according to the specification they don't have one, and you can crawl any page.
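(That's also how Python's urllib.robotparser behaves: a 404 for robots.txt is treated as "no restrictions". A quick sketch, using the yahoo.com URL from the thread.)

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://yahoo.com/robots.txt")
    rp.read()  # a 404 here is treated as "allow everything"

    # With no robots.txt at all, any path is considered fetchable.
    print(rp.can_fetch("*", "http://yahoo.com/anything"))  # True if the file is absent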


Update: Some Yahoo! subdomains have 'em.

Yes:

    http://search.yahoo.com/robots.txt
    http://groups.yahoo.com/robots.txt
    http://realestate.yahoo.com/robots.txt
No:

    http://maps.yahoo.com/robots.txt
    http://omg.yahoo.com/robots.txt


What is http://google.com/unclesam ?

Edit: must have just mistyped it, works fine.


Since nearly the very beginning of Google proper (i.e., not BackRub), they've had a separate search for government sites and for Linux. And also universities... I vaguely remember both a University of Michigan and a Stanford search. (Larry Page was an undergrad at UMich.)



It searches the .gov domain, because Uncle Sam's employees can't remember to put site:.gov in front of their searches?

Think of it as a shortcut.


I'd guess over 90% of Google users (and maybe 99%) don't know about "site:" Hell, most people can't even come up with a reasonable definition for "search engine".


This is even weirder: http://www.google.com/microsoft


I would guess it's the complement of:

http://www.google.com/linux


Maybe it's Google's way of saying "we're not even gonna go there"


It's not weird--it's very useful for looking up Windows support info or IE dev documentation, for example.


Looks like the default XP background, no?



What's so interesting about the workforce training fund?


The only thing I noticed that might have been funny was a Robot Chicken advertisement for videos :)


Not that this was difficult or anything, but here's a list of all of the links as links. Easier to investigate ;)

http://dpaste.org/TqnU/
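(Roughly what that list amounts to; a sketch that turns the Disallow/Allow paths in google.com/robots.txt into absolute URLs for poking around. Wildcard patterns like /*? aren't real pages, so they're skipped.)

    import urllib.request

    BASE = "http://www.google.com"
    text = urllib.request.urlopen(BASE + "/robots.txt").read().decode("utf-8")

    # Turn each Disallow/Allow path into an absolute URL for easy clicking.
    for line in text.splitlines():
        line = line.split("#")[0].strip()           # drop comments
        if line.lower().startswith(("disallow:", "allow:")):
            path = line.split(":", 1)[1].strip()
            if path and "*" not in path:            # skip wildcard patterns
                print(BASE + path)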



