What astounds me is that there are no readily available libraries crawler authors can reach for to parse robots.txt and meta robots tags, decide what is allowed, and work through the arcane, poorly documented priorities between the two sets of rules, including what to do when they disagree, which they often do.
Yes, there's an ancient Google reference parser in C++11 (which is undoubtedly handy for that one guy who is writing crawlers in C++), but not a lot for the much more prevalent Python and JavaScript crawler writers who just want to check if a path is ok or not.
Even if bot writers WANT to be good, it's much harder than it should be, particularly when lots of the robots info isn't even in the robots.txt files; it's in the index.html meta tags.
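
For the narrow "is this path ok" case, here is a minimal Python sketch that combines the standard library's `urllib.robotparser` with a small `html.parser` subclass for meta robots tags. The names `MetaRobotsParser`, `check_page`, and `examplebot` are made up for illustration, and the precedence rule (robots.txt is checked first, and a blocked page's meta tags are treated as unknown because the page is never fetched) is one common interpretation, not a documented standard. It also ignores bot-specific meta tags and the `X-Robots-Tag` header.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def check_page(robots_txt, html, user_agent, url):
    """Hypothetical helper: combine robots.txt rules with meta robots tags.

    Assumption: robots.txt wins on fetching. If it disallows the URL, the
    page is never downloaded, so its meta directives are unknown (None).
    """
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        return {"fetch": False, "index": None, "follow": None}

    meta = MetaRobotsParser()
    meta.feed(html)
    found = meta.directives
    return {
        "fetch": True,
        # "none" is shorthand for "noindex, nofollow"
        "index": not ({"noindex", "none"} & found),
        "follow": not ({"nofollow", "none"} & found),
    }


robots = "User-agent: *\nDisallow: /private/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(check_page(robots, page, "examplebot", "https://example.com/public/index.html"))
print(check_page(robots, page, "examplebot", "https://example.com/private/report.html"))
```

In a real crawler the robots.txt text and the HTML would come from HTTP responses; they are inlined here so the sketch runs without network access.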
rel=nofollow is a bad name. It doesn’t actually forbid following the link and doesn’t serve the same purpose as robots.txt.
The problem it was trying to solve was that spammers would add links to their site anywhere they could, and Google would treat each of those links as the hosting page endorsing the linked page as relevant content. rel=nofollow basically means “we do not endorse this link”. The specification makes this clearer:
> By adding rel="nofollow" to a hyperlink, a page indicates that the destination of that hyperlink should not be afforded any additional weight or ranking by user agents which perform link analysis upon web pages (e.g. search engines).
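
To make the distinction concrete, here is a rough Python sketch of how a link-analysis pass might treat rel=nofollow: the link is still recorded and could still be crawled, it just is not counted as an endorsement. The class name `LinkCollector` and its two lists are invented for this example and do not reflect how any particular search engine works.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Splits a page's links into endorsed and unendorsed (rel=nofollow)."""

    def __init__(self):
        super().__init__()
        self.endorsed = []    # links that may pass ranking weight
        self.unendorsed = []  # links marked rel="nofollow"

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="nofollow ugc"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.unendorsed.append(href)
        else:
            self.endorsed.append(href)


collector = LinkCollector()
collector.feed(
    '<a href="/docs">docs</a>'
    '<a rel="nofollow ugc" href="https://spam.example/">cheap pills</a>'
)
print(collector.endorsed)
print(collector.unendorsed)
```

Nothing in this sketch prevents a crawler from requesting the unendorsed URLs; whether a URL may be fetched is a separate question answered by robots.txt.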
> nofollow is a bad name […] does not mean the same as robots exclusion standards
The "good" bot writers rarely have enough resources to demolish servers blindly, and are generally more careful whether or not you make it easier, so there's not much incentive.