> What incentive would Google have to continue populating that index?
Presumably they would still want to run google.com and make money off of it.
> Would I be breaking the law if I independently crawled and hosted an index without publishing an API for it?
No. But you would not get the advantage that Google gets when it crawls the web, so you would not have access to the large amount of data that nobody else has access to.
Updated based on edit of parent post:
> Maybe this is the problem that needs solving.
Why make websites waste money serving all those requests all over again? Why not have Google share the results, so that money can go toward more productive things than recreating that work? I don't think website operators would be happy if there were a hundred more crawlers out there, each crawling as much as Google does now.
Do any site operators actually block non-Google search engine crawlers because being listed on DDG/Bing/etc. isn't worth the extra cost of serving the crawler? It sounds a bit ridiculous unless they actually don't want to be found. Maybe they only allow GoogleBot because that's the only crawler they thought of, and the real extra cost is researching what all the other search engines call theirs.
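For what it's worth, a GoogleBot-only allowlist in robots.txt locks out every other crawler no matter what it calls itself. Here's a minimal sketch of how that looks to different bots, using Python's standard-library robotparser; the robots.txt rules, URL, and the last bot name are hypothetical:

```python
# How a Googlebot-only allowlist appears to different crawlers.
# The rules and URL below are made up for illustration.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for bot in ("Googlebot", "bingbot", "DuckDuckBot", "SomeStartupCrawler"):
    ok = parser.can_fetch(bot, "https://example.com/some/page")
    print(f"{bot:20} allowed: {ok}")
```

Only Googlebot gets through; every name the operator never bothered to research falls under the `*` rule.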
Perhaps other search engines should spoof GoogleBot. Browsers have been doing that forever, spoofing Netscape (Mozilla), Safari, etc. for the same reason.
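Mechanically that kind of spoofing is just a header change. A rough sketch, assuming the Python `requests` library and a placeholder URL (the UA string is the one Googlebot advertises):

```python
# Sends Googlebot's published user-agent string with an ordinary request.
# Assumes the third-party `requests` library; the URL is a placeholder.
import requests

GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; "
    "+http://www.google.com/bot.html)"
)

def fetch_as_googlebot(url: str) -> str:
    """Fetch a page while presenting Googlebot's user-agent string."""
    resp = requests.get(url, headers={"User-Agent": GOOGLEBOT_UA}, timeout=10)
    resp.raise_for_status()
    return resp.text

if __name__ == "__main__":
    print(fetch_as_googlebot("https://example.com")[:200])
```

Whether it actually works is another matter: this only gets past checks that look at the header alone, since operators can verify the real Googlebot with a reverse DNS lookup.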
> Why don't we have Google share the results and we can use that money to do more productive things than recreating that work?
This sounds like a common fallacy of people criticizing the free market. Duplicated effort looks wasteful, but it turns out to be far more productive than the lack of incentive that comes from not being able to profit from your own work and investment.
> Do any site operators actually block non-Google search engine crawlers because being listed DDG/Bing/etc isn't worth the extra cost of serving the crawler?
Many website operators actually do block crawlers from non-Google search engines, and it's because the cost of being crawled isn't worth it to them. Here's a good quote from one such webmaster:
As a webmaster I get a bit tired of constantly having to deal with the startup crawler du jour.
From law firms looking for DMCA violations to vertical search engines, to image aggregators, to company intelligence resellers… It feels to me that everybody and their brother has gotten into spidering sites.
With 10,000s of pages that have content that is only relevant to a targeted audience who is perfectly able to find us on the majors, I do not hesitate to block (and possibly ban) when I see an aggressive crawler that does not provide me or my customers with direct benefits.
> Perhaps other search engines should spoof GoogleBot. Browsers have been doing that forever, spoofing Netscape (Mozilla), Safari, etc. for the same reason.
> This sounds like a common fallacy of people criticizing the free market.
I am asserting that crawling the web is a natural monopoly. This means that the free market has failed and that it is not possible for the market to heal itself in this regard. There is significant evidence that this is the case and I imagine you'll be hearing more and more about it soon.
I would think the site owner’s cost of being indexed is the same for every search engine that indexes the site.
The benefit varies with the quality of each search engine, and it gets larger the more a search engine is used, so a cost/benefit analysis may show that Google and a few other large ones are the only ones worth supporting.