We need more alternatives to Google, not fewer. On the one hand, it's great for Blekko to have an exit of sorts; on the other, it's a huge loss for the web. Now it's more or less down to DuckDuckGo and the likes of gigablast.com (which is now open source).
True, but there are some systemic issues that make that pretty challenging. Google is now paying about $4B a year to people to host (or white label) a search box that goes back to Google. Then if you want to offer advertising on your search, you generally have to get it from Google too, but that advertising isn't as valuable as it once was, and that is a problem for Google.
But perhaps the most interesting thing I've learned from my 5 years at Blekko doing search is that "phase 2 or 3" of the Internet is here. Crawling everything is a waste of resources because 95% of the "new" stuff coming online is not information, it is just spam. Combine that with advertiser burnout from getting scammed again and again on advertising that claims to generate leads or sales but leads only to click farms. The whole ecosystem of the web is being rocked, along with media distribution. The world is approaching some sort of climactic shift of orientation.
Blekko's key mission has always been to find the needles in this exponentially growing pile of hay. And that is something the folks at Watson really liked about our technology when we first met at their outreach program to connect with startups. That is what led to their asking us to join them, and no, they weren't particularly interested in the stuff we had done to provide more topical advertising signals. So as a technologist this has been both a validation of our work on finding the real information in the web and making it useful to people, and, when I'm honest with myself, a welcome step away from working on still more advertising technology.
Having the resources to pursue that, and an engine (Watson) that can put it to use, seems pretty exciting.
The web turned out that way because of Google. Until Google came along, the worst we had was on-page keyword spam. Google assigned a value to the links that make up the web, and as a result those links were spammed. Hopefully Watson will be able to create a search feature that cannot be spammed, but so far anything that ends up being a metric that web pages are rated by ends up being spammed.
Best of luck! At least you're fighting the good fight.
Agreed. There are only a few players with a large web index now. Google, Bing, Yandex, Gigablast, Baidu, Naver.
Others that I can think of are IXQuick (not sure if they use their own index or not) and Yioop (smallish index). There really is a lack of large players indexing the web.
It's also possible that we will see the rise of niche search engines, such as iconfinder.
There are claims that Google indexes 150B pages today. Even if you settle for a third of that and take into account that the average page size for HTML alone is 200KB, it becomes evident that you need tens of millions of dollars for hardware alone. Remember that fetching pages, storing them, processing them, etc. costs lots of money, and building even a semi-scalable index with in-memory and SSD-based layers is super expensive. And we haven't even started on the major cost factor, i.e. human employees.
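As a rough back-of-envelope check on those numbers (the page count and average page size are from the estimate above; the per-TB storage price is purely my own assumption):

    # Back-of-envelope: raw HTML storage for a ~50B page crawl.
    pages = 50_000_000_000                 # ~1/3 of a claimed 150B-page index
    avg_page_kb = 200                      # average HTML size, per the estimate above
    raw_tb = pages * avg_page_kb / 1024 ** 3          # KB -> TB
    print(f"raw HTML: ~{raw_tb:,.0f} TB (~{raw_tb / 1024:.1f} PB)")

    # Assume (hypothetically) ~$30 per TB-month for replicated bulk storage.
    print(f"storage alone: ~${raw_tb * 30:,.0f}/month")

And that is before crawl bandwidth, index structures, replicas, or recrawling, which is where the tens-of-millions figure starts to look plausible.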
Today's search engines are not just indexes of web pages. They give direct answers for queries like "when did Lincoln die", they show detailed street and satellite maps with Street View in search results, they act as a business directory, they act as a people finder, they have elaborate image search (again, super expensive to do), they recognize objects in images, they have the freshest news search across thousands of sources, they can do video search, they have scans of hundreds of thousands of books, they catalog millions of products, and so on. Even getting some of this high-quality data such as satellite imagery, books, business listings, product catalogues, etc. would cost ~100 million dollars in licensing deals.
Even if you managed to hire the world's most productive programmers, you would need at least 300 people by my most minimal estimation, and more than 2-4 years, before you could build anything that has a non-negligible chance of competing. That's about $75M of cost per year right there at the ultra-minimal end. Of course, this assumes you already had a bigger breakthrough than PageRank and that you can beat state-of-the-art machine learning and natural language processing techniques. Hopefully you can now see that, for all intents and purposes, the search business is closed to attack via startups. I can imagine Facebook and Apple will sooner or later get into this business, but it would be an uphill battle for them, primarily because of the lack of talent that the search engine business requires and, more importantly, the lack of data that only Google has for billions of queries and the users making them. You can build an AirBnB with smart college hires, but building a search engine needs truly the cream of the crop: the rare blend of computer scientist and part-time mathematician who is also an exceptionally productive applied programmer. Google has been working for years to sweep up pretty much all the talent in this area, and it would take most competitors significant effort to build that kind of army.
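For what it's worth, the $75M/year figure is consistent with a fully loaded cost of roughly $250K per engineer; only the 300-person headcount and the $75M total come from the estimate above, the per-head number is just what falls out of them:

    # Sanity check on the headcount math above.
    engineers = 300
    annual_budget = 75_000_000          # $75M/year at the ultra-minimal end
    print(f"implied fully loaded cost: ~${annual_budget / engineers:,.0f} per engineer per year")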
PS: DuckDuckGo is not a "real" search engine. It has very little of its own index, and for most queries they just rearrange results fetched mostly from Bing while inserting links from their own small index. If Bing shut them out for leeching off of them, they would be toast.
A serious contender to Google isn't going to be a better Google (or even a comparable one), exactly for the reasons you mention. It's going to be something else altogether.
I can't tell you what the answer is, but I am pretty sure that it's outside the box you've built yourself into.
Great write-up and breakdown of why search is hard and how unique a position Google is in. That said, searching the entire web as a business isn't very feasible for most companies now, and hence most of the other giants competing with Google instead use the asymmetric approach of bringing content into their own garden and building a specialized search engine around it. Since the data in their own garden is structured, the amount of work to build an acceptable-quality search engine is relatively small. See Apple's App Store, Facebook search, Twitter's search, Amazon's A9, etc.
Cuil (ha!) donated most of their crawled data to the Internet Archive - so with that plus Common Crawl, it's pretty easy for anyone to have a sizeable index.
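If anyone wants to poke at Common Crawl data without standing up a cluster, something like the sketch below works: query the public CDX index for a URL, then pull just that capture out of its WARC file with a range request. The crawl ID is only an example, and requests plus warcio are assumed to be installed.

    # Look up a URL in the Common Crawl index, then fetch only that page's
    # WARC record via an HTTP range request. The crawl ID below is an example;
    # see commoncrawl.org for the list of current crawls.
    import json
    import requests
    from warcio.archiveiterator import ArchiveIterator

    index_url = "https://index.commoncrawl.org/CC-MAIN-2015-06-index"
    resp = requests.get(index_url, params={"url": "example.com", "output": "json"})
    capture = json.loads(resp.text.splitlines()[0])          # first capture of the URL

    start = int(capture["offset"])
    end = start + int(capture["length"]) - 1
    warc = requests.get(
        "https://data.commoncrawl.org/" + capture["filename"],
        headers={"Range": f"bytes={start}-{end}"},
        stream=True,
    )
    for record in ArchiveIterator(warc.raw):
        print(record.rec_headers.get_header("WARC-Target-URI"))
        print(record.content_stream().read()[:200])          # first bytes of the payload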
What neither of those gives you is an up-to-date index, which is why small search engines still need Yahoo's and Yandex's APIs. I'm not sure any free resource can match the speed at which the big companies index the web.
Common Crawl is more interesting for research purposes than for a real search engine business. It refreshes only 4 times a year and has only 2 billion pages. Anything under ~10B pages is most likely to be called a "research" or "toy" index, and you have almost zero chance of building a competitive index if you are below ~15B. Google is rumored to crawl about 150B pages, for reference. There are an estimated 1.2T pages out there that do not get crawled for various reasons.
True, you just need a subset. Now how do you identify that subset without indexing the pages to find out whether each page is in the subset you need?
IIRC Google used to scan different pages at very different frequencies. Quite possibly because it assigns pages to subsets every time it indexes.
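One illustrative way to get that behaviour, purely as a sketch of the general idea rather than anything Google actually does: derive a recrawl interval from some per-page importance/change score and keep a priority queue of due times.

    # Toy recrawl scheduler: high-scoring pages come due for refetch sooner.
    # The score itself is a placeholder; real systems fold in change history,
    # link value, click data, and so on.
    import heapq
    import time

    def recrawl_interval(score, base=30 * 86400, floor=3600):
        """Map a 0..1 importance score to seconds between crawls."""
        return max(floor, base * (1.0 - score))

    pages = {"https://news.example.com/": 0.95, "https://static.example.com/about": 0.10}
    now = time.time()
    queue = [(now + recrawl_interval(score), url) for url, score in pages.items()]
    heapq.heapify(queue)

    due, url = heapq.heappop(queue)
    print(f"next fetch: {url} in {due - now:.0f}s")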
Perhaps. The problem is that crawling the web isn't that hard. Foreach(links) { get(link)} is all you really need. Doing it multiple times is harder though.
The real problem imho is building a distributed index and fighting spam. Both are incredibly hard problems to do well and very expensive. Hence so few are trying to do it.
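To make the "foreach(links) { get(link) }" point concrete, a single-machine version really is about this small. This is only a sketch (requests and BeautifulSoup are assumed installed), and it deliberately omits robots.txt, politeness delays, dedup, retries and scheduling, which are exactly the parts that make doing it repeatedly, at scale, hard.

    # Minimal breadth-first crawler: the naive loop, nothing more.
    from collections import deque
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    def crawl(seed, limit=50):
        seen, frontier = {seed}, deque([seed])
        while frontier and len(seen) <= limit:
            url = frontier.popleft()
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append(link)
        return seen

    print(len(crawl("https://example.com/")))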
I think providing quality results is the challenge. That takes certain algorithms and optimizations to do in a timely fashion. This is where AI projects like Watson really come into play, as they are all about retrieving the correct information (a difficult problem on its own) in a timely fashion.
Also true; however, I think a basic ranking algorithm would still be fine for most cases if you had 100% spam eradication. The good content left over should bubble to the top.
I tried building a small search engine with a very basic algorithm and it worked very well for 90% of searches.
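For anyone wondering what "very basic" can look like in practice, a tiny in-memory inverted index with TF-IDF scoring is roughly that level of sophistication. This is just an illustrative sketch, not the commenter's actual implementation:

    # Tiny in-memory inverted index with TF-IDF scoring: good content
    # ranks fine as long as there is no adversarial spam in the corpus.
    import math
    import re
    from collections import Counter, defaultdict

    docs = {
        "a": "abraham lincoln was the 16th president of the united states",
        "b": "the lincoln memorial is in washington",
        "c": "buy cheap watches cheap cheap watches",
    }

    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    index = defaultdict(dict)                 # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf

    def search(query):
        scores = Counter()
        for term in tokenize(query):
            postings = index.get(term, {})
            if not postings:
                continue
            idf = math.log(len(docs) / len(postings))    # rarer terms weigh more
            for doc_id, tf in postings.items():
                scores[doc_id] += tf * idf
        return scores.most_common()

    print(search("lincoln president"))        # doc "a" should come out on top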
I agree with you in part but I also believe if you got the others 100% right you would have a respectable engine by itself.
Where did you manage to find their donation numbers from last year? I thought they'd stopped disclosing their donation amounts a few years ago (even then they were in the 6 figures for revenue).
I can't remember, sorry. The numbers were published as 6 donations of $25,000 to different organisations. I got the link from Twitter. I think one of the groups was a women's interest group, if that helps.
It really isn't (been there, done that). The hard part is to operate it. You're going to need quite a bit of hardware (figure about 10 large machines minimum to store an index of any appreciable size) and a ton of bandwidth to do the crawling. Then you're going to have to somehow get an audience large enough to earn enough money to pay for it all. Your runway will be in the tens of millions of bucks at the size of today's internet.
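To put the "10 large machines" figure in perspective, here is a rough sizing sketch; every number in it is an assumption for illustration, not a measurement:

    # Rough per-machine sizing for a modest (~10B document) index.
    documents = 10_000_000_000
    index_bytes_per_doc = 4 * 1024        # postings + metadata, assumed ~4KB/doc
    machines = 10

    total_tb = documents * index_bytes_per_doc / 1024 ** 4
    print(f"index: ~{total_tb:.0f} TB total, ~{total_tb / machines:.1f} TB per machine")

And that is before replication, the crawl store, or any serving-side caches, which is how the runway climbs into the tens of millions.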