>Google makes $40bn...If I can create something that just a tiny fraction of people find useful, then I can create a huge amount of value.
You conflate two meanings of value: monetary value and intrinsic value. Search engines are intrinsically but not monetarily valuable to users. Search engines are monetarily, but not intrinsically, valuable to advertisers. You can get into trouble when you conflate these two meanings of "value".
In fact, right here is the pivot on which the internet goes from an idealistic Shangri-La for geeks to a commercial hellscape for the unwashed masses. It is surprisingly easy to create intrinsic value with computers! You see it all day every day on HN: some geek has a thought, spends a weekend making it, and then deploys a solution.
It is surprisingly hard to extract monetary value from an intrinsically valuable solution. In fact, I believe that creating artificial scarcity is the hardest part of building an internet business, requiring invention on par with the intrinsically valuable part - and yet it's the very thing that idealists rail against.
(And making something artificially scarce does seem morally repugnant. And yet I don't see any other way to pay developers. Full stop. Open source software + consulting fees is a good way to go, but that can't apply to hosted search for the public. Well I guess it could, you could teach businesses how to game your own engine!)
> Search engines are intrinsically but not monetarily valuable to users. Search engines are monetarily, but not intrinsically, valuable to advertisers.
I'd like to offer a different view, one that thousands of subscribers at Kagi search can hopefully stand behind too. (Kagi founder here)
Searching has a monetary value and a cost, the only question is whether the user is paying for it or a third party is paying for the user (a cringy thought when you think about it).
This question is answered by the very business model of the search engine, which determines who its customer is. It can either be its users (as in the case of Kagi and mwmbl) or advertisers (as with most other search engines).
Although it is really hard to break through the habit of getting search for "free", we are at least happy to be able to offer this choice to the consumer today. I am expecting to see many more paid search engines in the future.
As someone who has been architecting a paid "deep search" tool for a couple of years now (my approach is trying to enable very deep search flexibility, like being able to grep files on your hard drive, but balancing that against usability at internet scale is very hard), I have been on board with the idea that "search is valuable enough to end users that they will pay" for a while. The issue isn't value, it's marketing it successfully enough to gather users at a rate that covers costs. I haven't launched my project even in private alpha to anyone because it still costs too much per search; for other search engines it will cost less, and hopefully we get a healthy ecosystem of search tools eventually.
Based on what I’m seeing in google search quality… their days of unquestioned dominance died a few years ago. The king is dead, the prince (bing) is on vacation somewhere, and the crown is up for grabs.
I have no idea what gp means by this. It can be surprisingly hard, but it's not always. Go buy a cooler of water bottles and sell them for 50c in the summer on the side of the street
It seems pretty obvious in an internet context. It's very difficult to make money with a product if the competition is giving their products away for free because it gets paid in a different way.
Even your "sell water on a hot day" idea probably won't sell a lot if you set up shop right next to an enormous promo stand from a global bottled water company that gives away bottled water for free. (And due to the magic of the internet, every spot you pick is right next to a global competitor with deep pockets)
No, even then it is the same concept. The big global bottled water company reckons that the advertising value of the giveaway is worth more than 50c, or maybe more than $50, per bottle. Now you have a bunch of people carrying its water bottles around, giving it even more advertising. People think "that water bottle company is what I should get" based on those carrying it around. So in fact, they are giving away bottled water for free, which compared to competitors might be worth 50c or $50, but they think it is worth losing that for the ad.
Ok, your explanation makes sense, but I don't see how what you said relates to the single qualifier presented: being intrinsically valuable. And gp states it as if it were an accepted rule of thumb for all economics, not specific to the internet.
It is. Capitalism relies on the basic system of supply and demand. When the marginal cost of supply is zero, the only proper price is zero - anything else isn't pure capitalism. And since free would mean no incentives, we deal with it using an even worse hack: ownership.
A better way to deal with this is to discover what the intrinsic value for the product is, allow the creator to give it away, and then subsidize the creator the value they generated.
The nice thing with this system is that we can transition to it really easily. There are already many Open Source projects that exist; all we need to do is earmark a certain sized pot, figure out the weights of existing products, and hand it out. And as the economy gets stronger, allow the pot to grow, and eventually there might no longer be such a thing as closed software, as open software will always be able to generate more value than closed.
There exist grant programs at most institutions designed to support such work, and precisely for that reason. Sadly, writing grant applications is a job in and of itself. Consider the case of Justine Tunney, the creator of redbean. If anyone deserves a grant or three, it's her. But she's too busy making cool shit to a) research which grants are available, and b) satisfy their onerous application requirements.
Obviously the problem is that those in charge of giving away money are very passive, and in fact put up large barriers (the onerous application process) for anyone wanting the money. This, to me, is absurd. The admins of such a fund should themselves be active in one or more areas of interest, such that they themselves should be approaching people like Justine and telling her, "Here, please take my money."
I don't know much about grants, but like most things that are broken I suspect there are perverse incentives here. For example, someone is probably investing that grant money, and if they give it away the money leaves the market, and that will make the admins sad. (I don't know if that's the case, but that's the kind of thing I would expect to see.)
Oh, another surprisingly effective thing people are doing is subscriptions to people you like. You see this particularly on Twitch, and sometimes on YouTube. People can and will give money to the creators they like! It's kind of amazing to me. I suppose the analog for people like Justine is Patreon. It's fascinating though that people will happily subscribe to an entertaining personality, but will happily ignore someone who toils at making incredible things behind closed doors (but still use their tools). The obvious solution is that all open source projects should become a source of entertainment.
Yep, you have outlined the problems with grants. As they look forward, and do so in a one-time lump sum, they hold considerable risk, and thus the providers of the grant money need to ensure they pick correctly - but this just leaves us with a centrally controlled economy and bureaucratic nightmare.
Instead, you should seek to discover the intrinsic value as it is occurring, so there is no risk to the granters. I believe it is possible to do so through a modified Vickrey auction. A Vickrey auction is one in which the winner pays the price the second-place bidder offered. In this modified system, the top 90% (or some other optimizing number we can work out) of bidders win, and they all pay the price of the highest non-winner. A certain subset of users of the open source software would be asked to give their true value for some subset of open source products as they use them, and using statistics we could extrapolate that to the rest of the population.
We would then use these weights to give away all grant money. This might mean that Firefox can be funded exclusively through this pot, and they would no longer be beholden to Google for default Search. It would mean the grant writing process would go away. It would mean, as ads make products less valuable, that ads go away.
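To make the mechanism concrete, here is a rough sketch of the modified Vickrey auction described above (all numbers are made up for illustration; this is a shape, not a worked-out design):

    # Sketch of the modified Vickrey auction: sampled users state their "true value"
    # for a project, the top 90% of bidders win, and each pays the highest losing bid.
    def modified_vickrey(bids, winner_fraction=0.9):
        ranked = sorted(bids, reverse=True)
        n_winners = int(len(ranked) * winner_fraction)
        if n_winners == 0 or n_winners == len(ranked):
            return ranked, 0.0                  # degenerate case: no losing bid to set the price
        winners = ranked[:n_winners]
        clearing_price = ranked[n_winners]      # highest non-winning bid
        return winners, clearing_price

    # Example: ten sampled users state what the project is worth to them per month.
    bids = [20, 15, 12, 10, 9, 8, 7, 5, 3, 1]
    winners, price = modified_vickrey(bids)
    print(len(winners), "winners each pay", price)   # 9 winners each pay 1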
So, fund people based on github stars. I could think of worse solutions, honestly!
Oh, something else I remembered: the Institute for Advanced Study in Princeton! This is an interesting idea. Basically they tell a scientist "You've changed human history, the least we can do is give you an income for the rest of your life that has no strings attached. You can teach, or not, research, or not." It would be cool if each of FAANG had one, or they pooled together to make one, or had a virtual distributed version. "Dear Fabrice Bellard, we will be depositing 10k euros a month into your account until you die. Yours, Google."
I would also add that human attention is inherently scarce. Advertising businesses like google and facebook leverage this fact.
Google provides unlimited searches to users without charge. On the other side, google provides human attention to advertisers, and charges for it due to the scarce nature of attention. Sort of like a marketplace business.
I don't think artificial scarcity makes sense as a mental model here. A better way to make money is probably to just find some way where people need you. When you contribute your work to the commons through open source, it earns you a lot of love and admiration. However people don't need you. It's kind of the whole point. The biggest technological services in our society that are the most crucially needed, were all built on open source. Since the whole point of open source is to give you the power and opportunity to build things you can control yourself.
The people who build stuff with open source are very smart, so they usually don't need foss devs as consultants to explain to them how to do it. I make some of the most popular projects on this website. Been doing it for years. I get plenty of donations because my work makes people so happy. But no one has ever offered to pay me for something in return, since I don't have anything with any economic value, other than me myself. So I get plenty of job offers from people who would love to be able to say I'm their employee. Since controlling people is about as valuable as controlling the services people need.
It'd be nice if our cultural mythology about earning your keep doing an honest day's work through fair trade was how the system worked, rather than me needing to depend on the gift economy that's funded by control. But I think there was just so much abuse of the classic models of human cooperation that they just couldn't transition to the digital era, and as such we need to find a way to adapt.
IMHO only, as an idealistic aside: It would be lovely to see a society in which programming was done as civil service, in a program similar to Americorps or Peace Corps or what the old CCC did: provide public infrastructure using public funds.
Programming reminds me of other professions with high inputs but low per-unit-cost outputs: teaching, music, movies, art, journalism, etc. -- basically anything that is create-once, share-often. All of those (for the most part) are things that America's shortsightedly capitalist economy fails to adequately incentivize/reward unless you happen to become a celebrity.
There is already, and will continue to be, a class of human labor output that is intrinsically valuable but which our economy is unable to adequately price. I'd argue that's more an issue with the speculation-driven economy that we have than with the labor in question. It'll only get more drastic as we automate more and more and further amplify human creativity.
It'd be cool to see a nonprofit search engine/email service/office suite/whatever funded in a similar style as NPR, but of course that'd run into political issues at every level.
In a utopian cyberpunk future, what if there were multiple voluntary nonprofit "shadow governments" that you can choose to tithe every month, almost like churches? You can choose between e-governments red, blue, green, yellow, purple, gray, whatever... give them 1% of your income a year in exchange for a suite of services run and staffed by professionals who are salaried but own no equity; they work as a form of civil service, not a wealth-building scheme. Syndicates for the public good, I suppose. Lol, in all the video games, these usually turn into private military companies with killer androids, but what if they just, uh, provided really good email (and automatic online driver's license renewals) instead?
Such systems would probably never be able to attract the best talent (unless they turn into something like Mozilla, which is a big enterprise masquerading as a nonprofit), but you often don't NEED the best. Wikipedia, the sum total of human knowledge, is also the sum total of human mediocrity (with plenty of redundant efforts, infighting, unoptimized problems, etc.). But having stable collaborative communities is something that is critical for producing works of intrinsic value, and developing slow-trickle funding streams for such communities -- the kind that can sustain without turning them into potential get-rich-quick schemes, or subject them to violent boom-bust cycles -- is what allows them to both keep working effectively, AND keeps away the exploitative get-rich-quick types looking to subvert public labor for personal gain.
We need a funding model that provides enough income to attract people who want to do it for the public good, but not so much that it also attracts people who want to turn it into personal wealth. In high-input, low-per-unit-cost services, optimizing for maximum (as opposed to sufficient) monetization will too often mean that the users themselves become the product, as we see time and again with Google, Facebook, Twitter, and basically the entire modern online economy. It's the difference between Amazon and your local library, Oracle and Postgres, EvilOS and Linux, etc.
Things don't have to be artificially scarce if we can learn to ask, "How can I make this as widely accessible as possible while ensuring I have my own basic needs covered?" instead of "How can I make this as monetarily valuable as possible?"
There are always going to be people who want to just do things to make the world a better place, just as surely as there are always going to be people who want to optimize for personal profit. We have every system to fund the latter, but not so much the former right now. Is the for-profit business model the only, or best, way?
Perhaps a better approach would be building an open source www index or even a full current cache - as an enabler for people to build their own search engines?
Right now it is extremely difficult to build your own web crawler that would compete with Google. And that is not because of the technology, but because many sites will prevent your bot from accessing them if you're not Google or Bing - either through robots.txt, or by directly banning your IP if it's trying to crawl and isn't a confirmed Googlebot.
Having a non-profit, open source, crawler that keeps an up to date index (or web cache) of the web would help competition spring up.
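To make the robots.txt part of that concrete, here's a minimal sketch (with a placeholder URL) of what a small crawler runs into before it has fetched a single page - many sites whitelist Googlebot and disallow everyone else:

    # Minimal illustration of the robots.txt problem for small crawlers.
    # The URL is a placeholder; imagine a site that allows Googlebot but disallows unknown bots.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    for agent in ("Googlebot", "MyTinyCrawler"):
        allowed = rp.can_fetch(agent, "https://example.com/some-article")
        print(agent, "allowed:", allowed)

A shared, openly licensed index or cache would let new engines skip that fight entirely.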
Isn't that more of a data set for ML and other research purposes than a highly up-to-date search index (for example, with news from a few minutes ago)?
Roughly speaking, yep - Common Crawl provides a sizable chunk of web data (420 TiB uncompressed, over 3 billion unique URLs, as of May 2022; historic statistics here[1]), and is updated on a monthly basis. Not near-real-time, true, albeit relatively fresh.
A question to ask could be: how often do users care about information from a few minutes ago, compared to information that has been available for a longer duration of time?
I mean, any time someone wants information on current or recent events is your use case right there. If you exclude news entirely, you could maybe disregard recent websites but I imagine that's statistically a pretty large portion of search.
I built something with an API that uses Selenium to image a site. It works on a large percentage of the sites I feed it.
I don't recursively call links found in the pages. I expect the user to give me the URLs to crawl and save.
In order to "find" new content, I let the user specify where they want to search for things the engine hasn't "crawled" yet. So, a search for scooters to buy might end up searching Amazon directly, then lets the user "save" the site by passing the Amazon URL for the scooter they like to the system for imaging.
I use GPT-3 or other ML models to do some of the heavy lifting for adding labels to the pages or documents the user uploads.
This ends up being a "curated" list of documents important to the individual user, not an exhaustive crawl of all things which are important to all users.
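A minimal sketch of the imaging step described above (assuming Selenium with a local Chrome driver; this is illustrative, not the actual code): fetch the user-supplied URL, save a screenshot, and pull out the text for later labelling.

    # Rough sketch: image a user-supplied URL with Selenium and grab its text.
    # Assumes Selenium 4 and a local Chrome/chromedriver; the real system differs.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    def snapshot(url, out_path="page.png"):
        opts = Options()
        opts.add_argument("--headless")              # no visible browser window
        driver = webdriver.Chrome(options=opts)
        try:
            driver.get(url)
            driver.save_screenshot(out_path)         # the "image" of the page for the user's archive
            text = driver.find_element("tag name", "body").text   # raw text, later labelled by an ML model
        finally:
            driver.quit()
        return out_path, text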
> Perhaps a better approach would be building an open source www index or even a full current cache - as an enabler for people to build their own search engines?
That's an excellent idea! In the spirit of open data - people can do with it what they want.
I think this is a great idea. How does this work with copyright? Search engines seem to be able to download and reproduce content from scraped pages (and wrap it in ads, and derive content from it); this is called "indexing" when they do it, but scraping when everyone else does it.
E.g. on Google if you search for "how to tie a tie", a little info box may pop up with step by step instructions. This content is taken from some website, but that website gets no page hits or ad revenue. Instead, Google gets to serve ads on the search engine results page.
(I don't know if this happens for this specific example, but Google does this for some searches)
Part of why sites participate in the infobox program is that in practice you do get quite a lot of hits from it: many people click through to see the answer in context.
I think they're referring to how Google "extracts" answers from your website and shows it on the search results page. Effectively meaning that the user doesn't even need to go to your site to get the answer, because Google extracted it and gave it to them directly.
It seems to me that what they usually extract is some junk only vaguely related to the query and often cut apart and reassembled in a way that's just wrong.
I'm curious if you think a co-op would be feasible between Mwmbl and other like-minded crawlers who are interested in taking a divide-and-conquer approach to crawling the web.
Disclaimer: I work for Google, though far away from Search.
Regardless of search engine design, there's HUGE money in SEO. Any successful search engine will be gamed. Do you have the developer power to go red-queen against all the large companies in the world?
For clarification, “red queen” means a conflict between two or more entities where the cost of engagement grows, but the relative advantage does not change.
Simply put, search engines have been at war with SEO for over 30 years, which has significantly raised the bar not only for being a search engine, but for producing content - not to mention knowing how to search for information. With the introduction of machine-generated content, information wars between countries, global dependence on online commerce and information, etc., the speed of change shows no signs of letting up.
In my opinion, for the average person, knowing how to search for information is the real issue, not that the quality of information available has become worse or that Google has become a worse search engine. If anything, Google has reduced its advanced search capabilities not for financial gain, but because the average user is just too lazy to learn how to search and keep up with the changes required to remain an advanced searcher.
To your last point, I don’t quite agree. Google’s incentives are misaligned such that keeping you on Google.com just a little longer is better than not because you are more likely to click on an ad.
But also yes the users and the UI both fail. When I used to search for something I would type in something like “gutter clog clean” but slowly started noticing that Google likes longer sentences like “how do I clean a clog in my gutters?”. In pursuit of making Knowsmore (from Ralph Breaks the Internet), Google lost the power user features. Search would be infinitely better if they actually fucking respected literal mode and stopped trying to treat me like an idiot with no attention span. Having search results that contain one out of like 8 words in my query and asking me if I want to include others and then when I say I do still showing me results without them is broken UI and not a user problem.
Google search, from very early on, considered it a success metric when users went quickly to a result. I have no idea how that factors into the current surely hideously complex ranking algorithms, though.
As for the parsing of queries, that's probably based on how most users use search. Not everyone is familiar with keyword-based search. I expect they've done tons of A/B tests to determine what kind of query interpretation makes most users get better results. We're just not "most users".
>Search would be infinitely better if they actually fucking respected literal mode and stopped trying to treat me like an idiot with no attention span. Having search results that contain one out of like 8 words in my query and asking me if I want to include others and then when I say I do still showing me results without them is broken UI and not a user problem.
Agreed - verbatim is likely the answer in this specific situation, though I did not point it out since they clearly think they understand how to search; verbatim has been an option for as long as Google has offered non-verbatim search.
Beyond that, complaining Google does not do XYZ misses the point. Google is a search engine designed for the average user and the average user does not want verbatim search. They also do not want: advanced search operators, true Boolean search, regular expressions, API access to search, open source code, real-time streams of pages Google’s crawling, etc.
What they do want, and always have, is natural-language searches in their language of preference, with clarifying responses from the search engine in natural language; that is, they want to treat a search engine like a person and be treated like a person. Which is why it was odd that they referenced Knowsmore, since Knowsmore [1] used keyword-based searches, not plain-language searches.
Google is not the primary problem, the average user is the issue. Unless people realize that — they’re fighting in a war they do not even understand.
To make it even more clear: Google is easily able to detect and block users who block ads, but they do not. More than 60% of users still don't block ads - not because they love ads, but because the effort to figure it out simply is not worth it to them, or they don't mind ads, etc.
>What they do want, and always have, is natural-language searches in their language of preference, with clarifying responses from the search engine in natural language; that is, they want to treat a search engine like a person and be treated like a person
I agree with you but Google is not yet at that point where it can act and serve people like an Answer Machine that knows everything; both the people's preferences and the perfect answers.
>Google is not the primary problem, the average user is the issue. Unless people realize that — they’re fighting in a war they do not even understand.
Again, I agree that casual users are the problem, but how can we help them? This is The Innovator's Dilemma[0], where if we ask casual users what new stuff they want from Google Search, they will answer "nothing". Because even they themselves don't know how their UX can or should be improved, and on top of that they are satisfied with Google's mediocrity. They would just respond "Google is Google".
>Beyond that, complaining Google does not do XYZ misses the point. Google is a search engine designed for the average user and the average user does not want verbatim search. They also do not want: advanced search operators, true Boolean search, regular expressions, API access to search, open source code, real-time streams of pages Google’s crawling, etc.
The complexity of constructing "complex" search queries needs to be reduced so casual users can use such features and queries.
>To your last point, I don’t quite agree. Google’s incentives are misaligned such that keeping you on Google.com just a little longer is better than not because you are more likely to click on an ad.
I agree, which leads me to the conclusion that subscription is the best way to avoid this conflict of interest. Unfortunately, most of the world won't subscribe to a search engine, and doesn't seem to mind ads - to a degree. With Google looking more and more like AltaVista before its demise (to Google), my conclusion is that Google will strangle itself out of existence and give way for the next "new, streamlined, not-full-of-ads" competitor.
In the 20-30 searches that I do in a day, I still have to google about half of them. Either because it's stuff Google does well (currency conversion, for example), or Kagi just doesn't get what I'm trying to search.
I remember starting out with the Internet searching on Altavista and Yahoo and Lycos. The information that was present was nowhere near what it is now, and searching was more "exploratory". Nowadays people just kind of know what they want and just want to quickly get there.
> In the 20-30 searches that I do in a day, I still have to google about half of them. Either because it's stuff Google does well (currency conversion, for example), or Kagi just doesn't get what I'm trying to search.
Currency conversion is not technically a search. It is question answering, and Kagi's capabilities there are still being built - Google only has a 20-year head start. Can you report all such cases to kagifeedback.org so they are on our radar?
Thanks, currency conversion was an example off the top of my head only. I am active on orionfeedback and kagifeedback, I find that they're really prompt and effective in answering to feedback.
The other examples are a bit harder to describe and I can't quite describe how Google gets it right. I think I might need more time to describe it out, as it involves search in another language.
Currency conversion is nothing you have to sell yourself to Google for. Just bookmark a bank, a financial or an academic research site that seems trustworthy. I have used the same ones for over 20 years, probably found them using Altavista at the time...
What about one funded by universities or libraries as a research project?
There have been lots of no ads (for now) attempts. DDG had like one small ad at one point. But people didn’t leave in droves. It’s almost like people are ok with ads.
Allow users to blacklist sites.
Share blacklisted sites.
Have the option of instead of hiding blacklisted sites entirely - show them in a different column.
99%+ of people in the medical industry did not sanitize hands or equipment 100 years ago.
99%+ of people in the tech industry currently do not care to do the extra steps required for data neutrality, and privacy.
99%+ of people are lazy to the point of harming themselves and others.
1%- of people examine how the 99%+ do things and pioneer harm reduction tactics in spite of everyone constantly reminding them that no one wants their help.
Why should we care about 99% of people? They are people who upload all their personal data to social networks agreeing to "we can do whatever we want" terms, they pay with a card, use Apple's and Microsoft's spyware (which marketing people call "telemetry") ridden operating systems, and install Chrome that sends a signal to Google every time they open a new tab and sends data about every form on every website they visit (which Google developers call "crowdsourcing" in the code [1]).
Make a customizable and privacy-respecting search for us, power users.
Also, I have noticed that Firefox's internal search works well for sites you have visited. So when I want to visit a page I have seen earlier, I can go straight to it, skipping Google.
Also, you can click on any search box and add it with a prefix, so that you can search MDN or Wikipedia directly, again, without informing Google.
A quick, entirely unrepresentative look at the user count for uBlock Origin and uBlacklist in Firefox make me somewhat less optimistic than you are. Or is there a more popular way of blocking sites from search results than uBlacklist out there, which I simply don't know about?
More and more search engines are now giving you the option to customize search results. Brave has Goggles, You.com has thumbs up / down icons, and other alternative search engines have similar capabilities too. I have been really enjoying the ability to tailor my search to how I like it and e.g. rank reddit (although it seems like it is not liked by some folks here) higher.
It could work like spam filtering. Once a certain number of users have marked a site as spammy (or put it on their blacklist), it gets downranked for everyone. But jstummbillig is right. It could easily get gamed.
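Roughly what I have in mind, as a sketch (the threshold and penalty below are made up): a domain only gets globally downranked once enough distinct users have blacklisted it, which is also exactly where the gaming risk lives.

    # Sketch of spam-filter-style downranking driven by user blacklists.
    # Threshold and penalty are invented numbers, purely for illustration.
    from collections import defaultdict

    reports = defaultdict(set)            # domain -> set of user ids who blacklisted it

    def blacklist(user_id, domain):
        reports[domain].add(user_id)      # counting distinct users resists one user spamming reports

    def adjusted_score(base_score, domain, threshold=50, penalty=0.25):
        # Downrank a domain for everyone once enough users have blacklisted it.
        if len(reports[domain]) >= threshold:
            return base_score * penalty
        return base_score

The hard part is deciding which reports to trust, since sock-puppet accounts can hit any fixed threshold.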
I love how this is a problem that a huge number of people have worked on for around 20 years and have thrown stupid amounts of money at and people just jump in and boldly proclaim that they have the answers that just popped into their mind. Peak HackerNews.
I don’t think these ideas just popped into his mind. They’ve been discussed for a long time on HN, and some alternative search engines have started to incorporate them.
Google has actually followed a similar model for about ten years now, with something like 10,000 temp workers following a 172-page manual, essentially as the basis to train a lot of their ranking models.
Gmail's spam filter is basically community trained by people clicking spam/not spam (at least, that was what they said many years ago, it might have silently changed). What's different about search?
That’s the same question as “how do you ensure complete trust” which is, of course, not possible. That doesn’t mean that “distributed trust-based blacklisting” still isn’t better than what Google is offering today, which is nothing.
Some people couldn't find their butt with both hands and a map. Best not to take advice from them. Unless you find yourself surrounded. In which case smile and nod.
I think Google search is using the wrong approach. When I'm looking for e.g. a new camera, I want to use my network which I trust. E.g. I want to ask "what camera would HN recommend?" We should think more about how we can use trust as a basis for how we explore the internet.
Given that enough people use this heuristic, there will be companies focusing on earning karma on HN, writing comments and voting for products they are getting paid for.
While true, it is also not a given that you'd need to trust _all_ of HN. I visit sufficiently regularly that I see some people post and recognize their username. I often think -- I wish I could _follow_ them. Not so much of a stretch to think of rings of trust built around particular users. Bringing people in and kicking them out of these trust circles could play a role. PageRank -> TrustRank? Of course it would also be only one metric, among many possible trust rankings and many possible other signals and settings.
I bet the (niche) product as such wouldn't be as hard to build as it would be to scale. Imagine every user constantly tweaking (directly or indirectly) their search result settings, and having that impact millions (or more) indexed items, for every user.
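One way to picture a TrustRank like that: run something like personalized PageRank over the follow/endorsement graph, seeded only on the accounts you personally trust. Everything in this sketch (the graph, damping, iteration count) is made up just to show the shape of the idea:

    # Toy personalized-PageRank-style trust propagation over a follow graph.
    follows = {                      # who endorses whom (all hypothetical)
        "me":    ["alice", "bob"],
        "alice": ["carol"],
        "bob":   ["carol", "dave"],
        "carol": [],
        "dave":  [],
    }

    def trust_rank(graph, seeds, damping=0.85, iters=30):
        trust = {u: (1.0 if u in seeds else 0.0) for u in graph}
        for _ in range(iters):
            nxt = {u: (1 - damping) * (1.0 if u in seeds else 0.0) for u in graph}
            for u, outs in graph.items():
                for v in outs:
                    nxt[v] += damping * trust[u] / max(len(outs), 1)
            trust = nxt
        return trust

    print(trust_rank(follows, seeds={"me"}))   # carol ends up more trusted than dave

Search results could then be weighted by the trust of whoever endorsed them, which is exactly the per-user scaling problem mentioned above.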
Wouldn't the world become a little bit better if they did? Earning karma on HN is not the easiest thing. I would hate to speak for all of us but collectively, don't you think we have a pretty good marketspeak alert system here?
I think you are kind of ignoring the issue that parent brought up.
The hn machine is not all that smart and can easily be gamed, and would, if the stakes were to become high enough. Farming hn karma by making pointed statements on crowd favorites (parent named two, oss licensing or privacy also spring to mind) gets you your 10k, no originality or honesty required, in no time.
What's protecting hn is a lot of moderation + relative irrelevance. If those 10k were to systematically bring you enough eyes (by driving search results), you are in effect printing money. There is no reason to assume the number of people doing it would not scale with the return attached to doing it.
Anything (points/karma/coins) which is free & unlimited will find a way to get exploited. "What if" there's a barter system for karma/points: you do a +1 and get a -1.
Though I do not know how the initial allotment of karma/points could be distributed for the pioneers and for the new, growing community - maybe allot 'n' points for a new user after a year...
Trust can work both ways. They'd have to be really careful to not lose the earned trust. On the way, they would have to write a lot of high quality HN posts. I'd say this model is a win over current spamming and fake reviewing practices.
The other side of them having to write a lot of high quality HN posts is me having to read and evaluate more HN posts trying to game me. That is work, and if I sense a lot of it (like I did on Reddit), I will leave, and I suspect others would too.
Playing defense is exhausting when playing offense is extremely cheap.
Personally I don't ask my network or friends for opinions on such things, because people tend to have a positive bias towards things they have invested money in.
Some approach like this could still work, but it’s incredibly hard to maintain/define the right “network”, and across different domains. (E.g. HN probably not so good a community for latest fashions or sports trivia)
I put your search into SearX with a lot of the major engines enabled (Google, Bing, Qwant, Brave, DDG, etc.). Arguably, Google did a better job giving me HN results (but it's a very small sample set).
Edit: if you're not familiar with SearX, everyone who visits will get a slightly different result based on dynamic results. Even if the same person refreshes a few times the exact results and ordering will vary; I've learned to try the same search a few times to get better results, it's just a quirk of how it works and how each remote engine reacts at that given instant.
Google's end goal is an Answer Machine, but we are decades away from that. An Answer Machine would be something like God-like software which gives you the ultimate answer to your query, and that answer would be 100% accurate, true and personalized/suited for you. Again a blackbox solution, but it will be so advanced with the help of AI that it would be trustworthy by default/design. Hard to achieve, but looking at Moore's Law and the advances in AI it will eventually be achieved.
> Do you have the developer power to go red-queen against all the large companies in the world?
Let's try? It's also an interesting research topic in itself and might be a topic for academic research. At the moment Google is a black box and their incentives are not really aligned to stop SEO. It's good for them to show more ads, it's good for them to show you copycat pages of github/stackoverflow with ads. Not saying that Google is doing this on purpose - I doubt it - but we don't know. It's surely possible to create an index and ranking that prefers different things than Google.
Let's try something at least. It will be gamed, will be worse probably but it's open and can be a playground for academic research.
The best that can happen is that there turn out to be ways to get a better ranking and Google was being dishonest to maximize profit. If it's still gamed, at least the mechanics can be studied and analysed, and maybe someone can figure out a switch like 'be unfair in ways Google can't' - I would love a 'no ads on the page' switch; that would probably solve quite a few problems.
"Trust our black box" and "you won't have enough devs for this" are just bad answers to such an important problem. The amount of stupid simple redirection spam in my results over the last few years also makes it look like Google just doesn't care a lot about this anymore.
I think this is mostly a problem if you are in Google's position, near total market dominance.
Not only do you exert massive selection pressure on the shape of websites; the SEO spammers don't need to be good or know what they are doing, they just need to be lucky once. If they get it right, they float to the top, and can iterate on that design. This effectively means that no matter how secret or smart your algorithms are, it doesn't matter if you're in Google's position. The numbers are stacked against you.
To make matters worse, any company with that sort of a market share has serious handcuffs in how heavy handed and "unfair" you can be without risking litigation for anti-competitive practices.
I think the best thing that could happen for Google is ironically serious competition in the search market. This would help both problems at once.
Only because it's profitable for them to allow it to be gamed - like all the spam sites now when you search for SO; Google allows them to be ranked because they're filled with Google ads. But it'd be trivial to just delist them all; that'd be beneficial for the user but not the search engine.
It's not a matter of 'developer power' - just flip a boolean somewhere and delist the site.
Actually, it is entirely possible to create search criteria that cannot be gamed. That is what Google Research's various arms ought to be working on. But we all know it is far more profitable to have a manipulable system.
Could you expand on this a bit more? I don't see any obvious way to develop non-manipulable search criteria, I was suspecting that there might even be an impossibility theorem about this (which would depend a lot on the exact formulation, though), and I'd like to know what you have in mind.
Because of the filter bubble highlighted by Eli Pariser, there are massive opportunities for SEO companies to trick their customers into thinking they have gotten high up in the search results, when all they are seeing is their own filter bubble!
And that money goes to support many industries beyond search itself. The author really needs to get off a computer for a minute and understand the economics of the web as it stands today, and how the free, ad-supported model supports millions of people's livelihoods, before jumping to "This is all bullshit".
While I agree that they may need to better consider the economics - both of the engine and of the websites that may SEO it - that doesn't mean we should just assume ad-supported is the way to go, or the only way to support people. The economy of today looks different than 20 years ago, or 20 years before that. Doesn't mean we shouldn't grow and change.
I respect that. I guess my point is "Grow and change with a decent understanding of what the present state enables"
A search engine like Google isn't just a search engine as the author describes it. It is a very integral part of the economy of the internet and just labelling a simplistic interpretation of the present state as "Evil" with an academically poor write up of what a viable alternative is does little good.
I want to write articles and read articles written for me by others. Ideally as few as possible should profit from this process.
Google is now a turd: not just no longer capable of delivering this service, but actively destroying the good part of the web by refusing to index it.
It is my attention, it doesn't belong to anyone else.
My access to information and educated opinion is a far more integral part of the greater economy.
Google is like a screaming man at a town meeting making sure no one else can get a word in. The meeting is now pointless.
Google is a catalogue. There is no physical analogy that comes close.
Your point about wanting to read articles written for you by others is certainly possible. The very fact that such a desirable outcome drives you to Google and nowhere else should suggest the complexity of the problem they’re solving and how there isn’t really anything else out there doing so well.
>Ideally as few as possible should profit from this process.
Why?
When you accepted a job offer in the software industry, did you stipulate that your mission is to write code for your employer and you will be charging as little as possible for that privilege? Minimum wage should get you by just fine, right?
I hate fully grown adults behaving as though anyone except them making a profit is somehow evil.
> The very fact that such a desirable outcome drives you to Google and nowhere else should suggest the complexity of the problem they’re solving and how there isn’t really anything else out there doing so well.
Yes and I'm not impressed.
> > Ideally as few as possible should profit from this process.
> Why?
> When you accepted a job offer in the software industry, did you stipulate that your mission is to write code for your employer and you will be charging as little as possible for that privilege? Minimum wage should get you by just fine, right?
I'm not a good example as I indeed live wonderfully on minimum wage and write software for free.
> I hate fully grown adults behaving as though anyone except them making a profit is somehow evil.
Don't worry, my philosophy is not that superficial. We have people who make things, people who organize the making of things and people who organize the things made.
It can be true that the metadata is more valuable than the data itself, and organizing an effort can be much more intense than any of the tasks involved. But let's not pretend that is always the case.
Before money and before the written word we had the exchange of thoughts, observations and ideas. I believe this to be somewhat like the foundation on which everything else we did is built. I want to see this process benefit from technology.
You wrote your comment perhaps a bit limited by the ropes of the platform, but sincerely, free from any agenda; you wrote pretty much what you think.
Now if we [beyond HN] add additional layers of agendas between our exchange, each interested in maximizing their profit from it, then perhaps not you, but many others, will resort to self-moderation.
You won't be able to state it simply like: "I hate fully grown adults behaving as though anyone except them making a profit is somehow evil."
It could become something like "I don't understand why some people don't like others making money", stripped of how strongly you feel about the subject. You could also choose not to say anything.
At that point we are messing with the very fabric of our collective reality.
If I had to choose between freely communicating and the economy, it wouldn't be a hard choice.
Lying in a privacy policy and breaching privacy-related regulations can also be fraud and illegal. Think about why there's so much pushback against the GDPR despite it only primarily mandating transparency with regards to data usage (if they were doing things above-board why would they be afraid?).
> Instead of looking at how long people spend on a site, we would encourage users to give explicit feedback on rankings and use this to improve our ranking system.
While they're not wrong that the way Google determines ranking has its issues, this approach has its own set of problems. If you explicitly use user ratings as part of your rankings in some way, people can punish sites they don't like, a la review bombing on Yelp, Steam, etc.
Not saying it's necessarily a bad idea because of that, but I hope they don't fall victim to the mentality of, "let's just trust the users" as an ironclad rule, because that doesn't always work out well.
Bombing is one thing, but you also have a whole SEO industry now that will exploit any way possible to get to the top of the rankings.
The moment you have community rankings on search, and your search gets popular, you land in a war zone with bots trying to mangle those. Reddit is kind of good dealing with that, but it is very resource intensive.
What if you limit accounts to real people and then keep track of their credibility? It's some initial effort but how could the ranking be manipulated when all dishonest people have burned their credibility?
I've always wondered if you could combat this via referral-only sites. To get in, someone needs to attest to your humanness from their own account. If an account is found to invite too many spammers, robots, or otherwise, it is banned or disallowed from inviting more accounts.
I'm sure you could still manage to make "fake" accounts but it would be much more difficult, and linking them together would be much easier.
Of course starting a site like this would be very difficult. But maybe you could start without it then add it in once you get to a decent popularity such that many people can find a referral if they need to.
Of course it is a much smaller site so it isn't clear how effective this strategy would be at a large scale. Even a referral-based site approaching HN levels would be very interesting to see.
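A sketch of the accountability part (all structures and thresholds invented, just to show the shape): keep the invite tree, and once too many of someone's invitees get flagged, revoke their invite rights or ban them.

    # Toy invite-tree accountability: inviters are responsible for who they bring in.
    invited_by = {}                 # new_account -> inviter
    flagged = set()                 # accounts flagged as spammers/bots

    def invite(inviter, new_account):
        invited_by[new_account] = inviter

    def flag(account):
        flagged.add(account)

    def inviter_is_banned(inviter, max_bad_invites=3):
        bad = sum(1 for acct, inv in invited_by.items() if inv == inviter and acct in flagged)
        return bad >= max_bad_invites   # too many bad invitees: ban, or at least revoke invite rights

Linking fake accounts together also becomes easier, because they tend to share an ancestor in the tree.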
In China, all social accounts must be associated with a phone number, and phone numbers are tied to government identities. It doesn't stop any manipulation of scores and rankings.
> then keep track of their credibility?
It is very likely China will do that too soon. I think you can already imagine the ramifications.
You can’t limit to real people. If you managed to, I would make a service where people can sign up and I’ll pay to use their account. No cost to them and they make some money from it, seems totally reasonable.
That's just another group of people with reduced credibility. At some point, the price to incentivize the next person to offer their account is more expensive than the benefits of link manipulation. Every corrupted account can be discovered because the manipulated content stands out and will be reported.
I went back to Reddit a week back after a year or so, and what I saw on the main feed was just disgusting. One click, a bit of scroll, and I was watching people calling Republicans terrorists and being praised/upvoted/gifted. It's not just about bullshit at the top, it's also about echo chambers and the online equivalent of mob-behavior.
> it's also about echo chambers and the online equivalent of mob-behavior.
This particular example might be about the Republican party-wide support of echo chambers and offline mob behavior that led to an invasion of the building containing politicians certifying the vote of a newly elected leader and fueling disinformation about elections to weaken the trust and integrity of the system?
Regardless of this particular example, there has always been a pretty strong bias towards one side of US (and Australian, for that matter) partisan politics on the front page of Reddit; with over-the-top accusations towards the Republican Party as well as the LNP both receiving thousands of upvotes despite their poor quality.
I use DuckDuckGo. Recently someone showed me their screen where they were using Google to do a search. I was absolutely aghast. The last time I used Google, when you searched for something you saw a simple text list of sites (which is how DDG still works). Instead the Google results were… a disaster. You had to scroll through so much garbage before finding actual search results - a list of sites. It was like Google was saying, "here, look at all this trash instead of clicking a link and going to a different site". When did Google become so bad?
Yes? Offering ad space (and placing affiliate links) is how they make money.
Difference to Google is how they position themselves in regards to privacy, and that Google actually built a search engine. Both make their money by providing ad space.
Bing still refuses to index one of my pages, telling me to follow their rules. They won't tell me what rule I'm in violation of, though, and I can't tell that I'm in violation of any of them.
And this is educational content, text only, no ads or popups, no SEO hacking. Bing's analysis tool told me only that I was missing the "lang" attribute from my HTML tag. So I added it, but of course that wasn't the issue.
I reached out to them, and they replied saying that the page didn't meet the requirements for listing, but didn't elaborate.
It certainly makes me wonder what content their broken algorithm is missing.
And it sucks because it means DDG is missing that content, too.
No. It's Bing in the same way that Uber is Elasticsearch. It's built on Bing and other tools and adds, tweaks, adjusts, etc. Calling DDG a Bing proxy is somewhere between misleading and dishonest.
Whoogle is best in class, but doesn't provide much benefit unless combined with a rotating VPN. It also doesn't solve the "GBY" problem, where the majority of search engines rely on Google/Bing/Yandex's indices instead of using their own.
This is especially dangerous because it propagates an illusion that there's dozens of engines to choose from. The reality is these three companies control more and more of humanity's ingress to information, censoring what they see fit for political/financial gain.
Funny, I find them to be identical to Google's results localized for Sweden. Are you possibly using Google Search logged in or saving cookies between closing tabs (i.e. not using Cookie AutoDelete)?
I do too. However, I am responsible for a couple of sites that Bing absolutely refuses to even index. Google has no issues with them, not for 10 years running. Those sites are effectively invisible in Bing and therefore DuckDuckGo. I’m not talking about low ranking, the entire domain is ignored.
My own personal website for example is not even listed in the search engine I use. Microsoft support see the issue but have no explanation and escalate and quote “quality requirements” for weeks now. A human review yielded positive results regarding quality. Meanwhile, I see lots of SEO spam on DDG when searching for generic technical terms.
Sure. A very simple one would be https://ipbl.herrbischoff.com, a public blocklist page referencing a resource used by a couple dozen users. The HTML doesn't get a lot simpler than that and is entirely valid markup. It's (unsurprisingly) low ranked in Google but it's there.
Not so in Bing, it's simply not there. The page exists since March 2021. Bing Webmaster Tools reads "Discovered but not crawled. URL cannot appear on Bing", giving no further reason. Also: "Last crawl attempted 01 Feb 2022 at 19:35", which means that Bing did not bother to retry for months, despite me submitting it manually on a regular basis. Clicking the "Live URL" tab results in entirely green checkmarks along with "URL can be indexed by Bing".
Another example would be my personal site: https://herrbischoff.com. Same issue. That one is listed on Google for more than 10 years.
This is fascinating, thanks. Have you experimented with allowing the Bing ad bot that you have blocked, to see if they have some kind of retaliatory non-crawling?
Interesting theory. But the IPBL doesn’t even have a robots.txt and a different, larger site from a German celebrity I host does have the same directives and is indexed, although incompletely.
My working theory is that Bing’s selection algorithm is biased towards large and already popular sites. In the server logs, I don’t see Bing even attempting to crawl the sites I mentioned, except requesting robots.txt and the root page. Bing appears to be excruciatingly slow to update anything but high traffic sites.
Again, Microsoft Support was unable to explain this behavior even after manual, human review found everything to be in order.
I tried deleting robots.txt entirely and got only Chinese crawlers and SEO bots, but still no Bing crawl. All organic traffic comes from blogs linking directly and Google.
Bing constantly prioritizes the content to be indexed that will drive highest users satisfaction. Please follow Bing Webmaster Guidelines to better understand criteria for most valuable content.
Honestly that is probably safer. Having a typo in the URL could easily give you a phishing link. However, I have also gotten phishing ads when looking websites up, so it's not cut and dried at all.
For a good while, if one searched for a (Dutch) government institution or business, Google showed the (free to call) phone number as a clickable link, but the anchor pointed to a paid-per-minute redirecting service. I know plenty of people who found the weird 15-50 euro entry on their bill.
Oh no, it's not. Google's ads have been used to do phishing a lot. And - at least a few years ago, it was extremely difficult to report such ads.
Perhaps it has improved recently, but it used to be a plague in crypto - people getting ads for phishing sites instead of legitimate ones, losing money, and Google being unresponsive to reports.
Honestly I think while this seems absurd if you're relatively knowledgeable about the internet, it's really not something that should be surprising or even particularly shameful? Like, why wouldn't they if it works, is easier to remember, and makes sense to them?
The reasons this is actually potentially bad are pretty deep in the internet-wonk weeds, where you get into questions of gatekeepers and provenance of information, and it shouldn't be surprising most people don't care about those things: those of us who do have failed to provide them with better tools.
On some level it's a little like saying "my dad sent me an email and he didn't use pgp! Can you believe it!??"
I know where DDG gets the results from, but in all the years using it, it has never failed to find what I am looking for. Or, I’ve never needed to check google because DDG didn’t find what I was looking for.
I had a guy in digital marketing tell me that his friend does SEO and works wonders with obscure keywords and shit. That friend is a freelancer and earns a good payday.
When you want to insert your brand into every fucking imaginable keyword, as opposed to reaching people who are genuinely "searching for something" -
why does internet advertising revolve around everyone assuming every person googling something "WANTS TO BUY SOMETHING"?
If every heavily SEO’d result was produced by a company that produced a directly relevant product, I don’t think we’d be as disappointed with the content.
The truly garbage content is produced as cheaply as possible (scraped, generated from a data source, or generated via "AI") to capture advertising revenue, often via subprime advertising networks (or a number of middleman networks).
But to your point, not everyone wants to buy something, and not everyone needs to.
Much of the content out there is simply trying to capture your attention and make you available to some of the worst advertising and ad networks (read scams, lead gen, fake buttons, affiliate crap).
> why does internet advertising revolve around everyone assuming every person googling something "WANTS TO BUY SOMETHING"?
Rather I think it’s because everyone who buys ads has something to sell.
Search ads are “direct action”. You click a link to do something. Ads on eg. TV are more about “brand memory” - reminding you they exist. When you watch tv you’re passively taking in information, but when you’re searching you’re actively trying to click something already. It’s a better fit behaviorally.
One feature that I really wish more search engines would have is the ability to blocklist certain domains, particularly ones whose results are never relevant or helpful to the query itself (Pinterest, Quora, etc). It could even be used as a factor in the site’s search rankings.
I think Kagi does that; I used it while it was in beta. You can also assign a priority to sites, like normal or boost. Kagi is a paid service and doesn't show you ads.
I've been thinking in this same direction. Especially the community-driven part. Google seems to be more interested in what corporations and advertisers want, rather than what users want. With their tendency to crowd-source their AI training, I'm surprised they don't let users vote on search results.
If I were to make a search engine, I'd definitely give users more control over their results. Block crap sites, vote up your favourite sites, vote down questionable sites, maybe different context profiles, because if you're searching for Java in the context of vacation or news events you want different results than if you're searching for it in a programming context.
There's so much that search could do better than what Google is doing, but I'm not doing it because it's way too much work, and it requires serious resources to index everything.
He's doing that thing [1] where he's writing about a thing and presumably wants me - the interested reader - to know more about that thing because it's the thing he's spending all his time on, but he gives zero navigational options to his thing. So as that interested reader, it's down to me to find the name of his thing [Mwmbl] and then (hilariously, given the context!) use a search engine (probably The Evil Google) to find HIS thing.
Seriously, people, if you're writing about anything at all, making assumptions is always a bad idea. If you're writing about a product, make it more than easy to get to it. Provide plentiful CTAs (that's Calls To Action, defined so as not to make the same mistake of assumption) - links, buttons, a big banner at the top ("I'm building a non-profit search engine called Mwmbl! Find out more").
IMO you also need an about page. Why are you doing this? Who are you? Why should the world pay attention? And stick them in your navigation. Overstate your cause rather than assume that people will get it. It's a rad product you're making! Don't undersell yourself! :-)
Honestly, the easy part is building a search engine, like just document retrieval stuff and domain ranking, SEO-mitigation etc. Anyone can build a Google '98 and get it to work well, not that hard, doesn't require all too much hardware. I have done that and got one running out of my living room.
The tricky part, if you want people to use your search engine for more than the novelty factor, and what most Google competitors struggle with is drawing the rest of the damn owl. For example, commercial searches, local businesses, that sort of thing. As much as Google flounders with some queries, the overall package is still really good.
I agree, the whole package is the difference.
I'm trying to use DDG as much as possible, but I live in Italy and for local stuff Google services are still unbeatable.
E.g. Apple Maps, which DDG uses, is a no-go where I live for anything but plain directions. The OSM data used by many map apps is good, and the non-commercial data is sometimes even better than GMaps, but business data is only on GMaps; there's no way around that.
Quite a neat way to crawl websites using a browser extension. That by itself is a form of donation to the search engine. Maybe in the future you can have dedicated software for self-hosted clients that users can run to crawl and index websites for mwmbl? Kinda like folding@home.
How are the batches of URLs to be crawled generated/discovered and posted at your API?
I have also thought that distributed crawling with the help of browser extensions, and/or clients like folding@home, could be a good idea. But how to deal with "spam injections"?
Get 3 people to scrape it and see if there are significant differences.
Some pages will differ, because of A/B testing or news updates, but even an updating news page will still come back broadly similar, and those that don't should probably fall into an exceptions category until it can be determined what can be done about them. Maybe a flag in the URL to request a static page, or just accept that a page changing that often means even faked versions won't last long?
Then I'll just add 3 million bots to the network (or just enough to have about 50%) and I can guarantee to win the A/B test against an honest client most of the time.
It's an arms race, but this is mostly a question of rate limiting account creation, assigning a trustworthiness score to different accounts, some network analysis to detect coordinated accounts, and having some trusted accounts (run by the project) that can help double check results. After an account loads poisoned data, you can detect this after the attack (user reported spam), and then block (or probably shadow ban) the malicious account.
You make it sound easy but companies have been trying to fight this stuff for ages.
You can buy a trustworthy residential IP for low cost, and you can buy them in bulk in the thousands. All of them are real residential IPs from any ISP of your choosing in any country. You can rent Chrome browsers running over those IPs, directed via remote desktop and accessibility protocols (good luck banning that without running afoul of anti-discrimination laws). You can do all that for under $1k a month for something like 1 million clients.
My workplace has been at the other end of DDoS attacks directed by such services, best you can do is ban specific Chrome versions they use but that lasts until they update.
It's an uphill battle that you will lose in the long term if you rely on client trust.
In terms of spam injection (the concern from up thread) I don't think DDoS is relevant. If the core project manages asking clients to process URLs, they'd just IP ban any client that returns too many results. DDoS is a concern for other reasons though.
I think in this specific case, the spammer is on poor footing. The spammer wants to inject specific content, ideally many times. With double processing of URLs, if the spammer controls 50% of the clients, there's still a 50% chance that a simple diff would show the injected spam. The problem is that the spammer needs to do this many times, so their injection becomes statistically apparent. If the spammer can only inject a small number of messages before they are detected, then the cost per injected spam will be quite high. Long-running spam campaigns could eventually be detected by content analysis, so the spammer also needs to rotate content.
Obviously you can play with the numbers, the attacker could try to control >>50% of the clients. The project could process URLs >2x. The project could re-process N% of URLs on trusted hardware, etc. It's not easy by any means, but you can tune the knobs to increase the cost for spammers.
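To make the knob-tuning concrete, a minimal sketch of the comparison step (the function names and the 0.9 similarity threshold are invented, and a real system would need smarter normalisation of dynamic content):

    import difflib
    import random

    def assign(url, clients, redundancy=2):
        # Hand the same URL to several independently chosen clients.
        return random.sample(clients, redundancy)

    def looks_injected(copy_a, copy_b, threshold=0.9):
        # Near-identical copies are fine (A/B tests, rotating ads); big
        # divergences get re-crawled on hardware the project controls.
        similarity = difflib.SequenceMatcher(None, copy_a, copy_b).ratio()
        return similarity < threshold

Raising redundancy or lowering the threshold raises the spammer's cost at the price of more crawling work, which is exactly the trade-off described above.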
> but this is mostly a question of rate limiting account creation, assigning a trustworthiness score to different accounts, some network analysis to detect coordinated accounts, and having some trusted accounts (run by the project) that can help double check results.
Then OP has to do things that don't scale: Review some pages and identify a subset that can be trusted. Then OP can compare their downloads to new accounts and mark the bots.
Then the botnet will just be honest for like a year before it abuses the network. Even better because now honest new clients can be kicked as they disagree with the bot majority. So now the network bleeds users.
Checking which account is honest isn't too hard, you detect that there is a "problematic mismatch" between two clients. So the project runs their own client to check. If one has an exact match, then you'd question the other.
There is a challenge for sites that serve different content based on GeoIP, A/B testing, dynamic content, etc. So some human review of the diff may help check for malice. If there's literally spam, human review would clearly detect this and that bot is distrusted.
Then I'll simply use more bots to get 80% of the network, then I can almost always win any disagreements and your "problematic mismatch" never triggers.
Plus I can now cause you to have to run your own crawler anyway and either slow progress or cost you a lot of money.
Maybe I misunderstand, but doesn't that mean you lose the benefit of having distributed crawlers if everything has to be crawled (again) locally somewhere?
YaCy can do distributed crawling and exchange the indexes (in peer-to-peer mode). I have some nodes that just receive and send indexes without crawling (much less storage intensive).
The "funding options" part has the unsurprising blind spot that, maybe, a search engine is the kind of basic infrastructure that ought to be paid (at least in part) by... The taxpayers ?
I have a long standing bet that, at some point, some company will be "globalized" (operated under some common funding by many different countries, like many research projects or defense organization or aid funds, etc...), and the "search engine" part of google is the prime candidate.
That being said, I'm from Europe, so "sharing the cost of something useful" is not culturally untolerable.
Far fetched and controversial opinion, I know. We'll see.
Considering their goal was to reach £50 a month to upgrade their server… I think any hope at convincing major governments is a few years off.
But I do wonder if the national archives or the library of congress could be a good host for this sort of project. Not sure I agree it should be run by a government… most don’t have great histories when put in charge of gatekeeping access to information.
> The technology for organizing the world’s knowledge should be owned by everyone.
This is what really nails it for me.
There's far too much black box in pretty much every major search engine out there. Maybe it's by design "so that people can't game it". Even so, it's not working very well.
I'm excited for the next 10 years to see what we (humans) come up with to solve the state of the internet, because something's gonna give at some point.
Why though? 99.9% of people don't care, and the internet (Facebook, Instagram, TikTok, YouTube, Gmail, the occasional Wikipedia) works fine for them.
The overwhelming vast majority of people just don’t care about the internet outside those major few silos, so as far as “humanity” is concerned the internet is working as intended.
It pisses me off what’s become of the internet, but I don’t personally see it changing.
You are right about people's current opinions, but you seem to be assuming that, given a better option, people would continue to make sub-optimal choices.
If so, I don't agree. Your small minority's job is to deliver those alternatives, and to feed the flames while the rest of the world makes the transition. Which they will do because the next thing is clearly so much better.
Have more faith in most of humanity and doubt yourself and similar others for having failed so far to disrupt this industry with better technology.
I guess it’s a good thing 99% of people don’t run the world. If products were dictated by the majority we wouldn’t have macOS, nor even iOS which is technically a minority by raw users. Obviously some users use more intensely than others, which is why catering to power users is a totally viable and legit strategy.
I get an incredible value from search engines. Google even (their shopping and book search features are very helpful). But right now, I am liking paid search as the way forward. Kagi is doing pretty good things right now. I love how I can up the weight of certain domains so that their results come in at the top without having to add site:awesomesite.com at the end of every search string. In fact, I can have 20 sites that I trust a lot that show up pinned at the top of the search for every query. It's 10 bucks a month, but I find it valuable.
I think that we are past the point the search engine could just crawl the web and rank results based on some heuristics. We need both community curation and get librarians involved with their classification systems, because in 3 years the results are going to be dominated by automated GPT-xy content farms.
Case in point: www.forkandspoonkitchen.org
The first search engine that provides community curation and manages to get most tech-savvy people on board, classifying the content for free, is going to reign in the upcoming decade as Google loses its grip.
This seems really really naive. Do you really think a non-profit is going to fight the hordes of spammers, scammers, seo masses, mechanical turk hordes, etc that are going to game your system?
Not sure if I missed the sarcasm here.
Why would a non-profit not fight spam etc.? And why would a for-profit be any more motivated to do so?
I see the largest internet companies, including Facebook, Twitter, and also Google, fight spam and other harmful content only to the degree absolutely necessary to stay somewhat usable. Which makes sense because it's costly and does not generate profit.
I would expect a non-profit, however, to focus much more on fighting harmful content because it centers around the user experience, hence quality of the content.
I don't see a guarantee this works in practice, but the respective incentives seem clear.
The general answer here is that a nonprofit cannot maintain the (massive) amount of resources it takes to address spam/abuse/blackhat SEO etc, while a for-profit entity ostensibly has a profit motive to do a decent job, and the resources to do so.
When a for profit entity is more successful at fighting abuse, their users are happier and they sell more ads and so can devote more resources to fighting spam. When a nonprofit successfully fights spam, they don't get more resources, and the spammers upgrade their toolboxes, because they do have a profit incentive.
Like @daoudc writes, Wikipedia shows that the nonprofit can maintain the resources because they attract volunteers.
If you create a search engine where users can report spam and get some form of karma for valid reports that is shown in their social network, then it's quite likely that the users have enough momentum to get ahead of the spam.
Worth noting that attempts at non-profit search engines are not new. In 2015, the Wikimedia Foundation attempted to start one called the "Knowledge Engine", using at least $250,000 from a grant. Wikimedia likely started the project as a response to Google's use, since 2012, of "knowledge panels" built from Wikipedia's Creative Commons-licensed content alongside search results, which reduced traffic to Wikipedia.
Also worth noting that Google is a significant donor to (and now enterprise customer of) Wikipedia, but it is unclear if this had any impact on Wikimedia's choice not to continue the project.
I don't think this accurately represents Wikipedia's relationship with Google. Wikipedia is thrilled with the knowledge panels. It has dramatically cut Wikipedia's hosting costs and spread its reach. You're making it sound like Wikimedia and Google were antagonists over the knowledge panels, but as I understand things it was the opposite.
Which is the problem with non-profits: the transparency requirements are at best minimal, and at a super high level all "non-profit vs for-profit" means is that there is no equity distribution, the government approves of the mission, and there is no distribution of excess cash flow.
There are frequently non-profits that use excess funds to unnecessarily expand beyond the original mission (Wikipedia, for example), or that pay staff, especially executives, way beyond what most donors realize.
To me, being a non-profit is what it is, I don’t read too much into organization being a non-profit.
As the underlying project discussed by this post is a search engine, I searched for “mwmbl” on mwmbl.org [0], and no results were found! Relevant results like the main site and GitHub repo show up when searched on Google or Kagi.
I searched for "Google" on mwmbl, and while the first page of results found many results, including Google Patents, Google Bug Hunters, and a Google Books page on a 2004 book about Anarchism, it did not find Google's home page.
I searched for "Elephant," and I got a Wikipedia page about a specific Elephant statue at Coney Island, a UK elephant charity, and a blog post about Haskell ("the elephant in the room").
It's unfair to poke fun at a very small project that admits that it is far from done yet, but it's gotta figure out a way to crack the "which pages are most likely to be relevant" problem or else it's not going to be useful.
To be honest, I didn’t entirely mean it as a comment on the search quality. But you would expect a search engine to return results about itself, and it was amusing it didn’t!
I don't know what the world needs, but I need a personalized search engine. I would like to filter out anything to do with sports. I would like to filter out articles that contain marketing jargon and technobabble. I would like to filter out articles written below a high-school reading level. And so on.
I totally like the idea, but I dare to doubt that this would solve the SEO problem. Website owners who participate in those notorious affiliate programs or earn money with ads will still use the search engine to drag people onto their generic websites, using methodologies that fit the search engine's ranking mechanisms, no matter whether those are public or not.
SEO and all its consequences seem to be inherent to the system.
You can have a semi-manual system, with user input. So websites filled with SEO get manually flagged by users. Of course, you need to trust your users as well... this can be done with a Web of Trust-style reputation system: users endorse each other, and you can build a reputation graph/tree that traces reputation and easily cull bad subtrees. If this endorsement system conserves reputation (you give a fraction of your reputation away each endorsement, and new reputation is never created), then it becomes sybil-proof, where it's not advantageous to create say millions of users to increase reputation.
Why can't all the users that are being paid to promote spam endorse each other? And won't distributed trust systems just make some people "trust billionaires" and others "trust impoverished" through no fault of their own? If everyone trusts Oprah, she'll have the same type of power people complain about billionaires having now. Basically influence. Also, anyone who's close to her gets the blessing of her influence, whereas if you're far removed you get nothing. Seems like it's just reproducing the exact thing so many are trying to counter.
Well, you need to define your objectives. No system is robust to a failure of all its actors. If every user (and even developer) is ill-intended, no system will give good results. So we need some "hopeful" (and accurate) assumptions.
One might be that the typical user can sensibly elect a few individuals to trust -- these could range from developers (who are a natural choice for trust) to activists and publicly visible individuals (even close friends). Then presumably you could adopt their trust model (such individuals could be roots in independent conservative trust webs/graphs). I think a very large number of such webs might be computationally expensive, but hopefully you'd be able to find someone you trust or start your own independent graph (if you trust no one, you'd effectively lose all anti-SEO measures, I guess). This very naturally leads to a decentralized reputation system!
Mutual endorsement, if reputation is conservative, is like giving each other reputation. If both users' reputations are equal, there's no net gain (total reputation is always constant).
If one user's reputation is higher, there's redistribution, but algorithms would need to carefully weight reputation linearly in all decisions, making redistribution not advantageous and solving any issues with sybil attacks[1].
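A toy sketch of what "conservative" means here (names and numbers invented): endorsing moves a slice of your own reputation to the endorsee, so the system-wide total never grows and a ring of sock puppets gains nothing on net.

    reputation = {"alice": 10.0, "bob": 5.0}

    def endorse(endorser, endorsee, fraction=0.1):
        # Endorsing transfers a fraction of your own reputation; nothing is
        # minted, so creating a million puppet accounts creates no reputation.
        amount = reputation[endorser] * fraction
        reputation[endorser] -= amount
        reputation[endorsee] = reputation.get(endorsee, 0.0) + amount

    endorse("alice", "bob")   # alice: 9.0, bob: 6.0 -- the total is still 15.0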
I'm hoping that having community moderated search results will limit this problem. The problem will be SEO people trying to infiltrate the community, which may be tricky to solve. But Wikipedia has had to grapple with similar problems, so I think it is solvable.
Sell weekly SEO guides to rank well and punish last week's tips. (joking)
What you want to do is put the SEO monster in front of your cart and make it do useful work. You've basically got an army of hard-working people with money to burn who will do anything. What is there to complain about?
Worth mentioning is the Alexandria.org project [0]. It is a non-profit search engine built on data from Common Crawl. The coverage is limited because of Common Crawl, but the relevance is decent. They also provide an API.
I believe one of the biggest impacts toward breaking up Google's monopoly on search is making them open up access to their index, even requiring Google to provide direct API search access for others to build alternative search products. They have a search API today, but it is prohibitively expensive to build on ($5/1000 calls).
I built a fairly popular search engine a couple of years back, but the cost of Google's search API and an increasing number of bot attacks make it difficult to justify keeping it online.
With all the great progress in large language models lately, and them being excellent text compressors, I've started to wonder if you couldn't just replace a search engine with a like 100mb file full of weights that let you query essentially google scale results except all locally.
Yeah, you picked the hugest SOTA one of them all, but there are smaller ones like this https://bellard.org/libnc/gpt2tc.html that run fine even on CPUs and might work well fine-tuned specifically on search results (or at least just code queries).
The only significant difference between those models is the amount of data, and the main (or second most important) reason you use a search engine is how much data is available on it.
If you want to search through an incredibly limited % of the web then yeah it can be a solution, but even the lamest search engine company out there would outperform a GPT-2 like model running from your laptop.
I'd quite like a browser extension that records all my searches and where I end up, just for my own review. I feel like many of my searches aren't actually searches, but I can't quantify that at the moment.
Feels like that would be good info to share, once it's depersonalised.
You could also build your own, most search engines have a specific pattern to how they encode the search term in the URL. Although, I suppose that doesn't support auto-complete
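Most engines put the term in a single query-string parameter, so pulling your own searches back out of a browsing history is not much code. A rough sketch (the parameter table below covers only a few common engines and is easy to extend):

    from urllib.parse import urlparse, parse_qs

    QUERY_PARAM = {
        "www.google.com": "q",
        "duckduckgo.com": "q",
        "www.bing.com": "q",
    }

    def extract_search(url):
        # Return the search term if the URL looks like a known engine's
        # results page, otherwise None.
        parts = urlparse(url)
        param = QUERY_PARAM.get(parts.netloc)
        if not param:
            return None
        values = parse_qs(parts.query).get(param)
        return values[0] if values else None

    extract_search("https://duckduckgo.com/?q=non-profit+search")  # 'non-profit search'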
The way I see it, Google is no longer in the business of searching websites, but in the business of ranking them, and has been for at least a decade.
I still remember helping a friend find information on the accounting balance of the public transport company of Rome, Italy, and finding the most relevant link buried deep at around page 20. The first 15 pages were almost entirely news websites with news completely irrelevant to the search query, yet they consistently ranked much higher.
The main enemy of a better search engine is casual users who are satisfied with Google's mediocrity and don't seek anything more advanced and better. Power users are the ones who suffer the most.
Google will have to reinvent itself or it will eventually destroy itself through negligence of its core business. There isn't yet a critical mass of casual users who think Google sucks; all they think is that Google is the internet. That's their intellectual level.
The internet, somewhat ironically, really needs a search engine that works in the current day. You can't find anything anymore. It's like Google has been un-invented.
Hopefully some day soon the internet will be searchable again.
Thanks to everyone involved in attempting to make this happen (preferably in a way that isn't profit-maximizing).
While monetization is certainly a great thing for a group of people wanting to go after things they and others enjoy, it's also exactly what incentivizes people to game the system, and pretty much what brought us to our current place of spam, spam, and more spam, with barely any way to discover things that are actually good.
I tend to think the quality of the content tends to be better when people don't think about stuff like user retention or subscriptions, but rather how it will actually reach people that care. Good search/curation is a key component for that.
Of course such a world free of implicit monetization will require it to be explicit (Patreon-style), but that should massively realign incentives.
The biggest challenge with making a search engine is to combat adversarial SEO. It's an issue that's very easy to be overlooked when you are small, but at Google scale, your enemies have billions of dollars to make from your visitors.
I bet Google spends at least as much to combat that, and it's extremely hard to deal with while being open-source. It's useless to call for a non-profit search engine without tackling this very core issue.
They haven't given up, they are staying exactly where they need to be.
All the mainstream search engines' priority is to maximize ad clicks/impressions (or collect data to target future impressions), either directly on their own property, or indirectly when linking to websites that embed their ads.
There's no reason why they can't detect ads or analytics and use that as a negative ranking factor (so that all other factors being equal, a non-ad-infested result would rank higher than the ad-infested one), but this would go contrary to their business model.
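As a back-of-the-envelope sketch of what "all other factors being equal" could look like (the host list and the weight are placeholders, not a real spam model):

    AD_HOSTS = ("doubleclick.net", "googlesyndication.com", "taboola.com")

    def ad_penalty(html, weight=0.05):
        # Count references to known ad/analytics hosts in the raw HTML.
        return weight * sum(html.count(host) for host in AD_HOSTS)

    def adjusted_score(base_score, html):
        # All else being equal, the less ad-infested page ranks higher.
        return base_score - ad_penalty(html)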
I think it's possible - red flags (for example blog spam or commercial sites) seem easy enough to catch for a human; probably in an automatic way too, at least as long as you're small enough that they don't specifically target you.
Google's search results are so bad that I can't really allege incompetence here, but I have to wonder whether there's some different motivation. Maybe it's that low-quality search results tend to be plastered with ads, which they get a cut of.
> The paid subscription model
> Donation funded, non-profit model
No! There is a third! You could do a search app ecosystem, where you leave the unlimited, overly complicated puzzles a search engine could address as an exercise for the user.
I always have a bazillion ideas but couldn't think of a single good phone app before mobile phones. I mean, should I want my phone to be a gaming console? It seems ridiculous. Writing is writing books, all other kinds are watered down. Do I want to write books with an onscreen keyboard? It all sounded idiotic, nothing worth using.
But the idea you mention, typing an overly popular domain name without extension should take you to the website directly... What you are trying to say, IMHO, is CLI! Search is just the fallback if the provided query/instruction doesn't make sense to any of the apps.
I can't think of many, but there are no doubt thousands of activities that could benefit from an at least somewhat themed search engine. An app could be a biochemistry web directory that ranks results from a chosen sub-folder above the normal results.
Any FOSS project or other company could create a web directory tree with the few or many pages about itself. A check box lets you pick the ones you want to query. Normal results go under those results. The biochem won't bother you when searching for pokemon.
People love my stores. What they really want is to see illustrated results from my inventory above all other results. Uncheck the box if you are not in the mood. (edit: I'm joking of course, but I do have a good few shopping apps that I actually use)
This is the kind of idiocy that makes me despise the developer community every time i see something like this. It is one of my pet peeves, so if you're going to have an opinion on this comment, please, read the whole thing.
The ad-supported free internet is one of the most important business models the world has arguably ever seen. Very few can argue with the fact that poor kids in developing countries over the past two decades and longer have had their lives changed beyond anyone's wildest dreams thanks to the free resources at their fingertips.
On the same note, much of the wealth accumulation in the developer community has been on the backs of this very business model. The immense demand for dev talent and the astronomical salaries paid out is a consequence of the difficult financial choices made by so many before us.
When I read absolutely low-effort activism such as the text in the link about how (paraphrasing) "sEaRcH eNgEnEs mAkE mOnEyY" and thus they are bad, I'm astounded at how intelligent people who can write code can simultaneously be so fucking moronic in their grasp of economics.
The web is an ecosystem. There are always going to be incentives that don't fit your moral compass that are getting optimized for and against. The answer isn't to burn it all down and shit all over a business model because it apparently doesn't fit your childish understanding of the ideal. By all means, compete, but at least try to understand the various actors and participants in this complex web of entities and what role they're playing in the flow of investment, content, data and economic activity that is far more nuanced than "wEb rEsUlTs wIll B beTtTeR iF nOT oPtImiZeD fUr $$$".
> I'm astounded at how intelligent people who can write code can simultaneously be so fucking moronic in their grasp of economics.
Absolutely nothing surprising about that. Intelligence without knowledge is not very helpful. If all you know is how to write code, you will suck with other things, even if you're intelligent.
The bigger problem is that people tend to downplay the knowledge that's required to do something, simply because they do not know how much they don't know. It gets worse the more intelligent you are because you're more confident in yourself then.
Case in point: name of this project. I've read it like 10 times on this page already yet I still can't spell it from memory. I could paraphrase:
> I'm astounded at how intelligent people who can write code can simultaneously be so fucking moronic in their grasp of marketing.
(but won't, since, as I said, it's not astounding at all; just for illustration purposes)
Great idea, and "instant search the web" would probably be a better pitch than "non-profit search engine". Interesting argument that Google doesn't do this because it isn't compatible with their ad model, but that doesn't mean a new ad-funded search engine can't do it. For Google it might mean billions of dollars in lost revenue while they adjust their ad model; a new ad-funded search engine wouldn't have this problem.
> Frictionless ... For example if you are typing “facebook” or “hmrc login” you could go straight there from the address bar.
No thanks. I sometimes do search for "company name" looking for the wikipedia article for the company, or news about the company, or information about the company in general. If you used facebook before, then it's going to autocomplete as soon as you type "face" in your addressbar, and you won't need the search engine. So if someone searches for facebook, they're either using the browser for the first time, or they're looking for information about facebook. Latter seems more likely.
- on the one hand I really want free, open and non profit services to succeed
- at the same time I greatly value the user experience
Don't get me wrong: These two things can go hand in hand. There are tons of good examples out there.
But, the closer you get to classic user-centric applications and leave the software developer bubble, the greater the discrepancy becomes in my experience. Brave, DuckDuckGo, Firefox and so on are desirable. But I always feel like I am missing out on the UX.
Google still yields better search results FOR ME (even with all those ads and clickbait).
Firefox still feels a bit dated and slow compared to Chrome.
I value the positive effects of free software so much that I am willing to accept limitations in usability in the hope that it will improve over time. But I feel like it should not be this way.
I can't support every project financially or contribute to its success as a contributor. My time and financial resources are limited.
I haven't really found a solution for this problem. My best guess is that the government should intervene in the free market and install market barriers to tame giants like Google. But this is repugnant to the liberal in me.
Tell me how this doesn't quickly devolve into a consensus-rules hellscape, where minority views are either ignored or certain minorities are artificially boosted.
There is no way that design choices (especially the ordering of results) can be made in a way that pleases everyone. So either you dumb it down to the point of meaninglessness OR you enforce a mainstream-only ruleset.
The cost of building a "general" search engine for the "whole" web is astronomically high, in the tens to hundreds of billions. It's not achievable; Google were only able to do it by growing as a business at the same time as the internet itself. I don't believe it's possible to compete with Google (or Bing) by starting at zero.
The route forward, and what should be advocated for, is a distributed network of search engines, each for a specific vertical. If it operated as a cooperative they could share expertise and technology, they could then build a “meta” search engine for the co-op that combined all the results from the specialist niches. Each member basically “owning” the “franchise” for a specific type of search or category.
So, I don’t believe a single non-profit is the answer. More a co-op type arrangement where the co-op organisation (which may be a non-profit) has a mission to advance internet search through it’s network and strategic investment.
That the world needs a non-profit search engine is near trivially true at this point. So good luck Daud.
I think the pertinent question though, is what's the best way to demonopolize search. Maybe the answer to that is non profit, maybe something else.
Google has most search users. They have an even higher (much higher) portion of search revenue, and essentially all of the sector's profits. One advantage a non-profit might have is going after the low-profit parts of search: use cases where Google is likely to be under-serving users.
Also, search isn't just websearch anymore. It's a way of calling a calculator, translating, etc. It's a text box that does stuff. The newest gen of language models may be the technical catalyst for some rapid evolution in the "clever text box" space. Google is obviously super active in this space, but shifts are a good time to get in.
Where would you skate, if you were skating towards where the search puck is going?
A recent comment here mentioned search in early browsers (1991ish). The browser would fetch all links from the current page n levels deep in the background and use that to build a local index.
I wonder if something like that could work today, only with the index being shared across the user base.
The benefit would be that it’s a decentralized system. No giant infrastructure required which needs to be paid for by a big corporation. Basically, the infrastructure needs would be outsourced to millions of devices. And for websites, users and crawlers would be the same thing. Which is to say, you cannot block one without also blocking the other.
It could also add feedback mechanisms. Active ones, such as commenting on pages and discussing them, as we do on HN. But also passive ones such as tracking how long the user interacted with the page, to score the value of pages/domains and improve the ranking algorithm.
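A toy version of the "fetch links n levels deep and index them locally" part might look like this (ignoring robots.txt, politeness, deduplication and sharing; it assumes the third-party requests library is installed):

    import re
    from collections import defaultdict

    import requests

    LINK_RE = re.compile(r'href="(https?://[^"]+)"')

    def crawl(url, depth=2, index=None):
        # Map word -> set of URLs containing it, following links depth levels down.
        index = index if index is not None else defaultdict(set)
        if depth == 0:
            return index
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            return index
        for word in set(re.findall(r"[a-z]{3,}", html.lower())):
            index[word].add(url)
        for link in set(LINK_RE.findall(html)):
            crawl(link, depth - 1, index)
        return index

The interesting (and hard) part is the sharing step: merging millions of these small local indexes without letting any one participant poison the result.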
I think Google's results could be a lot better but I'm relatively ok with my search being provided by a for profit company. Their incentive is to get me to want to use their product. A non profit with that much power might be more tempted to manipulate search in ways that suit their personal preferences.
I don't mind profits but monopolies are always bad for the consumer. I wonder what innovation would have happened in search if it had been more competitive.
It's unfortunate that with all the immense value that search engines provide the idea of paying a small monthly or annual fee to use a search engine is incomprehensible for most people.
A small monthly fee could mean a lot of money in developing countries, plus there are countries where most people don't have access to the international banking system. Such a system would make access to information nearly impossible for a lot of people.
How should I pronounce this search engine? I know naming is hard, but if you want something to be easily adopted, having a sticky and pronounceable name is paramount!
This search engine is supported by The Bill (and Melinda?) Gates Foundation, The Organization for Promotion of Democracy, The Organization for Prosperity, The Organization For Truth And Transparency And Against Fake News, The Organization Against Renegade Knowledge, The Organization For Helping Silly Citizens Think Better, The Organization For The Truth About Qatar, The Organization For Freedom And Good Things And Not At All Tied to the CIA, and some other folks.
I hope I'm not too late and this doesn't get buried - anyone interested should check out https://www.findhelp.org/ ! I work here and we are super hiring for engineers :)
Edit - ah, he means the search engine should be a non-profit. Not what I thought he meant.
> [google gets 40 billion a year from search.] I can’t even conceive how big it is. Just 1% of 1% of this would be more money than I’d know what to do with ($4m).
Ouch. I wish you the best, but that statement makes me lose hope. Employees are expensive. Servers aren't exactly cheap either. And unexpected mistakes along the way cost a lot.
I wouldn't say Google Search isn't improving because of the number of employees. Google Search is exactly what it needs to be to serve Google's interests - it's just that their interests don't align with yours.
It does, I just tried it with 102.1. You can add a "keyword" to use a search engine, and once you've done that, you can add it as a search engine to the url+search bar (the main one on top, I don't know what it's actually called) as well, and you can set it as your default search engine.
I mean, I used to have that, but I don't seem to now. Did it move? This is 102.0.
Edit: You can make a bookmark and add a keyword, but that doesn't help me use it as a search engine. It used to be you could just create one from the contextual menu, what is this convoluted process.
> Google tries to work out which sites are interesting by how long you spend on the site.
How would Google know how long you spend on the site? It only sees what links you clicked and doesn't know what happens next. (Unless the website uses Analytics, but Analytics doesn't affect search ranking.)
Let's not forget they also have a very popular browser that itself collects and sends back home a lot of usage information from most of its user base:
I think this is a bit misleading. Chrome gets access to various data to enable various features -- for example to pass your location to the website (when you allow it to do so). Browsing history is used e.g. to power URL auto-completion.
This doesn't mean that this data can be used to inform Google Search ranking. That would be very shady and potentially illegal. I work for Google, and even though I do not work specifically on Search ranking, this doesn't sound like something that could be happening.
I'd be surprised if they don't have the capability to enable "linger time" statistics that collate frequency of new web site loads, memory demands, etc for performance.
This relates to the "how can they know how long I look at a web site for" asked above - if not specifically they do at least know the answer stochastically.
> Google has an incentive to rank pages that contain Google ads because it makes them more revenue. Google has an incentive to rank profit-making sites higher so that they make more money.
They don't have to intentionally rank them higher, but this fact could prevent them from choosing to rank ad-filled sites lower, even though that would do wonders for search result quality, because ad quantity is usually a good proxy for spamminess and trash.
Appears you’re in the UK, is that where you intend to registered the non-profit? If so, in the UK, what are the real costs of forming a non-profit, keeping records, generating reports, (shut it down), etc.?
When using a VPN to access Youtube, the adverts played to you will be in the local language of the VPN destination, yet Youtube can deliver the appropriate language content. Strange that!
How many advertisers advertising in a VPN destination where a particular language is dominant would advertise in another language? YouTube doesn't care whether you can understand it. Those advertisers may not be looking to advertise to people who can't understand their language either.
How many advertisers outside of that country are going to ask YouTube to play their ads in that country in a language different from the dominant language? Probably not many.
First result is startrek.com, second result is Star Trek into Darkness IMDb but 3rd is xkcd.
It then goes off into Q and William Shatner Wikipedia links and Muppet Movie IMDB in Russian.
I tried putting a plus in front of IMDB and quoting Star Trek. It doesn't seem to be able to find Star Trek on IMDB. I admire the concept, and it is extremely fast.
I wouldn't be surprised if you don't want to talk about it, so that nobody can work out how to get around the safeguards, but what is your plan for stopping bots from trying to overwhelm the search scraper bot?
Are you looking to build a trusted network of users who can verify and validate other users' responses over some undefined period?
Not for profit? Ecosia is one. They do make money, but in general it's a not-for-profit organization that uses the majority of the money they make to plant trees.
I bet two-thirds of the HN crowd is somehow affiliated with the progress of search, ads, lead generation, analytics, user tracking (FAANG), etc. Think of their children...
For an information system built on standards --- HTML as a document markup language, HTTP as a transport layer, TLS/SSL for security, TCP/IP as an underlying networking protocol, among others --- one standard that is conspicuously missing is an indexing standard.
That is, even if a site wanted to, there's no way for it to declare "I have content related to X". Even better would be if these indices could then be distributed in a cache-and-forward model similar to how DNS (another distributed discovery index) works. There was some exceedingly rudimentary attempt at this through elements such as keyword meta tags, but even at best these referenced a vanishingly small fraction of the actual content of a site or article. Sitemaps also address a component of the problem, but again, only in part.
Some might see a few immediate issues. One is that not all sites are sufficiently dynamic to know what content they actually contain. To an extent this might be addressable through extensions to the webserver protocol, such that a server would be aware, or become aware, of what content it contained.
Another is that a site might in some instances be inclined to misrepresent what it contained. This may be hard for some to believe, but I'm given to understand it occasionally does occur. To help guard against this, there might be vetted indices, in which one or more third parties vouch for the validity of an index. These reputation-sources could of course themselves be assessed for accuracy.
But if sites were responsible for reporting on what content they actually contained, and could be constrained to doing so accurately, a huge part of the overhead in creating an independent search engine, and breaking the search-engine monopoly, would be eliminated.
One might imagine why certain existing gatekeepers over Web standards might oppose such an initiative.
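To make the idea concrete, such a self-declared index could be as small as a machine-readable file served from a well-known path, the way robots.txt and sitemap.xml work today. Everything below, including the path and the field names, is invented for illustration; nothing like it is standardized:

    import json

    # Hypothetical /.well-known/site-index.json
    site_index = {
        "site": "https://example.org",
        "updated": "2022-07-15",
        "topics": ["biochemistry", "protein folding"],
        "documents": [
            {"url": "/papers/folding.html", "keywords": ["folding", "kinetics"]},
        ],
        "vouched_by": ["https://registry.example/attestations/123"],
    }

    print(json.dumps(site_index, indent=2))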
There would still remain other problems to solve within the search space. It's possible to divide General Web Search into a set of specific problems (a toy skeleton tying a few of these stages together follows the list):
- Site crawling: this includes determining search targets, any exclusions from such lists, and performing the actual crawling. Self-indexing addresses part of this problem.
- Indexing: Mapping of actual contents to keyword and query terms which might address that content.
- Ranking: Assigning a preference / deprecation to specific sites. This is essentially a trust / reputation assessment, with a canonicity / authenticity assessment (e.g., where did a specific item or document first appear).
- SEO: This is the Red Queen's Race issue of addressing insincere / malicious actors. Strong and durable penalties for abuse, and long-term reputational accrual, should be useful here.
- Query interpretation: There's a considerable art to figuring out what a question actually means. In some cases queries should be taken strictly verbatim. Quite often, however, interpretation is necessary. How those alternatives are posed might vary, with an option not often employed presently being to suggest a range of potential interpretations or related queries which might produce better results for specific query scenarios.
- Presentation: This is generation of the search engine result page itself, incorporating several of the other considerations listed, but also addressing usability, accessibility, clarity, and other concerns.
- Revalidation: As the editors of the Hitchhiker's Guide observed, the Universe is not static, and circumstances change. Revalidating, revisiting, and revising results and reputational assessments is necessary.
- Monetisation/Funding: I'm partial to a public-goods model, or perhaps a farebox model via ISPs, pro-rated to general income/wealth within a region. Advertising, as a famous Stanford research paper prophetically observed, forces misalignment with searchers' interests and objectives.
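As promised above, a toy skeleton tying a few of these stages together. Crawling is stubbed out as a prebuilt pages dict, "reputation" stands in for ranking, and every stage is deliberately naive:

    def build_index(pages):                      # Indexing
        index = {}
        for url, text in pages.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(url)
        return index

    def rank(urls, reputation):                  # Ranking / reputational assessment
        return sorted(urls, key=lambda u: reputation.get(u, 0), reverse=True)

    def search(index, reputation, raw_query):    # Query interpretation + presentation
        terms = raw_query.lower().split()
        hits = set.intersection(*(index.get(t, set()) for t in terms)) if terms else set()
        return rank(hits, reputation)

    pages = {"https://a.example": "elephant statue coney island",
             "https://b.example": "elephant charity uk"}
    print(search(build_index(pages), {"https://b.example": 2}, "elephant uk"))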
It's perhaps a bit on the side but still part of the topic of search.
Have you noticed how newspapers systematically do not supply a clear source for their articles? It's especially prevalent on political cases where there are easy-to-link paper trails. This makes it a lot harder to find the source for their article, so you end up just taking their word for their angle on the story.
A great recent example is Biden's Executive Order on the protection of women. When the newspapers write about his EO, they're never doing it from a neutral standpoint. In this case they're either pro- or anti-abortion. But if you want to know the contents of Biden's EO for yourself, then you're forced to search for it. And depending on the search engine, that might also be hard, because search engines are politically biased too.
Just so we're clear, this post isn't pro or anti abortion. Instead it's an example on how newspapers systematically force you to take their word for their angle on any given news story. So if you want to know source material, then you're forced to search for it. And when you do search for it, you're then at the mercy of the political bias of the search engine.
For that reason I'm not so sure a non-profit search engine will make political biases go away, especially when you consider what happened to Wikipedia. While not a search engine, it is a non-profit and communal project that set out with the ideal of being truly neutral, but in the end it failed at that, and some would say spectacularly. And the main reason is exactly bullshit, or rather the BS that comes with political bias.
Don't get me wrong, it's still a great source for information, but when you search for any topic that is in any shape or form politically sensitive, then you have to know about Wikipedia's clear political bias beforehand, or else you might take their angle as gospel.
This is especially insidious when it comes to search engines and also social networks, because most people assume that what is shown to them there is neutral, or at least coming from a friendly party. But then it turns out, that's not always the case.
When you systematically get biased information, it's a democratic problem, because it prevents people from making up their own minds about political topics. Thus when people finally vote, the risk is that we get a society that does not reflect people's actual opinions.
I think most people in here have been on the receiving end of that, no matter which side of the aisle you're on. And the result is always resentment and bitterness, which in turn does not make for a healthy democratic environment.
Instead, the political bias should be more clearly visible and out in the open on newspapers, encyclopaedias and search engines alike. And while a non-profit search engine would certainly save you from corporate interests, it still won't save you from political ones, though it might be a good trade-off to save privacy.
I've noted that many newspapers don't provide external links as a rule, though that reluctance seems to be slipping somewhat. Unfortunately, that's accompanied by a tendency for those external links to rot with extreme speed and prejudice, which ... might actually justify the reluctance.
Linking an archived copy of the content has merits, though increasingly The Usual Suspect (the Internet Archive's Wayback Machine) itself has difficulty preserving complex content. (The rather less transparent Archive.Today is often superior in quality, if not necessarily in trust or reputation, and I say that relying on it heavily myself.)
I know that I've noted news organisations which do seem to link sources relatively freely, and thought I'd commented on it previously. I'm not finding any previous mentions by me here at HN (my usual personal quips trove, amongst other benefits). Though if memory serves, the New York Times tends not to provide links where it really ought to. I believe the Los Angeles Times might have a greater tendency to. Perhaps also NPR and/or The Guardian which I tend to rely on, though I'm not certain that's been my earlier observation.
In the 19th century and through a good part of the early 20th, newspapers would not only reference specific documents or speeches but publish them in whole. In large part, this is because that was the only way to distribute and reference the material. That practice has waned tremendously, and we often see material only through commentary and reference, rather than in original form. I've come to view this quite dimly.
I'm also fairly certain that copyright is a major consideration and factor here, and another way in which it's proving a disservice to the public good.
Copyright shouldn't be a consideration for the press. Not when dealing with newsworthy items, anyway. When it comes to public resources and government, I see no reason to not just link the real thing. Instead I see outlets like Vox post articles titled, "Biden's executive order on abortion, briefly explained" and so on, where we'll just have to trust that what they say is indeed true. It most likely is, of course, or they'd make fools of themselves. But when it comes from Vox, or Fox, it almost always comes with a slant.
For numerous reasons, it seems that there's an "is-ought" disconnect here, as is often the case.
Business tends to be exceedingly spooked by risk, especially long-tail unconstrained risk. And copyright litigation presents an excellent example of same.
There are also other concerns. In an era of physical print and shrinking "news holes", the actual textual content of newspapers tended to shrink, perhaps establishing a tradition of no longer printing speeches verbatim. With the attention economy of the Web, the risk of sending readers off-site is a concern I've heard voiced many times both in print and public discussions and privately amongst people I know in the press. It's quite unfortunate, but real.
The case of US Government documents, in which there is no copyright concern, is especially inexcusable. I'd agree with you strongly there.
Increasingly when I find such an opinion piece, regardless of the publication, I look for the source document and try to read it first. (I don't always follow through, but it is if nothing else an aspirational goal.)
I haven't noticed any particular bias on Wikipedia, but I acknowledge that could be because it aligns with my own biases. Can you point out an example or two?
About the funding model: it would be great if normal donations worked.
If not, I think the way some WEB3 projects are funded may be an interesting inspiration (not talking about Ponzi schemes here). Many projects are "non profit" and sell tokens before the service is 100% ready. This funds the improvement and scaling of the project. The possibility of reselling the tokens at a higher price in the future sometimes attracts token holders and often increases the "motivation" of the token holders / supporters of the project... it fuels the community (money is only part of the motivation). Here, tokens could be associated with symbolic "privileges" (badges, access to early releases), or governance (taking part in some votes).
This system clearly has some drawbacks, but it can sometimes increase the number of early users and supporters, and bring in more funding while staying a non-profit.