Thanks for the great feedback :-) This is what searchmysite.net is attempting to do - help make "surfing the web" a fun leisure activity once more. It is good to see that more people seem to get that point now. When it was on HN nearly 3 years ago[0], many people saw a search box and thought it must be a Google replacement, but were disappointed to find it wasn't. And I guess now more than ever it is useful to have a way of finding content on the web which has been made by humans rather than AI.
At a big corporate, we had an Apache Solr based search with some reasonably clever lemmatization, stats analysis and spell check config to suggest alternative searches if not many results were found for the original query. One day someone reported an unfortunate edge case which caused a bit of a panic: if you searched "annual report" it returned "did you mean anal report?" (we were in the finance sector rather than the medical sector, but there were a lot more documents in the corpus containing words like analysts, analysis, analytics etc). Anyway, the point is yes, it is great to have that sort of functionality, but it does come at a cost, and a small project like this might prefer to keep it simple.
Generating suggestions from something other than what your users have already given you is inevitably going to result in something different and potentially offensive being shown to them.
One solution is to offer suggestions from a list of previous searches.
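A minimal sketch of that idea, assuming the previous searches are available as a plain list of strings. The function name, the sample history and the 0.8 similarity cutoff are all made up for illustration, and it uses Python's standard difflib rather than anything Solr provides:

```python
import difflib

def suggest_from_history(query, previous_searches, cutoff=0.8):
    """Return the closest previously seen search, or None if nothing is close.

    Suggestions only ever come from what users have already typed, so the
    engine never invents a term of its own (hypothetical helper, not Solr).
    """
    normalized = query.strip().lower()
    # De-duplicate the history and drop the query itself so we never
    # suggest exactly what the user just typed.
    candidates = sorted({s.strip().lower() for s in previous_searches} - {normalized})
    matches = difflib.get_close_matches(normalized, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Illustrative history only.
history = ["annual report", "analyst coverage", "quarterly results"]
print(suggest_from_history("anual report", history))
```

The history can of course contain offensive searches too, so in practice it would probably still want some review or a minimum-frequency threshold.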
Also, that is very much a big corporate problem: I imagine most searchmysite users are mature and stable enough not to have a meltdown at the word "anal".
But I agree with your point, sometimes seemingly small features take a disproportionate amount of support, and this could be one of them!
That's right. Most search engines are funded by advertising, where there is a clear conflict of interest[0], not to mention an incentive for spam etc. Alternative models include a subscription fee (which I don't think would work for a small niche search like this) and donations (which may or may not be sustainable). Looking through some of the support forums for the big search engines, I'm pretty sure that enough site owners would pay a fee for support to cover the running costs of a large search engine, although for a smaller search engine like this there needs to be something more than just support, hence the search as a service features.
[0] "Advertising funded search engines will be inherently biased towards the advertisers and away from the needs of consumers", to quote Sergey Brin and Lawrence Page in their "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper from 1998.
The LLM was for an experiment in retrieval augmented generation, i.e. "a chat with your website" style interface, using Apache Solr as the vector store. Results (on a small self-hosted LLM to keep costs manageable) weren't good enough for the functionality to be fully rolled out, so the LLM has been disabled and is likely to be fully removed.
Postgres is just used for the site admin, i.e. keeping track of submissions, review status, subscriptions etc. The actual search index is in Apache Solr. In theory you could use Solr to store all the admin data, but it is generally not recommended to use a Solr style document store to master data. I guess something more lightweight like SQLite could be used, but it is intended to be deployed on servers and Postgres isn't too resource intensive.
A couple of references to the Nazis, but no reference to the Nazi book burnings, an incredibly symbolic physical manifestation of knowledge and information destruction, which I'd have thought would be very relevant in this context, i.e. in the praise of physical books? Perhaps it wasn't mentioned because it doesn't quite fit in with the narrative of digital being all bad, given digital knowledge can be more resistant to suppression and physical destruction.
Also some great quotes from 30 years ago, e.g. Carl Sagan's "when awesome technological powers are in the hands of the very few" the nation would "slide, almost without noticing, back into superstition and darkness". But did it actually have to end up this way? And is it still possible (with enough collective willpower) to push Big Tech profiteering back enough to deliver some of the society enhancing changes originally envisioned in the mid-1990s? Just as it took decades for the full positive implications of the invention of the printing press to come to fruition, perhaps we still need more time before we decry the internet as a net negative?
> an incredibly symbolic physical manifestation of knowledge and information destruction
Important distinction here, book burnings are an example of knowledge destruction, but not all information is knowledge, and not all knowledge is truth.
That is why this isn't applicable to the internet age, or in fact even the reverse is true. In an environment of digital mass communication there's much more information than knowledge, and the way to destabilize knowledge and truth is not to destroy knowledge but to flood you with information. This is why the most important skill today has shifted from finding knowledge to filtering out noise. The Nazi of today isn't going to hunt a library for a book, he's instead going to create an environment so entropic that truth and fiction become indistinguishable.
And that's also of course why you find people in that camp today as defenders of free flow of information. Because you need to realize that the signal to noise ratio has been turned on its head. When Google deletes 90% of my emails this isn't because they pursue evil plans like someone who burns 90% of a library down, quite the opposite, it's the only way I don't end up being scammed.
My children were given a soft toy a few years back from a relative who had bought it from a Chinese street market while on holiday in China. When it was switched on it jumped about frantically and sang a very loud and shrill song. Not 100% sure which language it was, but it is entirely possible it was some form of Chinese street music, and certainly fits the article's description of "Mainland Chinese recordings" as "shouty, harsh and ear-piercing". Normally my children love things that adults find annoying, but even they were afraid of this one.
Those are the good bots, which say who they are, probably respect robots.txt, and appear on various known bot lists. They are easy to deal with if you really want. But in my experience it is the bad bots you're more likely to want to deal with, and those can be very difficult, e.g. pretending to be browsers, coming from residential IP proxy farms, mutating their fingerprint too fast to appear on any known bot lists, etc.
Right, if you add up the named bots in my list it only comes to about 1.5k. There are another 1-2k bots per day pretending to be browsers, but I am okay with that.
It's just the malicious ones I ban. And indeed I've banned nearly every hosting service in Wyoming (where shady companies don't have to list their benefactors and it's all malicious actor fronts) and huge ranges of Russian and Chinese IP space. My list of banned IP ranges is too long for an HN comment.
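For anyone wondering what checking against such a list looks like, here is a rough sketch using Python's standard ipaddress module. The ranges below are placeholders from the RFC 5737 documentation blocks, not the actual ban list, and the names are hypothetical:

```python
import ipaddress

# Placeholder CIDR ranges (RFC 5737 documentation blocks), NOT a real ban list.
BANNED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

def is_banned(client_ip, banned=BANNED_RANGES):
    """Return True if client_ip falls inside any banned CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    # Membership is False when address and network IP versions differ.
    return any(addr in net for net in banned)

print(is_banned("192.0.2.55"))
print(is_banned("203.0.113.9"))
```

A linear scan like this is fine for a few hundred ranges; a very long list would usually be pushed down to the firewall or web server instead.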
I used to work in an office which briefly had a commercially available 4 kW microwave in the coffee area. I used to like it because it was fast. Unfortunately several other people failed to appreciate that you had to take the 800 W timings and divide by 5, and it was quickly removed after a few of them set fire to their food.
This matches my experience. I ran one of my side-projects on AWS for a couple of years before switching to Hetzner - AWS was around £35 a month while Hetzner was around £7 a month, so Hetzner was around 80% cheaper for an equivalent service[0]. The other big thing was all the little costs in AWS - it took 2 months to get the AWS bill down to £0 due to all the hidden extras like backups and Elastic IP addresses.
[0] https://news.ycombinator.com/item?id=31395231