The idea behind search itself is very simple, and it's a fun problem domain that I encourage anyone to explore[1].
The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.
A DBMS-backed approach breaks down surprisingly fast. It's probably perfectly fine if you're indexing your own website, but it will likely choke on something the size of English Wikipedia.
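To make "breaks down fast" concrete: the naive DBMS version is a substring LIKE scan, which can't use an ordinary B-tree index, so every query reads the whole corpus. A minimal sketch of that naive approach (hypothetical table and column names) using Python's sqlite3:

    import sqlite3

    conn = sqlite3.connect("corpus.db")
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, url TEXT, body TEXT)")

    def naive_search(term, limit=10):
        # A leading-wildcard LIKE defeats normal indexes, so this is a full
        # table scan: fine for a personal site, hopeless at Wikipedia scale.
        return conn.execute(
            "SELECT url FROM docs WHERE body LIKE ? LIMIT ?",
            ("%" + term + "%", limit),
        ).fetchall()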
> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.
Large amounts of data seem obviously difficult.
For your second difficulty, "handling underspecified queries": it seems to me that's a subset of the problem of, "given a query, what are the most relevant results?" That problem seems very tricky, partially because there is no exact true answer.
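The standard way to make "most relevant" concrete is to commit to a scoring formula like BM25 and accept that it's a heuristic, not a true answer. A toy sketch (assuming pre-tokenized documents, nothing like what a production engine actually does):

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        # query: list of terms; docs: list of token lists. Higher score = "more relevant".
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter(t for d in docs for t in set(d))  # document frequency per term
        scores = []
        for d in docs:
            tf = Counter(d)
            s = 0.0
            for t in query:
                idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
                s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(s)
        return scores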
Marginalia Search is great as a contrast to engines like Google, in part because Google chooses to display advertisements as the most relevant results.
I think in today's world the harder problem is evading SEO spam. A search engine is in a constant war with adversarial players, who need you to see their content for revenue rather than the actual answer.
This necessitates a constant game of cat and mouse, where you adjust your quality metric so SEO shops can't figure it out and capitalise on it.
It sure helps, though there's still a lot of adversarial content you need to deal with, so it's not a solved problem even if you remove the conflict of interest.
There are more kinds of search engines than just internet search engines. At this point I'm almost certain that the non-internet search engines of the world are much larger than internet search engines.
Edit: And I’m getting downvoted for this. If it’s because I am tangential to the original comment then that’s fair. If it’s because you think I’m wrong, I have worked on the two largest internet search engines in the world and one non-internet search engine that dwarfed both in size (although different in complexity).
You’ve got to remember that google/bing do not index the entire internet. Part of their magic is selectively indexing only a tiny sliver and still being effective.
Other kinds of search systems have to index everything, which simplifies things but has its own scaling challenges.
The easiest way to think about it: while the majority of webpages are never indexed, every blob of text (social media posts, private messages, emails, documents, etc.) in every major app in the world, including the ones with billions of users, is indexed in a search engine for that app:
- GSuite search (think of how many gmails are searchable in the world right now… and they are all indexed)
- the enterprise search powering ChatGPT, Claude (these may be there by now; if not, they are likely well on the way)
- The Microsoft 365 search (this is probably massive with so many corporate email systems and teams systems on it)
- slack search
- X (Twitter) search
- TikTok search (this idk, I’ve never used TikTok, but if every video and every comment is searchable then this is probably huge)
- Facebook search (especially since this is likely combined across its product suite)
These are probably all larger in effective size than google or bing.
What is the order of magnitude of the largest document store that you can practically work with from SQLite on a single thousand-dollar server running some text-heavy business process? For text search, roughly how big a corpus can we practically search if we're budgeting... let's say five seconds per query, twelve queries per minute?
If you held a gun to my head and forced me to make a guess I'd say you could push that approach to order of 100K, maybe 1M documents.
If sqlite had a generic "strictly ascending sequence of integers" type[1] and would optimize around that, you could probably push it farther in terms of implementing efficient inverted indexes.
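That's essentially what a posting list is: per term, a strictly ascending list of doc ids, which you can delta-encode for compression and intersect with a merge. A rough sketch of the idea (plain Python, outside SQLite, just to illustrate):

    def delta_encode(doc_ids):
        # Posting lists are strictly ascending, so store gaps between ids;
        # the gaps are small numbers that compress well (varints, bit packing, etc.).
        prev, gaps = 0, []
        for d in doc_ids:
            gaps.append(d - prev)
            prev = d
        return gaps

    def intersect(a, b):
        # AND of two sorted posting lists in a single merge pass.
        i = j = 0
        out = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return out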
From my experience, SQLite's FTS5 is orders of magnitude more performant than that: for 100K documents, 7 queries/second on some of the cheapest 1 vCPU virtual machines.
But it is true that a specialized search engine using a more clever algorithm might be another order of magnitude faster.
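For reference, the FTS5 setup being measured is only a few lines (a minimal sketch; table name, sample data, and tokenizer are arbitrary, and it assumes your sqlite3 build has FTS5 compiled in, which most do):

    import sqlite3

    conn = sqlite3.connect("corpus.db")
    # FTS5 maintains the inverted index for you; porter stemming is optional.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs_fts USING fts5(url, body, tokenize='porter')"
    )
    conn.execute(
        "INSERT INTO docs_fts (url, body) VALUES (?, ?)",
        ("https://example.com/", "some page text to index"),
    )

    # MATCH hits the index; bm25() ranks without scanning the whole corpus
    # (lower bm25() values are better, hence the ascending ORDER BY).
    rows = conn.execute(
        "SELECT url FROM docs_fts WHERE docs_fts MATCH ? ORDER BY bm25(docs_fts) LIMIT 10",
        ("page text",),
    ).fetchall()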
Thank you very much for the recommendation. I am in the process of building knowledge base bots, and am confronted with the task of creating various crawlers for the different sources the company has. And this book comes in very handy.
> The difficulties in search are almost entirely dealing with the large amounts of data, both logistically and in handling underspecified queries.
I would expect the difficulty to be deciding which item to return when there are multiple that contain the search term. Is wikipedia's article on Gilligan's Island better than some guy's blog post? Or is that guy a fanatic who has spent his entire life pondering whether Wrongway Feldman was malicious or how Irving met Bingo Bango and Bongo?
Add in rank hacking, keyword stuffing, etc. and it seems like a very hard problem, while scaling... is scaling? ¯\_(ツ)_/¯
It’s not like ElasticSearch lacks ranking algorithms and control thereof. But it can require tuning and adjustment for various domains. Relevancy is, after all, subjective.
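e.g. the usual first tuning knob is per-field boosting in the query DSL. A sketch of a standard multi_match query (the index name, field names, and unsecured localhost endpoint are all assumptions), sent with nothing but the standard library:

    import json
    import urllib.request

    # Weight title matches 3x over body matches; the names here are made up.
    query = {
        "query": {
            "multi_match": {
                "query": "gilligan's island",
                "fields": ["title^3", "body"],
            }
        }
    }
    req = urllib.request.Request(
        "http://localhost:9200/docs/_search",
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    hits = json.load(urllib.request.urlopen(req))["hits"]["hits"]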
[1] The SeIRP e-book is a good (free) starting point https://ciir.cs.umass.edu/irbook/