Hacker News

I don't know how complete the digitization of old texts is, but if you go to worldwide.espacenet.com, search for "airship", and reverse sort by date, you get documents from the 1880s.

In fact, I'm downloading a whole batch of patent texts right now because I want to experiment with semantic search on them.

Does anyone here have pointers on what the state-of-the-art method for semantic search through a large corpus would be? I've just started researching, and BERT and friends seem to have been popular about 2 years ago, but things move so fast that I don't know what I should use now.

What about a medium-sized corpus, say 100,000 pages of text?



AFAIK sentence embeddings via SBERT are still considered a pretty viable path. This may be what you were already looking at, but there's more info here: https://www.sbert.net/index.html
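
For what it's worth, here is a rough sketch of the retrieval step behind that approach: embed every document once, embed the query, and rank by cosine similarity. The names `build_index`, `search`, and `toy_embed` are made up for illustration, and `toy_embed` is only a bag-of-words stand-in so the snippet runs without any dependencies; it is not semantic. With sentence-transformers installed, I'd expect you could pass something like `SentenceTransformer("all-MiniLM-L6-v2").encode` as `embed` instead (the model name is just an example, not a recommendation from the thread).

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Embedder = Callable[[List[str]], Sequence[Vector]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(docs: List[str], embed: Embedder) -> List[Tuple[str, Vector]]:
    """Embed all documents once and keep each vector next to its text."""
    return list(zip(docs, embed(docs)))


def search(query: str, index: List[Tuple[str, Vector]], embed: Embedder,
           top_k: int = 3) -> List[Tuple[float, str]]:
    """Return the top_k (score, document) pairs, highest similarity first."""
    query_vec = embed([query])[0]
    scored = [(cosine(query_vec, vec), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in embedder: word counts over a tiny fixed vocabulary.

    Assumption for illustration only. A real setup would use a sentence
    embedding model here instead.
    """
    vocab = ["airship", "balloon", "engine", "propeller", "gas"]
    return [[float(t.lower().split().count(w)) for w in vocab] for t in texts]


if __name__ == "__main__":
    docs = ["airship with gas balloon", "engine with propeller", "gas engine"]
    index = build_index(docs, toy_embed)
    for score, doc in search("airship balloon", index, toy_embed, top_k=2):
        print(f"{score:.3f}  {doc}")
```

A brute-force scan like this is linear in the number of documents per query, so it's only a starting point. For a much larger corpus you would probably want an approximate nearest-neighbor index instead of the list scan.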



