
These vectors are lower-dimensional than traditional sparse vectors though, aren't they? Dense embeddings typically run from the hundreds to low thousands of dimensions (roughly 128-1024), whereas a TF-IDF vector has one dimension per vocabulary term. It's also not just about being flat-out better, but about increasing recall: you pick up content that doesn't contain the query keywords but is still relevant. You're also free to mix the two approaches together in one result set, which gives you the best of both.
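
A rough sketch of what mixing the two result sets can look like, using reciprocal rank fusion (the toy corpus is made up and the random "dense" vectors just stand in for a real embedding model; only the fusion step is the point):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "how to restore a postgres database from a backup",
        "postgresql dump and restore walkthrough",
        "tuning garbage collection in the jvm",
    ]
    query = "restore postgres backup"

    # Keyword side: TF-IDF, one dimension per vocabulary term.
    tfidf = TfidfVectorizer()
    doc_tfidf = tfidf.fit_transform(docs)
    kw_scores = cosine_similarity(tfidf.transform([query]), doc_tfidf).ravel()

    # Dense side: placeholder 384-d vectors; swap in a real embedding model here.
    rng = np.random.default_rng(0)
    doc_dense = rng.normal(size=(len(docs), 384))
    query_dense = rng.normal(size=(1, 384))
    dense_scores = cosine_similarity(query_dense, doc_dense).ravel()

    def rrf(ranks, k=60):
        # Reciprocal rank fusion: 1 / (k + rank), summed across result lists.
        return 1.0 / (k + ranks)

    kw_ranks = kw_scores.argsort()[::-1].argsort()       # rank per doc, 0 = best
    dense_ranks = dense_scores.argsort()[::-1].argsort()
    fused = rrf(kw_ranks) + rrf(dense_ranks)

    for i in fused.argsort()[::-1]:
        print(f"{fused[i]:.4f}  {docs[i]}")

Fusing on ranks rather than raw scores means the TF-IDF and cosine numbers never have to be put on a common scale.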



The problems with dimensionality certainly show up even with 256 dimensions. PCA-ing down to a few hundred dimensions is still a problem, and then you have to deal with PCA lossiness too!
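
For a feel of the effect being referred to, here's a quick toy check with random unit vectors (which only stand in for real embeddings): as the dimension grows, pairwise cosine similarities concentrate, so the contrast between near and far neighbours shrinks.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (8, 64, 256, 1024):
        x = rng.normal(size=(1000, dim))
        x /= np.linalg.norm(x, axis=1, keepdims=True)    # unit-normalise
        sims = x @ x.T                                   # pairwise cosine similarities
        off_diag = sims[~np.eye(len(sims), dtype=bool)]  # drop self-similarities
        print(f"dim={dim:5d}  mean={off_diag.mean():+.3f}  std={off_diag.std():.3f}")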


Nobody used TF-IDF for vector lookups without applying PCA first, though.
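
Roughly what that reduce-then-index step looks like; TruncatedSVD (i.e. LSA) is used here since it works directly on the sparse TF-IDF matrix, where classic PCA would want it densified first. The toy corpus and the tiny component count are stand-ins: on a real corpus you'd reduce the vocabulary-sized matrix to a few hundred components, and the explained-variance number is the lossiness the parent mentions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "postgres backup and restore with pg_dump",
        "jvm garbage collection tuning guide",
        "kubernetes pod scheduling and affinity rules",
        "writing unit tests for async python code",
        "profiling slow sql queries in postgres",
        "rust borrow checker errors explained",
    ]

    tfidf = TfidfVectorizer().fit_transform(docs)
    print(tfidf.shape)    # (n_docs, vocabulary_size) -- one dimension per term

    svd = TruncatedSVD(n_components=4, random_state=0)
    reduced = svd.fit_transform(tfidf)
    print(reduced.shape)  # (n_docs, 4) -- a few hundred in practice

    # The lossiness in question: how much variance the kept components retain.
    print(f"variance retained: {svd.explained_variance_ratio_.sum():.2f}")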



