Hacker News

Am I understanding it right that for each new text (tweet) you first generate its embedding, try to match it against the existing vector embeddings of all other texts (full text or bag of words), and then send the text to the LLM for tag classification only if no match is found, otherwise assigning it the tag of the matched text?
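If that reading is right, the pipeline can be sketched roughly like this (a minimal sketch: `embed` and `classifyWithLLM` stand in for the real embedding model and LLM call, and the 0.9 similarity threshold is an arbitrary assumption):

```typescript
type Tagged = { text: string; embedding: number[]; tag: string };

// Plain cosine similarity between two vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Reuse the nearest neighbour's tag if it is close enough,
// otherwise fall back to the LLM classifier and store the result.
function assignTag(
  text: string,
  embed: (t: string) => number[],          // assumed embedding model
  store: Tagged[],
  classifyWithLLM: (t: string) => string,  // assumed LLM call
  threshold = 0.9                          // arbitrary cut-off
): string {
  const e = embed(text);
  let best: Tagged | null = null;
  let bestSim = -1;
  for (const item of store) {
    const sim = cosine(e, item.embedding);
    if (sim > bestSim) { bestSim = sim; best = item; }
  }
  const tag = best && bestSim >= threshold ? best.tag : classifyWithLLM(text);
  store.push({ text, embedding: e, tag });
  return tag;
}
```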

Would it be any better if you sent the list of existing tags along with each new text to the LLM and asked it to classify into one of them or generate a new tag? Possibly even skipping embeddings and vector search altogether.
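That variant amounts to putting the existing tags in the prompt (a sketch; the prompt wording is made up, not from the project being discussed):

```typescript
// Build a classification prompt that offers the existing tags and
// allows the model to invent a new one if none fits.
function buildPrompt(text: string, tags: string[]): string {
  return [
    "Classify the following text into exactly one tag.",
    `Existing tags: ${tags.join(", ")}`,
    "Prefer an existing tag; only invent a new one if none fits.",
    "Reply with the tag only.",
    "",
    `Text: ${text}`,
  ].join("\n");
}
```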


Yeah. But I think one problem could be that the tag list grows very large and overflows the context window.

I was thinking of giving the LLM a tool `(query: string) => string[]` that retrieves a list of matching labels, so it can check whether a label already exists.
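A minimal version of that tool could be a substring match over the known labels (a sketch; a real implementation might use fuzzy or embedding search instead):

```typescript
// Tool the LLM can call: return known labels matching a query string.
function makeLabelTool(labels: string[]): (query: string) => string[] {
  return (query: string) =>
    labels.filter((l) => l.toLowerCase().includes(query.toLowerCase()));
}
```

The model would call this before deciding whether to create a new label, which keeps the full label list out of the context window.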

But the above approach sounds similar to OP's, where they use embeddings to achieve the same thing.


I think your understanding is correct.

I actually built a project for tagging posts exactly the way you described.

