I really enjoy working with embedding. They’re truly fascinating as a representation of meaning - but also a very cheap and effective way to perform very cheap things like categorisation and clustering.
How would you approach using them in a specialized discipline (think technical jargon, acronyms etc) where traning a model from scratch is practically impossible because everyone (customers, solution providers) fiercely guards their data?
A generic embedding model does not have enough specificity to cluster the specialized terms or "code names" of specific entities (these differ across orgs but represent the same sets of concepts within the domain). A more specific model cannot be trained because the data is not available.
https://aws.amazon.com/blogs/machine-learning/use-language-e...
https://github.com/aws-samples/rss-aggregator-using-cohere-e...
I really enjoy working with embedding. They’re truly fascinating as a representation of meaning - but also a very cheap and effective way to perform very cheap things like categorisation and clustering.