Embedditor is an open-source "MS Word for embeddings" — an editor for embedding pre-processing that helps you get the most out of your vector search while saving up to 30% on embedding and storage costs.
This solution is inspired by the experiences of over 30,000 IngestAI users. Our insights revealed a common bottleneck in AI and LLM-related applications, one that goes beyond LLM hallucinations or token limits, which are far easier to resolve. The prevailing issue lies in the GIGO (garbage in, garbage out) principle.
With no one-size-fits-all approach to chunking and embedding, certain models excel with individual sentences, while others thrive on chunks of 250 to 500 tokens. Blindly splitting chunks by the number of characters or tokens, and embedding content without normalization and with up to 30% redundant noise (such as punctuation, stop-words, and low-relevance frequent terms), often leads to suboptimal vector search results and low-performing LLM applications built on semantic or generative search. The issue lay in trying to enhance vector search with existing technologies, which proved as challenging for our users as creating a polished document in a plain .txt file.
We decided to address the root problem, so we developed Embedditor — the Microsoft Word equivalent for embedding pre-processing — enabling anyone without a background in data science or technical skills to improve the performance of their vector search while saving up to 40% on embedding and storage. We've made Embedditor open-source and accessible to all because we genuinely believe that by improving vector search performance and boosting cost-efficiency at the same time, Embedditor can have a significant impact on the NLP and LLM industry.
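To make the chunking point concrete, here is a minimal stdlib-only sketch (not Embedditor's actual implementation) contrasting naive fixed-size splitting with sentence-aware packing, which keeps each embedded unit a coherent thought:

```python
import re

def fixed_size_chunks(text, max_chars=200):
    # Naive splitting by character count -- may cut sentences in half.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def sentence_aware_chunks(text, max_chars=200):
    # Pack whole sentences into chunks of up to max_chars each.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

text = ("Vector search retrieves chunks, not documents. "
        "A chunk cut mid-sentence embeds a half-formed idea. "
        "Sentence-aware chunking keeps each embedded unit coherent.")
for chunk in sentence_aware_chunks(text, max_chars=120):
    print(chunk)
```

The same packing idea extends to token counts instead of characters; the point is that the chunk boundary should fall on a semantic boundary.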
>>>FEATURES
>>>Rich editor GUI
->Join and split one or multiple chunks with a few clicks;
->Edit embedding metadata and tokens;
->Exclude words, sentences, or even parts of chunks from embedding;
->Select the parts of a chunk you want to be embedded;
->Add additional information to your embeddings, like URL links or images;
->Get clean HTML markup for your AI search results;
->Save your pre-processed embedding files in .veml or .json formats;
>>>PRE-PROCESSING AUTOMATION
->Filter out most of the 'noise', like punctuation or stop-words, from vectorization;
->Remove insignificant, frequently used words from embedding with the TF-IDF algorithm;
->Normalize your embedding tokens before vectorization;
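As an illustration of the TF-IDF filtering step above — a hedged stdlib-only sketch, not Embedditor's code — the idea is to drop stop-words and terms whose TF-IDF weight falls below a threshold before the remaining tokens are embedded (the `STOP_WORDS` set and `min_tfidf` threshold here are arbitrary examples):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "on", "to", "in", "for"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def filter_tokens(docs, min_tfidf=0.05):
    # Drop stop-words and low-TF-IDF terms from each document
    # before the remaining tokens are sent for embedding.
    n = len(docs)
    tokenized = [tokenize(d) for d in docs]
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    cleaned = []
    for toks in tokenized:
        tf = Counter(toks)
        kept = [t for t in toks
                if t not in STOP_WORDS
                and (tf[t] / len(toks)) * math.log((1 + n) / (1 + df[t])) >= min_tfidf]
        cleaned.append(kept)
    return cleaned

docs = [
    "the cat sat on the mat and the cat purred",
    "the dog chased the cat in the yard",
    "embeddings of the cleaned text are cheaper to store",
]
print(filter_tokens(docs))
```

Fewer tokens per chunk means fewer tokens billed at embedding time and smaller vectors' payloads to store, which is where the claimed cost savings come from.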
We’d like to present the .veml format [Vector Embedding Markup Language], which has the potential to greatly enhance the editing, use, and sharing of vector embeddings across applications.
Adoption of VEML brings many benefits, like:
1. Standardization: VEML provides a standardized format for pre-processing and editing vector embeddings.
2. Interoperability: It ensures better interoperability among different applications and systems that utilize vector embeddings.
3. Extensibility: Just like XML, VEML has the potential to be extensible, allowing users to add new tags and attributes to represent additional properties or metadata associated with the vector embeddings.
4. Machine Readability: A well-defined markup language is easily parseable by machines, ensuring efficient processing and manipulation of vector embeddings across software applications.
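The post doesn't show the format itself, so purely as an illustration of what an XML-like embedding markup could look like — every tag name below is a hypothetical assumption, not the actual .veml spec (see the embedditor repo for that) — a chunk with metadata and an excluded span might be written as:

```xml
<!-- Hypothetical sketch only: tag names are assumptions, not the real .veml schema. -->
<veml version="0.1">
  <chunk id="1">
    <meta>
      <source>handbook.pdf</source>
      <url>https://example.com/handbook</url>
    </meta>
    <text>Refunds are processed <skip>, of course,</skip> within 14 days.</text>
  </chunk>
</veml>
```

The appeal of such a structure is that exclusions and metadata travel with the text, so the same file can be re-embedded with a different model without redoing the editorial work.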
Which docs? I see just three files in that one repository, one of which is an example file that doesn’t specify which embedding it uses. Do we have to use some specific embedding that is documented somewhere?
Apologies for that. Let me check with my co-founder; it will be there. It would be great to know your thoughts about our GUI for editing embeddings, joining and splitting chunks, and filtering out punctuation and stop-words with one click. You can have a look at it in the /embedditor repo or in our web app at embedditor.ai.
My take is you need to do more work on the value proposition.
My first take is that I can compute embeddings with one line of Python using sbert.net, and from there it is an automated process. I have a script that generates embeddings for 80,000 documents that runs every day, and I barely think about it.
I think of a GUI and I picture somebody having to click through 80,000 documents to do the same thing; to get the same throughput I’d have to raise venture capital and hire an army of people to go click… click… click… It takes something easy and scalable and makes it difficult and expensive. It makes me think of the text retrieval experiments that Salton did with documents on IBM cards in the 1960s.
I know there is more to it than that; the simple approach is not so simple once you consider chunking and other choices that could make a big difference. But I still think the core would be a programming-language function that takes a document and returns an embedding. Some kind of suite to determine the parameters of that function (at the level of a document collection, not individual documents) could be quite useful, though I think a lot of people will want something that doesn’t have many knobs to turn.
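The "suite to determine the parameters at the collection level" idea can be sketched roughly like this — a toy stdlib-only illustration, where `embed()` is a stand-in bag-of-words model (in practice it would be a real sentence-embedding call) and the queries/answers are invented examples:

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call a model here.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def score(corpus, queries, size):
    # Fraction of queries whose best-matching chunk contains the answer.
    chunks = [c for doc in corpus for c in chunk(doc, size)]
    vecs = [embed(c) for c in chunks]
    hits = 0
    for query, answer in queries:
        qv = embed(query)
        best = max(range(len(chunks)), key=lambda i: cosine(qv, vecs[i]))
        hits += answer in chunks[best]
    return hits / len(queries)

corpus = ["refunds are processed within fourteen days of purchase "
          "shipping is free on orders over fifty dollars"]
queries = [("how long do refunds take", "fourteen"),
           ("is shipping free", "shipping")]

# Sweep chunk sizes once per collection, then embed everything with the winner.
best_size = max((8, 16, 32), key=lambda s: score(corpus, queries, s))
print(best_size)
```

That keeps the per-document path a single function call while the tuning happens once, offline, over the whole collection — exactly the "few knobs" shape the comment asks for.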
Hello Paul,
Thanks a lot for your feedback. I really appreciate it, and we'll take it into consideration for our next steps. We saw people struggling with vector search that retrieves only half of the relevant paragraph, just because it was chunked based on the number of tokens. So our first step is to let users (I'm not talking about people who know Python, NLTK, and LangChain) pre-process their embeddings — adding images, cleaning the text, and removing at least punctuation and stop-words — with a few clicks. But you're totally right: now we have to think not about pre-processing a single document but about embedding large sets of documents.
Really appreciate your time and hope to have your star or see you among our watchers.
We at IngestAI[1] are in a similar space, albeit with a different focus — we answer questions from all your internal documentation in Slack/Discord/WhatsApp. We did a Show HN just a few weeks ago[2] and it has been wild since then — we have close to 10k users and counting.
If you would like to build Slack/Discord/WhatsApp/Telegram bots that answer your customers' questions from internal docs, please try us.
The problem could still occur, but you would have to be capturing all the queries to your internal LLM systems and then using that data for training. You have complete control of the model, so you could simply choose not to do that, and I would think data leaks of this nature would be less of a concern in an internal environment anyway: you would know that only authorized individuals have access to the data. I suppose there could still be a very small chance of leaking data to unauthorized employees, but if a rogue employee wants data they should not have access to, fishing an LLM would probably be the least productive way to get it — your access logs for the LLM system would clearly display the attempts.
Some commercial services are starting to offer "Enterprise" licenses that prohibit the collection and use for training of your data and that would address the concern as well.
Would you be able to recommend a good startup or service that provides good data-privacy governance for startups? We would like to learn more and get this right. But you're correct to some degree — we're builders, and we need the help of professionals with data governance.
Ok so, you replied to everyone on this sub-thread except the top-level comment. Why is that?
Your first task to improve your privacy policy is to review whether you really, absolutely, for reals, can require OpenAI to follow this: "they are obligated not to disclose or use the information for any other purpose."
Because, it looks like you can't, and OpenAI will absolutely use your customers' data for their own purposes, so you probably should remove this line from your privacy policy at minimum.
Sorry, I could not get the meaning of 12? From your knowledge base, we use LangChain to get the right context and answer. How do you envisage a solution of this sort working? Happy to learn and make it better.
Thanks for your comment! Yep, we'd like to define it during customer discovery, after we gather a lot of feedback and crystallize our understanding of the features users like and ask us to add. Would you be so kind as to say how much this kind of product should cost, from your POV?
GitHub: https://github.com/embedditor/embedditor
We hope you love it, and we would love to hear your feedback.