I like open source software. I like what you are doing and that you're keeping development open.
I have been closely watching AI development. There are 10k+ apps using AI now, and every major company, from FAANG through Tier-2, 3, 4, and 5, has AI as a top priority. However, something real has to come out of all this wrapper software. I have not read the docs entirely yet, but I have a few questions that might give us an idea of whether this fits our use case.
1. Which models are you using for this? Can I switch models to open source?
2. When you say you connect to apps, how often are you pulling data from them? For example, you connect to Confluence, where tens of wikis get updated. How much of that ends up in your vector DB?
3. Most importantly, what separates you from the tens of other providers out there? Glean, as someone commented, is very similar to what you are doing.
4. How do you plan to convince SMBs and mid-size companies to use you over, say, in-house development?
5. OpenAI, Mistral, Claude, and other LLM developers can build this functionality natively into their offerings. Are you concerned about becoming obsolete or losing competitive ground? If not, why not?
Either way, this is a good direction. I will try it out tonight. Feel free to respond when you get a chance.
Hello, thanks for the kind words! With regards to your questions:
1. Are you referring to the local NLP models or the LLM? The local models are already open source models or ones we've trained ourselves. If you're talking about the LLM, the default is OpenAI but it's easy to configure other ones without any code changes.
2. Most sources are polled every 10 minutes. Updates are incremental, so if your Confluence has a million pages, probably only a dozen or so have changed in the last 10 minutes, and only those get re-indexed (the first sketch after this list shows the general pattern). The only exception is websites, which are crawled recursively, so we don't know which pages changed until we try; those are re-crawled once a day.
3. Glean is indeed similar. Without going into the features in detail, we are an open source Glean with more of an emphasis on LLMs and Chat.
4. There's generally not a great reason to build from scratch if an open source alternative with 75%+ alignment exists, and they can always build on top of us if they want. A lot of teams reach out to us because they're looking to switch from their in-house solution to Danswer. Generally though, these are larger teams; we haven't seen many SMBs building RAG for their own internal use, and the smaller teams that are building RAG are usually looking to productize it.
5. Currently there is no cheap and fast way to fine-tune LLMs every time a document is updated. If you want an LLM to remember a document that was just updated, you'd have to augment the training data with at least dozens of similar (but all correct) examples, so RAG is still the only viable option. Then there is the problem of security, since you can't enforce user roles at the LLM level; permissions have to be applied at retrieval time (see the second sketch below). So companies that focus on building LLMs don't really compete in this specific space, and they don't want to either, as they're trying to build AGI. There is more of a threat from teams like Microsoft and Google, who are indeed trying to build knowledge assistants for their product lines, but we think there is a world where open source ends up winning against the giants!
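To make the incremental-update behavior in answer 2 concrete, here is a rough, hypothetical sketch of the pattern: poll on an interval, ask the source only for documents changed since the last cursor, and re-index just that delta. The function names (`fetch_pages_updated_since`, `index_into_vector_db`) are illustrative stand-ins, not Danswer's actual connector API.

```python
# Hypothetical sketch of the incremental polling described in answer 2.
import time
from datetime import datetime, timezone

POLL_INTERVAL_SECONDS = 600  # most sources are polled roughly every 10 minutes


def fetch_pages_updated_since(cursor: datetime) -> list[dict]:
    """Ask the source (e.g. Confluence) only for pages modified after `cursor`."""
    raise NotImplementedError("replace with a real connector call")


def index_into_vector_db(pages: list[dict]) -> None:
    """Chunk, embed, and upsert only the changed pages; everything else is untouched."""
    raise NotImplementedError("replace with a real indexing call")


def poll_forever() -> None:
    cursor = datetime.now(timezone.utc)
    while True:
        time.sleep(POLL_INTERVAL_SECONDS)
        poll_started = datetime.now(timezone.utc)
        updated = fetch_pages_updated_since(cursor)  # usually only a handful of pages
        if updated:
            index_into_vector_db(updated)            # only the delta is re-embedded
        cursor = poll_started                        # next poll picks up from here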
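And for the retrieval-time permission point in answer 5, a similarly hypothetical sketch: ACLs are attached to chunks at index time and checked against the requesting user before anything reaches the LLM prompt. `Chunk` and `vector_search` are stand-ins, not Danswer's real types.

```python
# Hypothetical sketch of enforcing user roles at retrieval time (answer 5).
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    allowed_groups: set[str]  # ACLs copied from the source system at index time


def vector_search(query: str, top_k: int = 50) -> list[Chunk]:
    raise NotImplementedError("replace with a real vector DB query")


def retrieve_for_user(query: str, user_groups: set[str], top_k: int = 10) -> list[Chunk]:
    """Drop any chunk the requesting user isn't allowed to see before it ever
    reaches the LLM prompt; the LLM itself never enforces permissions."""
    candidates = vector_search(query, top_k=50)
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    return visible[:top_k]
```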
How is it "chat over private data" if you are exposing my data to more parties like OpenAI? I thought you were using a stack of self-hosted, open-weight LLMs. If the data can be sent elsewhere, it is not private data.
So "private" refers to two things here; sorry for any confusion.
When we say "chat over private data" we mean that this data isn't publicly available and no LLM has this knowledge from its training. With our system you can now ask questions about team-specific knowledge, for example, "What features did customer X ask about in our last call?" Obviously, if you ask ChatGPT this, it will have no idea.
The other part is data privacy when using the system. The software can be plugged into most LLM providers or locally running LLMs. So if your team doesn't trust OpenAI but has a relationship with, say, Azure or GCP, you can just plug into one of those instead. Alternatively, a lot of users have recently been setting up Danswer with locally running LLMs via tools like Ollama. In that case you have a truly airgapped system where no data ever leaves your environment (a rough sketch of that setup is below).
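For the fully local case, here is a minimal sketch (not Danswer's internals) of what "plugging into a locally running LLM" looks like: a standard OpenAI-style client pointed at the OpenAI-compatible endpoint Ollama serves on localhost. The `llama3` model name is an assumption; use whatever model you've pulled.

```python
# Minimal sketch of an airgapped setup: the OpenAI-compatible endpoint that
# Ollama serves locally replaces a hosted provider, so no data leaves the machine.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="not-needed-for-local",        # the client requires a value; Ollama ignores it
)

response = client.chat.completions.create(
    model="llama3",  # assumes `ollama pull llama3` has been run locally
    messages=[{"role": "user", "content": "What features did customer X ask about in our last call?"}],
)
print(response.choices[0].message.content)
```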