Hacker News | tifa2up's comments

Interesting project. Curious why Electrobun over Tauri here? Tauri has a much larger ecosystem and is Rust-based for better performance.

Thank you! The main reason was DX. Tauri uses Rust for the backend while Electrobun uses TS + Bun. It's honestly been a breeze so far.

Most developers don't write Rust, so being in Rust is a disadvantage for them. And end users don't care what language is used.

https://agentset.ai/

Open-source RAG infrastructure. Every team I talk to has the same experience: RAG works in the demo, breaks in production.

We handle everything from ingestion through retrieval, with optimizations baked in: 97.9% on HotpotQA vs. 88.8% for standard RAG. Model-agnostic, 22+ file types, built-in citations, MCP server. MIT licensed.

https://github.com/agentset-ai/agentset
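
Roughly, the developer-facing flow is ingest-then-query. A simplified sketch below; the class, endpoints, and field names are made up for illustration and are not our actual SDK, so check the repo for the real API:

    // Hypothetical client: names, endpoints, and types are illustrative
    // only, not the real agentset SDK.
    interface Citation {
      documentId: string;
      snippet: string;
    }

    interface QueryResult {
      answer: string;
      citations: Citation[];
    }

    class RagClient {
      constructor(private apiKey: string, private baseUrl: string) {}

      // Chunking, embedding, and indexing happen server-side.
      async ingest(namespace: string, fileUrl: string): Promise<void> {
        await fetch(`${this.baseUrl}/v1/${namespace}/ingest`, {
          method: "POST",
          headers: { Authorization: `Bearer ${this.apiKey}` },
          body: JSON.stringify({ url: fileUrl }),
        });
      }

      // Retrieval + generation with citations in one call.
      async query(namespace: string, question: string): Promise<QueryResult> {
        const res = await fetch(`${this.baseUrl}/v1/${namespace}/query`, {
          method: "POST",
          headers: { Authorization: `Bearer ${this.apiKey}` },
          body: JSON.stringify({ question }),
        });
        return (await res.json()) as QueryResult;
      }
    }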



https://agentset.ai/leaderboard/embeddings has a good rundown of other open-source embedding models.


I'm building https://github.com/agentset-ai/agentset, RAG as a service that works quite well out of the box.

We achieve this performance by baking in best practices before any tweaking.


How does it handle retrieval in a multi-turn conversation? Is there an intent graph involved?

Does it summarize past context or keep it all?


Right now it's single-shot. We're looking into building "Agentic Retrieval" based on the Claude ADK; TBD how it'll work.


So retrieve once on the first message, and then use that context for the rest of the conversation?
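
One common pattern for this (not necessarily what Agentset will ship) is to condense the conversation into a standalone query before each retrieval pass. A minimal sketch using the OpenAI SDK; the model choice and prompt wording are assumptions:

    import OpenAI from "openai";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    type Turn = { role: "user" | "assistant"; content: string };

    // Rewrite the latest user message into a self-contained search query,
    // resolving pronouns/references from earlier turns, so retrieval can
    // run on every message instead of only the first one.
    async function condenseQuery(history: Turn[], latest: string): Promise<string> {
      const res = await openai.chat.completions.create({
        model: "gpt-4.1-mini", // assumption: any fast instruction-following model works
        messages: [
          {
            role: "system",
            content:
              "Rewrite the user's latest message as a standalone search query. " +
              "Resolve pronouns and references using the conversation history. " +
              "Return only the query.",
          },
          ...history,
          { role: "user", content: latest },
        ],
      });
      return res.choices[0].message.content ?? latest;
    }

The condensed query is then embedded and retrieved against on every turn, so the context stays fresh without re-sending the whole conversation.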


We tried GPT-5 for a RAG use case and found that it performs worse than 4.1. We reverted and didn't look back.


4.1 is such an amazing model in so many ways. It's still my No. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x GPT-5's). Definitely the best non-reasoning model out there for real-world tasks.


Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.


For large contexts (up to 100K tokens in some cases). We found that GPT-5:

a) has worse instruction following and doesn't stick to the system prompt

b) produces very long answers, which resulted in a bad UX

c) has a 125K context window, so extreme cases resulted in an error
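
For anyone hitting the same window errors: a crude guard is to cap the retrieved chunks at a token budget before building the prompt. Sketch below; the chars/4 estimate is a rough stand-in for a real tokenizer such as tiktoken:

    // Rough token estimate: ~4 characters per token for English text.
    // Swap in a real tokenizer (e.g. tiktoken) for production use.
    const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

    // Keep the highest-ranked chunks that fit in the budget; assumes
    // `chunks` is already sorted by relevance.
    function fitToBudget(chunks: string[], budgetTokens: number): string[] {
      const kept: string[] = [];
      let used = 0;
      for (const chunk of chunks) {
        const cost = estimateTokens(chunk);
        if (used + cost > budgetTokens) break;
        kept.push(chunk);
        used += cost;
      }
      return kept;
    }

    // Example: with a 125K window, leave headroom for the system prompt
    // and the model's answer.
    declare const retrievedChunks: string[]; // ranked results from your retriever
    const context = fitToBudget(retrievedChunks, 100_000).join("\n\n");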


Interesting. https://www.robert-glaser.de/prompts-as-programs-in-gpt-5/ claims GPT-5 has amazing!1!! instruction following. Is your use case very different, or is this yet another case of "developer A got lucky, developer B tested more things"?


I think it varies by use case. It didn't do well with long context.


ChatGPT when using 5 or 5-Thinking doesn’t even follow my “custom instructions” on the web version. It’s a serious downgrade compared to the prior generation of models.


It does "follow" custom instructions, but more as a suggestion than a requirement (compared to other models).


Ah, the 100K/125K range is what poses problems, I believe. GPT-5's scores should go up if you process contexts that are 10 times shorter.


How do you objectively tell whether a model "performs" better than another?


Not the original commenter, but I work in the space. We have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.
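
For instance, retrieval quality can be scored as recall@k against the gold evidence. A minimal sketch; the dataset shape and `retrieve` function are assumptions about how such a pipeline is wired up:

    interface EvalExample {
      question: string;
      goldIds: string[]; // IDs of the annotated "gold" evidence passages
    }

    // Average fraction of gold passages found in the top-k retrieved
    // results (recall@k). Higher is better.
    function recallAtK(
      dataset: EvalExample[],
      retrieve: (question: string) => string[], // ranked passage IDs
      k: number,
    ): number {
      let total = 0;
      for (const ex of dataset) {
        const topK = new Set(retrieve(ex.question).slice(0, k));
        const hits = ex.goldIds.filter((id) => topK.has(id)).length;
        total += hits / ex.goldIds.length;
      }
      return total / dataset.length;
    }

Run the same dataset through each candidate model's retrieval stack and the comparison collapses to a single number per model.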


> but I work in the space

Ya, the original commenter likely does not work in the space - hence the ask.

> the evaluation of new models is actually very quantitative.

While you may be able to derive a percent-correct score (and hence something quantitative), these evaluations are by their nature subjective: Q&A on written subjects involves judgment calls. Example benchmark: https://llm-stats.com/benchmarks/gpqa. Even though there are techniques to reduce overfitting, it isn't eliminated. So it's still very much subjective.


So… You did look back then didn’t look forward anymore… sorry couldn’t resist.


Don't solve it at the STT level. Get the raw transcription from Gemini, then pass the output to an LLM to fix company names and make other corrections.

Happy to share more details if helpful.


Yeah, I've done this with industry-specific acronyms and it works well. Generate a list of company names and other terms the model gets wrong, and give it definitions and any other useful context. For industry jargon, example sentences are good, but that's probably not relevant for company names.

Feed it that list and the transcript along with a simple prompt along the lines of: "Attached is a transcript of a conversation created from an audio file. The model doing the transcription has trouble with company names/industry terms/acronyms/whatever else and will have made errors with those. I have also attached a list of company names/etc. that may have been spoken in the transcribed audio. Please review the transcription and output a corrected version, along with a list of all corrections that you made. The list of corrections should include the original version of the word that you fixed, what you updated it to, and where it is in the document."

If it's getting things wrong, you can also ask it to explain why it made each change, and use that to iterate on your prompt and on the context you're giving it with your list of words.
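
Wired up, that step might look something like this (the model choice and glossary shape are assumptions; the prompt is a condensed version of the one above):

    import OpenAI from "openai";

    const openai = new OpenAI();

    // Post-process an STT transcript with a glossary of known terms.
    // "gpt-4.1" is an assumption; any strong instruction-following
    // model should work here.
    async function correctTranscript(
      transcript: string,
      glossary: { term: string; note: string }[],
    ): Promise<string> {
      const terms = glossary.map((g) => `- ${g.term}: ${g.note}`).join("\n");
      const res = await openai.chat.completions.create({
        model: "gpt-4.1",
        messages: [
          {
            role: "system",
            content:
              "You fix speech-to-text errors. The transcription model has " +
              "trouble with the company names and terms listed by the user. " +
              "Output a corrected transcript, then a list of every change " +
              "(original, corrected, location).",
          },
          {
            role: "user",
            content: `Known terms:\n${terms}\n\nTranscript:\n${transcript}`,
          },
        ],
      });
      return res.choices[0].message.content ?? transcript;
    }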


Which specific model do you use?


I've had some luck with this in other contexts. Get the initial transcript from STT (e.g. Whisper), then feed that into Gemini with a prompt giving it as much extra context as possible. For example: "This is a transcript from a YouTube video. It's a conversation between X people, where they talk about Y and Z. Please clean up the transcript, paying particular attention to company names and acronyms."


I've done the same, it works very well.


Yes, we got 187 self-serve users (all on the free plan). And are in talks with an enterprise now.


You typically add a lot of metadata to each chunk's text, both to be able to filter on it and to include it in citations. Injecting metadata means working out which metadata adds helpful context for the LLM; when you pass the results to the LLM, you pass them in a format like this:

Title: ...
Author: ...
Text: ...

for each chunk, instead of just passing the text.
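
Concretely, the formatting step might look like this (the field names are placeholders for whatever metadata your chunks actually carry):

    interface Chunk {
      text: string;
      metadata: { title?: string; author?: string };
    }

    // Render each retrieved chunk with its metadata so the LLM gets
    // the extra context (and can cite it), instead of bare text.
    function formatForPrompt(chunks: Chunk[]): string {
      return chunks
        .map((c, i) =>
          [
            `[${i + 1}]`,
            c.metadata.title ? `Title: ${c.metadata.title}` : null,
            c.metadata.author ? `Author: ${c.metadata.author}` : null,
            `Text: ${c.text}`,
          ]
            .filter((line): line is string => line !== null)
            .join("\n"),
        )
        .join("\n\n");
    }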


Quite a decent hit. Local models don't perform very well on long contexts. We're planning to support a local-only, offline setup that people can host without additional dependencies.

