
How does this choose when to speak back? (Is it after a pause, or some other heuristic?) I tried looking through the source to find this logic.



It waits for sufficient silence to determine when to stop recording the voice and send it to the model. There are other modes in the source as well, along with methods for setting the length of silences so the audio can be chunked up and sent a bit at a time, but I imagine that is either a work in progress or not planned for this demo.
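
For the curious, endpointing on silence usually looks roughly like this. A minimal sketch of energy-based silence detection, not the demo's actual code; the frame size, RMS threshold, and silence length are my own assumed values:

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical parameters -- the real demo's values may differ. */
    #define SAMPLE_RATE   16000
    #define FRAME_SAMPLES 480      /* 30 ms at 16 kHz */
    #define RMS_THRESHOLD 0.01f    /* below this, a frame counts as silence */
    #define SILENCE_MS    1000     /* stop after 1 s of continuous silence */

    /* Returns true once enough consecutive silent frames have passed,
     * i.e. it's time to stop recording and send the audio to the model. */
    bool frame_ends_utterance(const float *frame, size_t n, int *silent_ms)
    {
        float energy = 0.0f;
        for (size_t i = 0; i < n; i++)
            energy += frame[i] * frame[i];
        float rms = sqrtf(energy / (float)n);

        if (rms < RMS_THRESHOLD)
            *silent_ms += (int)(1000 * n / SAMPLE_RATE);
        else
            *silent_ms = 0;        /* any speech resets the counter */

        return *silent_ms >= SILENCE_MS;
    }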


Thanks

I was surprised they didn't combine this work with the streaming whisper demo, so I guess I will implement that for iOS/macOS: stream whisper results in realtime without waiting on an audio pause, but, as you say, use the audio pauses and other signals like punctuation in the results to decide when to run the LLM completion. It also makes me wonder about streaming whisper results into the LLM incrementally, before they're ready for completion.
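
Something like this is what I have in mind for the "when to complete" decision. Purely a sketch; the function name and thresholds are made up for illustration:

    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical heuristic: trigger an LLM completion when the streaming
     * transcript looks finished. Thresholds are invented for illustration. */
    bool should_run_completion(const char *transcript, int ms_since_last_word)
    {
        size_t len = strlen(transcript);
        if (len == 0)
            return false;

        char last = transcript[len - 1];
        bool ends_sentence = (last == '.' || last == '?' || last == '!');

        /* Complete quickly on sentence-final punctuation, or after a
         * longer pause even without it. */
        if (ends_sentence && ms_since_last_word > 300)
            return true;
        return ms_since_last_word > 1200;
    }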


It may be using the streaming demo. The reason I could answer your question is that I had modified the streaming demo myself for personal use before. I think there are bugs in the silence detection code (as of a few months back; maybe fixed by now). Maybe what we are seeing in this demo is just the "silence detection" setting waiting for very long pauses; I believe it's configurable.


I added libfvad
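
For anyone else going this route, libfvad's API is small. A minimal sketch of wiring it in; the mode and frame length here are choices (it accepts 8/16/32/48 kHz, 10/20/30 ms frames, and modes 0-3):

    #include <fvad.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        Fvad *vad = fvad_new();
        fvad_set_sample_rate(vad, 16000);  /* 8/16/32/48 kHz supported */
        fvad_set_mode(vad, 2);             /* 0 (least) .. 3 (most aggressive) */

        int16_t frame[480] = {0};          /* 30 ms of 16 kHz PCM; all silence here */

        /* 1 = voice detected, 0 = silence, -1 = invalid frame length */
        int voiced = fvad_process(vad, frame, 480);
        printf("voiced: %d\n", voiced);

        fvad_free(vad);
        return 0;
    }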



