That works when you know what you’re going to say. A human knows when you’re pausing to think, but have a thought you’re in the middle of expressing. A VAD doesn’t know this and would interrupt when it hears a silence of N seconds; a lightweight LLM would know to keep waiting despite the silence.
And the inverse: the VAD would wait longer than necessary after a person says e.g. "What do you think?", in case they were still in the middle of talking.