We're using llama.cpp. We run many different models besides Qwen3, and vLLM's startup time when switching models is prohibitively slow (several times slower than llama.cpp, which already takes 5 sec).
From what I understand, vLLM is best when there's only 1 active model pinned to the GPU and it serves many concurrent users (4, 8, etc.). But with just a single 32 GB GPU you have to switch models pretty often, and you can't fit more than 2 concurrent users anyway without sacrificing context length considerably (4 users = just 16k context each, 8 users = 8k each), so I think vLLM isn't worth it so far. Once we have several cards, we may switch to vLLM.
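
To make the context numbers concrete, here's a rough sketch of the arithmetic. It assumes the KV cache behaves like a fixed token budget split evenly across users. The 64k total is only what the 4-user and 8-user figures above imply, not a measured value, and `context_per_user` and `TOTAL_BUDGET` are made-up names for illustration.

```python
def context_per_user(total_tokens: int, users: int) -> int:
    """Split a fixed KV-cache token budget evenly across concurrent users."""
    if users < 1:
        raise ValueError("users must be at least 1")
    return total_tokens // users


# Assumption: 4 users x 16k = 8 users x 8k = 64k tokens of total budget.
TOTAL_BUDGET = 64 * 1024

for users in (1, 2, 4, 8):
    per_user_k = context_per_user(TOTAL_BUDGET, users) // 1024
    print(f"{users} users -> {per_user_k}k context each")
```

Under that assumption, 2 users would get roughly 32k each, which is about the lowest I'd want to go.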