That's a perfectly valid idea in theory, but in practice you'll run into a few painful trade-offs, especially in multi-user environments. Trust me, I'm doing exactly that right now in our fairly limited exploration of how we can leverage local LLMs at work (an SME).
Unless you have sufficient VRAM to keep all potential specialized models loaded simultaneously (which negates some of the "lightweight" benefit for the overall system), you'll be forced into model swapping. Constantly loading and unloading models to and from VRAM is a notoriously slow process.
If you have concurrent users with diverse needs (e.g., a developer requiring code generation and a marketing team member needing creative text), the system would have to swap models in and out if they can't co-exist in VRAM. This drastically increases latency before the selected model even begins processing the actual request.
The latency from model swapping directly translates to a poor user experience. Users, especially in an enterprise context, are unlikely to tolerate waiting for a minute or more just for the system to decide which model to use and then load it. This can quickly lead to dissatisfaction and abandonment.
This external routing mechanism is, in essence, an attempt to implement a sort of Mixture-of-Experts (MoE) architecture manually and at a much coarser grain. True MoE models (like the recently released Qwen3-30B-A3B, for instance) are designed from the ground up to handle this routing internally, often with shared parameter components and highly optimized switching mechanisms that minimize latency and resource contention.
To mitigate the latency from swapping, you'd be pressured to provision significantly more GPU resources (more cards, more VRAM) to keep a larger pool of specialized models active. This increases costs and complexity, potentially outweighing the benefits of specialization if a sufficiently capable generalist model (or a true MoE) could handle the workload with fewer resources. And a lot of those additional resources would likely sit idle for most of the time, too.
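To make the trade-off concrete, here's roughly what that external routing loop looks like (a minimal sketch assuming a local Ollama server; the model names, task labels, and helper functions are placeholders, not our actual setup):

```python
# Sketch of coarse-grained external routing: a small classifier model picks a
# task label, then the request is dispatched to a specialist model.
# Assumes a local Ollama server on the default port; names are placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

SPECIALISTS = {
    "code": "qwen2.5-coder:14b",
    "creative": "gemma3:12b",
    "general": "llama3.1:8b",
}

def ask(model: str, prompt: str) -> str:
    """Single non-streaming generation call against the Ollama API."""
    r = requests.post(OLLAMA_URL, json={"model": model, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

def route(user_prompt: str) -> str:
    # The lightweight router model classifies the request (a few seconds).
    label = ask(
        "gemma3:4b",
        "Classify this request as one of: code, creative, general. "
        "Reply with the single word only.\n\n" + user_prompt,
    ).strip().lower()
    specialist = SPECIALISTS.get(label, SPECIALISTS["general"])
    # If the specialist isn't already resident in VRAM, the server has to unload
    # the old model and load this one here; that swap is the expensive part.
    return ask(specialist, user_prompt)
```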
Have you looked into semantic router?
It would be a faster way to look up the right model for the right task.
I agree that using an LLM for routing isn't great: it costs money, takes time, and can often take the wrong route.
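The core trick is just embedding similarity against a few example utterances per route, so no LLM call is needed at request time. A rough sketch of the idea (using sentence-transformers directly rather than the semantic-router package; the routes, example phrases, and encoder choice are made up):

```python
# Embedding-based routing sketch: score the query against example utterances
# per route and pick the closest one. No generation call involved.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly encoder

ROUTES = {
    "code": ["write a python function", "fix this bug", "refactor this class"],
    "creative": ["draft a product blurb", "write a catchy tagline"],
    "general": ["summarize this email", "what does this acronym stand for"],
}

# Pre-compute normalized embeddings once at startup.
ROUTE_VECS = {
    name: encoder.encode(examples, normalize_embeddings=True)
    for name, examples in ROUTES.items()
}

def pick_route(query: str) -> str:
    q = encoder.encode([query], normalize_embeddings=True)[0]
    # Cosine similarity (dot product of normalized vectors); best match per route.
    scores = {name: float(np.max(vecs @ q)) for name, vecs in ROUTE_VECS.items()}
    return max(scores, key=scores.get)

print(pick_route("can you clean up this pandas script?"))  # likely "code"
```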
Semantic router is on my radar, but I haven't had a good look at it yet. The primary bottleneck in our current setup isn't really the routing decision time. The lightweight LLM I chose (Gemma3 4B) handles the task identification fairly well in terms of both speed and accuracy, from what I've found.
For some context: this is a fairly limited exploratory deployment which runs alongside other priority projects for me, so I'm not too obsessed with optimizing the decision-making time. Those three seconds are relatively minor when compared with the 20–60 seconds it takes to unload the old and load a new model.
I can see semantic router being really useful in scenarios built around commercial, API-accessed models, though. There, it could yield significant cost savings by, for example, intelligently directing simpler queries to a less capable but cheaper model instead of the latest and greatest (and likely significantly more expensive) model users might feel drawn to. You're basically burning money if you let your employees use Claude 3.7 to format a .csv file.
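As a toy illustration of that kind of cost gating (model names, prices, and the keyword heuristic are all hypothetical; in practice you'd let something like semantic router make the call):

```python
# Hypothetical cost-aware dispatch for API-backed models: obviously simple,
# mechanical requests go to a cheap tier instead of the flagship model.
TIERS = {
    "cheap": {"model": "small-fast-model", "usd_per_mtok": 0.25},
    "premium": {"model": "frontier-model", "usd_per_mtok": 15.00},
}

SIMPLE_HINTS = ("format", "convert", "extract", "rename", "csv")

def pick_tier(prompt: str) -> str:
    lowered = prompt.lower()
    # Crude heuristic: short, mechanical requests don't need the premium tier.
    if len(prompt) < 400 and any(hint in lowered for hint in SIMPLE_HINTS):
        return "cheap"
    return "premium"

print(pick_tier("Please format this .csv so the dates are ISO 8601"))  # "cheap"
```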
Good idea. Then you could place another lighter-weight model in front of THAT, to figure out which model to use in order to find out which model to use.
My guess is that this is basically what AI providers are slowly moving towards. And it's what models seem to be doing under the hood as well now, with Mixture of Experts (MoE).
I mean, the general-purpose models already do this in a way, routing to a selected expert. It's a pretty fundamental concept from ensemble learning, which is effectively what MoE is.
I don't see any reason you couldn't stack more layers of routing in front, to select the model. However, this starts to seem inefficient.
I think the optimal solution will eventually be companies training and publishing hyper-focused expert models that are designed to be used alongside other models and a router.
Then interface vendors can purchase different experts and assemble the models themselves, like how a phone manufacturer purchases parts from many suppliers, even its competitors, in order to create the best final product. The bigger players (Apple, for this analogy) might make more parts in house, but teardowns show that even the latest iPhone still contains Samsung chips.