While I am rooting for Mistral, having access to a diverse set of models is the killer app IMHO. Sometimes you want to code. Sometimes you want to write. Not all models are made equal.
Tbh I think the one general model approach is winning. People don't want to figure out which model is better at what unless it's for a very specific task.
IMHO people want to interact with agents that do things, not with models that chat. And agents are by definition specialised, which means a specific model; Mistral might not be a good fit for every type of task, just like the top-of-the-line models aren't the best choice for everything.
That’s a perfectly valid idea in theory, but in practice you’ll run into a few painful trade-offs, especially in multi-user environments. Trust me, I'm currently doing exactly that in our fairly limited exploration of how we can leverage local LLMs at work (SME).
Unless you have sufficient VRAM to keep all potential specialized models loaded simultaneously (which negates some of the "lightweight" benefit for the overall system), you'll be forced into model swapping. Constantly loading and unloading models to and from VRAM is a notoriously slow process.
If you have concurrent users with diverse needs (e.g., a developer requiring code generation and a marketing team member needing creative text), the system would have to swap models in and out if they can't co-exist in VRAM. This drastically increases latency before the selected model even begins processing the actual request.
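To make that cost concrete, here's a rough sketch of the naive "pick a model per request" pattern against a local Ollama server (default port assumed; the model tags are just examples, not a recommendation):

```python
# Toy illustration of per-request model selection against a local Ollama server.
import time
import ollama

def ask(model: str, prompt: str) -> str:
    t0 = time.perf_counter()
    resp = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    elapsed = time.perf_counter() - t0
    # If the model wasn't already resident in VRAM, this time includes the
    # unload/load cycle, not just generation.
    print(f"{model}: {elapsed:.1f}s")
    return resp["message"]["content"]

# Two users with different needs arrive back to back. If both models don't fit
# in VRAM together, every alternation pays the full swap penalty up front.
ask("qwen2.5-coder:14b", "Write a function that validates ISO 8601 timestamps.")
ask("mistral-nemo:12b", "Draft a short product announcement for our new plugin.")
```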
The latency from model swapping directly translates to a poor user experience. Users, especially in an enterprise context, are unlikely to tolerate waiting for a minute or more just for the system to decide which model to use and then load it. This can quickly lead to dissatisfaction and abandonment.
This external routing mechanism is, in essence, an attempt to implement a sort of Mixture-of-Experts (MoE) architecture manually and at a much coarser grain. True MoE models (like the recently released Qwen3-30B-A3B, for instance) are designed from the ground up to handle this routing internally, often with shared parameter components and highly optimized switching mechanisms that minimize latency and resource contention.
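For contrast, a token-level MoE layer does its routing inside a single forward pass, with all expert weights already resident. A toy sketch of top-1 gating (illustrative PyTorch only, not how any particular model implements it):

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    # Illustrative top-1 gated mixture of small expert MLPs. Routing happens
    # per token inside one forward pass; nothing is loaded or unloaded.
    def __init__(self, dim: int = 64, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])

    def forward(self, x):                      # x: (tokens, dim)
        scores = self.gate(x)                  # (tokens, n_experts)
        top1 = scores.argmax(dim=-1)           # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():
                out[mask] = expert(x[mask])    # only the selected expert runs
        return out

x = torch.randn(8, 64)
print(ToyMoELayer()(x).shape)  # torch.Size([8, 64])
```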
To mitigate the latency from swapping, you'd be pressured to provision significantly more GPU resources (more cards, more VRAM) to keep a larger pool of specialized models active. This increases costs and complexity, potentially outweighing the benefits of specialization if a sufficiently capable generalist model (or a true MoE) could handle the workload with fewer resources. And a lot of those additional resources would likely sit idle for most of the time, too.
Have you looked into semantic router?
It will be a faster way to look up the right model for the right task.
I agree that using an LLM for routing is not great: it costs money, takes time, and can often take the wrong route.
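For anyone curious, the core trick behind semantic routing is just embedding similarity against a handful of example utterances per route, so the decision is one small encoder pass instead of an LLM call. A minimal sketch of that idea (this is not the semantic-router library's actual API; the route names and examples are made up):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical routes: a few example utterances per target model.
routes = {
    "code-model": ["write a python function", "fix this bug", "refactor this class"],
    "writing-model": ["draft a blog post", "rewrite this paragraph", "write marketing copy"],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
route_embeddings = {name: encoder.encode(examples, convert_to_tensor=True)
                    for name, examples in routes.items()}

def pick_route(query: str) -> str:
    q = encoder.encode(query, convert_to_tensor=True)
    # Score each route by its best-matching example utterance.
    scores = {name: util.cos_sim(q, emb).max().item()
              for name, emb in route_embeddings.items()}
    return max(scores, key=scores.get)

print(pick_route("why does this traceback mention a KeyError?"))  # -> code-model
```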
Semantic router is on my radar, but I haven't had a good look at it yet. The primary bottleneck in our current setup isn't really the routing decision time. The lightweight LLM I chose (Gemma3 4B) handles the task identification fairly well in terms of both speed and accuracy, from what I've found.
For some context: this is a fairly limited exploratory deployment which runs alongside other priority projects for me, so I'm not too obsessed with optimizing the decision-making time. Those three seconds are relatively minor when compared with the 20–60 seconds it takes to unload the old and load a new model.
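For reference, the task-identification step is essentially a one-word classification prompt to the small model. A minimal sketch assuming a local OpenAI-compatible endpoint (Ollama exposes one at /v1); the category list and model tag are placeholders rather than my exact setup:

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server, e.g. Ollama's /v1 endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

CATEGORIES = ["code", "writing", "data", "general"]

def classify(query: str) -> str:
    resp = client.chat.completions.create(
        model="gemma3:4b",
        messages=[
            {"role": "system",
             "content": f"Classify the user request into exactly one of: {', '.join(CATEGORIES)}. "
                        "Answer with the single category word only."},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().lower()
    return answer if answer in CATEGORIES else "general"
```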
I can see semantic router being really useful in scenarios built around commercial, API-accessed models, though. There, it could yield significant cost savings by, for example, intelligently directing simpler queries to a less capable but cheaper model instead of the latest and greatest (and likely significantly more expensive) model users might feel drawn to. You're basically burning money if you let your employees use Claude 3.7 to format a .csv file.
Good idea. Then you could place another lighter-weight model in front of THAT, to figure out which model to use in order to find out which model to use.
My guess is that this is basically what AI providers are slowly moving towards. And it's what models seem to be doing under the surface now as well, with Mixture of Experts (MoE).
I mean, the general-purpose models already do this in a way, routing to a selected expert. It's a pretty fundamental concept from ensemble learning, which is effectively what MoE models are.
I don't see any reason you couldn't stack more layers of routing in front, to select the model. However, this starts to seem inefficient.
I think the optimal solution will eventually be companies training and publishing hyper-focused expert models that are designed to be used alongside other models and a router.
Then interface vendors can purchase different experts and assemble the models themselves, like how a phone manufacturer purchases parts from many suppliers, even their competitors, to create the best final product. The bigger players (e.g. Apple, for this analogy) might make more parts in house, but teardowns of even the latest iPhone still show Samsung chips inside.
Same here. Since I started using LLMs a bit more, the killer step for me was to set up API access to a variety of providers (Mistral, Anthropic, Gemini, OpenAI) and use a unified client to access them. I'm usually coding at the CLI, so I installed 'aichat' from GitHub and it does an amazing job: switch models on the fly, switch between one-shot and session mode, log everything locally for later access, and ask casual questions with a single quick command.
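If you'd rather script it than use a CLI tool, the same unified-access idea works by pointing the OpenAI SDK at different base URLs, since OpenAI and Mistral expose OpenAI-compatible chat endpoints (other providers vary). A rough sketch; the model names are just examples and keys come from environment variables:

```python
# Sketch of "one client, many providers" via OpenAI-compatible endpoints.
import os
from openai import OpenAI

PROVIDERS = {
    "openai":  {"base_url": "https://api.openai.com/v1",  "key_env": "OPENAI_API_KEY"},
    "mistral": {"base_url": "https://api.mistral.ai/v1",  "key_env": "MISTRAL_API_KEY"},
    # Add others the same way if they offer an OpenAI-compatible endpoint.
}

def ask(provider: str, model: str, prompt: str) -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(base_url=cfg["base_url"], api_key=os.environ[cfg["key_env"]])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("mistral", "mistral-small-latest", "One-liner: what does 'set -euo pipefail' do?"))
```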
I think all providers guarantee that they will not use your API inputs for training; it's meant as the pro version, after all.
Plus it's dirt cheap, I query them several times per day, with access to high end thinking models, and pay just a few € per month.
Gemini's free tier will absolutely use your inputs for training [1], same with Mistral's free tier [2]. Anthropic and OpenAI let you opt into data collection in exchange for discounted prices or free tokens.
Yeah, I mean paid API access. You put a credit card in, and it's peanuts at the end of the month. Sorry I didn't specify. Good reminder that with free services you are the product!