> I think something like 96GB RTX PRO 6000 Blackwells would be the minimum to run a model of this size with performance in the range of subscription models.
GLM 5.1 has 754B parameters tho. And you still need RAM for context on top of the weights. You'll want much more than 96GB of RAM.
H-1B is tied to employment, not to the employer. You can change employers on the same H-1B.

It's not great. But this is similar to how health insurance is tied to employment, not to a specific employer. Both citizens and H-1B employees experience the same abuse here.
> Every MCP server injects its full tool schemas into context on every turn
I consider this a bug. I'm sure the chat clients will fix this soon enough.
Something like: on each turn, a subagent searches the available MCP tools for anything relevant. Usually nothing helpful is found, and the regular chat continues without any MCP context added. Something like the sketch below.
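A minimal sketch of that routing idea, under loud assumptions: `Tool` and `route_tools` are made-up names, not part of any real MCP client, and the word-overlap scoring is a stand-in for what would realistically be an embedding search or a small router-model call:

```python
# Hypothetical per-turn tool routing: instead of injecting every MCP tool
# schema, score tools against the user's message and only surface the
# relevant ones. Nothing here is a real MCP client API.
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    schema: str  # full JSON schema, injected only if the tool is selected

def route_tools(user_message: str, tools: list[Tool],
                threshold: float = 0.2, top_k: int = 3) -> list[Tool]:
    """Crude relevance scoring by word overlap; a real client would more
    likely use embeddings or a cheap 'router' LLM call."""
    query = set(user_message.lower().split())
    scored = []
    for tool in tools:
        words = set(f"{tool.name} {tool.description}".lower().split())
        overlap = len(query & words) / max(len(query), 1)
        if overlap >= threshold:
            scored.append((overlap, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tool for _, tool in scored[:top_k]]

# Most turns route_tools(...) returns [], so the chat continues with
# no MCP schemas in context at all.
```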
I'll add to your comment that it isn't a bug of MCP itself. MCP doesn't specify what the LLM sees. It's a bug of the MCP client.
In my toy chatbot, I implement MCP as pseudo-Python for the LLM, dropping the typing info and giving the tool info as tersely as possible, just one line per tool: function_name(mandatory arg1 name, mandatory arg2 name): Description
(I don't recommend doing that; it's largely obsolete. My point is simply that you feed the LLM whatever you want, MCP doesn't mandate anything. Tbh it doesn't even mandate that it feeds into an LLM, hence the MCP CLIs.) The flattening looks roughly like the sketch below.
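A minimal sketch, assuming the usual shape MCP servers return from tools/list (a name, a description, and a JSON Schema under inputSchema); the `terse_signature` helper and the example tool are made up for illustration:

```python
# Flatten an MCP tool definition into the one-line format described above.
# The formatting choice (keeping only mandatory args) is mine, not
# anything MCP mandates.
def terse_signature(tool: dict) -> str:
    schema = tool.get("inputSchema", {})
    required = schema.get("required", [])  # keep only mandatory args
    args = ", ".join(required)
    return f'{tool["name"]}({args}): {tool.get("description", "")}'

# Hypothetical example tool:
tool = {
    "name": "search_files",
    "description": "Search the workspace for files matching a pattern",
    "inputSchema": {
        "type": "object",
        "properties": {"pattern": {"type": "string"},
                       "path": {"type": "string"}},
        "required": ["pattern"],
    },
}
print(terse_signature(tool))
# -> search_files(pattern): Search the workspace for files matching a pattern
```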
Yup, routing is key. Just like how we've had RAG so we don't have to add every biz doc to the context.
I agree with the general idea that models are better trained to use popular CLI tools like directory navigation etc., but outside of ls, ps, and the like, the difference isn't really there: new CLIs are just as confusing to the model as new MCPs.
I don’t think so. Without a list of tools in context, the AI can’t even know what options it has, so a RAG-like search doesn’t feel like it would be anywhere near as accurate.
Interesting pricing differential. It seems that in your country the IdeaPad is significantly cheaper than the US price, but for your MacBook Neo it's the other way around.
No idea. Maybe Lenovo factors purchasing power into its pricing for some reason, e.g. making more money in the U.S. while gaining market share here in Czechia, where purchasing power is lower. Apple may be able to afford not to do that.
But Qwen3.5 35B is worse than even Claude Haiku 4.5. You could switch your Claude Code to use Haiku and never hit rate limits. It also gets a similar ~50 tps.
I haven't tried Haiku 4.5 much, but I was not impressed with previous Haiku versions.

My go-to proprietary model in Copilot for general tasks is Gemini 3 Flash, which is priced the same as Haiku.

The Qwen model is, in my experience, close to Gemini 3 Flash, but Gemini Flash is still better.

Maybe it's somewhat related to what we're using them for. In my case I'm mostly using LLMs to code Lua. One case is a typed LuaJIT language and the other is a 3D framework written entirely in LuaJIT.

I forget exactly how many tps I get with Qwen, but GLM 4.7 Flash, which is really good (for a local model), gets me 120 tps and a 120k context.

Don't get me wrong, proprietary models are superior, but local models are getting really good AND useful for a lot of real work.