Hacker News | MrScruff's comments

> This is speculative, but I suspect that if we dropped one of the latest, most capable open-weight LLMs, such as GLM-5, into a similar harness, it could likely perform on par with GPT-5.4 in Codex or Claude Opus 4.6 in Claude Code.

Unless I'm misunderstanding what's being described here, running Claude Code with different backend models is pretty common.

https://docs.z.ai/scenario-example/develop-tools/claude

It doesn't perform on par with Anthropic's models in my experience.
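For context, the pattern the z.ai docs describe is just repointing Claude Code at an Anthropic-compatible endpoint via environment variables. A minimal sketch (the endpoint URL and variable names are as I recall them from those docs, so treat them as assumptions and verify before use):

```shell
# Sketch: run Claude Code against z.ai's Anthropic-compatible endpoint,
# so GLM models serve as the backend. Names/URL are assumptions from z.ai's docs.
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
export ANTHROPIC_AUTH_TOKEN="<your z.ai API key>"   # placeholder
claude   # Claude Code now routes requests to the configured endpoint
```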


> It doesn't perform on par with Anthropic's models in my experience.

Why do you think that is? Are Anthropic's models just better, or do they train the models to somehow work better with the harness?


It is now more common to improve models in agentic systems "in the loop" with reinforcement learning. Anthropic is very likely doing this in the backend to systematically improve the performance of their models specifically with their tools. I've done this with Goose at Block using more classic post-training approaches, because it was before RL really hit the mainstream as an approach for this.

If you want to look at some of the tooling and process for this, check out verifiers (https://github.com/PrimeIntellect-ai/verifiers), hermes (https://github.com/nousresearch/hermes-agent) and accompanying trace datasets (https://huggingface.co/datasets/kai-os/carnice-glm5-hermes-t...), and other open source tools and harnesses.
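To make the "in the loop" idea concrete, here is a minimal sketch of the kind of reward shaping such a setup might use: score each agent rollout by task success, with small penalties for failed tool calls and long trajectories. All names here are illustrative, not the actual API of verifiers or hermes:

```python
# Hypothetical sketch of reward shaping for agentic RL:
# keep high-reward tool-use traces for training. Illustrative only.
from dataclasses import dataclass

@dataclass
class Rollout:
    tool_calls: list      # sequence of (tool_name, ok: bool) pairs
    tests_passed: bool    # did the agent's final patch pass the test suite?
    steps: int            # total turns taken

def reward(r: Rollout, max_steps: int = 50) -> float:
    """Shaped reward: task success dominates, with small penalties
    for malformed tool calls and trajectory length."""
    score = 1.0 if r.tests_passed else 0.0
    failed = sum(1 for _, ok in r.tool_calls if not ok)
    score -= 0.05 * failed                               # discourage failed tool use
    score -= 0.01 * min(r.steps, max_steps) / max_steps  # mild length penalty
    return score

good = Rollout(tool_calls=[("edit", True), ("bash", True)], tests_passed=True, steps=8)
bad = Rollout(tool_calls=[("edit", False)] * 4, tests_passed=False, steps=40)
```

In practice the signal would come from a verifiable environment (test suites, typed checks), and the shaped scores would feed a policy-gradient update rather than a simple filter, but the structure is the same.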


Here’s an explicit example of the above from today using the above dataset: https://x.com/kaiostephens/status/2040396678176362540?s=46

It's a good question, I've wondered that myself. I haven't used GLM-5 with CC but I've used GLM-4.7 a fair amount, often swapping back and forth with Sonnet/Opus. The difference is fairly obvious - on occasions I've mistakenly left GLM enabled running when I thought I was using Sonnet, and could tell pretty quickly just based on the gap in problem solving ability.

They're just dumber. I've used plenty of models. The harness is not nearly as important.

The harness, if anything, matters more with those other models precisely because they are dumber. You can compensate for some of the stupidity (but by no means all of it) with a harness that does extra work Claude Code doesn't bother with, because it isn't necessary for Anthropic's own models.

I've found that on some projects maybe 70-80% of what can be done with Sonnet 4.6 in OpenCode can be done with a cheaper model like MiMo V2 Pro or similar. On others Sonnet completely outperforms. I'm not sure why. I only find Opus to be worth the extra cost maybe 5% of the time.

I also find OpenCode to be drastically better than Claude Code, to the extent that I'm buying OpenRouter API credits rather than Claude Max because Claude Code just isn't good enough.

I'm frankly amazed at what OpenCode can do with a few custom commands (just for common things like doing a quality review, etc.), and maybe an extra "agent" definition or two. For many projects even most of this isn't necessary. Often I just ask it to write an AGENTS.md that encapsulates a good development workflow, git branch/commit policy, testing and quality standards, and ROADMAP.md plus per milestone markdown files with phases and task tracking, and this is enough.
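For illustration, a stripped-down AGENTS.md along the lines described might look like this (the contents are a hypothetical sketch, not a prescribed format):

```markdown
# AGENTS.md

## Workflow
- Work from ROADMAP.md; each milestone has its own markdown file with phases and task tracking.
- Mark tasks done in the milestone file as you complete them.

## Git policy
- One branch per milestone; small, focused commits with descriptive messages.
- Never commit directly to main.

## Testing and quality
- Every change ships with tests; run the full suite before committing.
- Run the linter and fix all warnings before marking a task complete.
```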

I'm somewhat interested in these more involved harnesses that automate or enforce more, but I don't know that they'd give me much I don't already have, and I think they'd struggle to keep up with the state of the art compared to something less specific.


I've been playing with the open models since the original LLaMA leak. They're getting better over time, they're useful for tasks of moderate complexity, and it's just cool to have a binary blob of knowledge that you can run locally without an internet connection.

However you should manage your expectations. Whatever the benchmarks say, you'll quickly realise they're not at all competing with Sonnet let alone Opus. Even the largest open weights models aren't really doing that.


Haven't really tried GLM5 much but I've used 4.7 quite a bit and it was pretty far from competing with Sonnet at the time, although I saw claims online to the contrary.

Calling everyone you disagree with a 'bro' doesn't make your point any more convincing.

Chill bro it’s just a joke. Sensitive.

I would say thinking about the intended audience for your creative outlet is a good discipline, even if it's only one person. It often gives the project more focus, which helps with motivation and makes it more enjoyable.

Honestly a lot of useful software is ‘unimportant’ in the sense that the consequences of introducing a bug or bad code smell aren’t that significant, and can be addressed if needed. It might well be for many projects the time saved not reviewing is worth dealing with bugs that escape testing. Also, it’s entirely possible for software to be both well engineered and useless.

Exactly - not so much in "important" stuff.

Turns out there are whole categories of software where 'extremely fast and good enough' is what matters, even for skilled software developers.

I see a lot of people talk about 'insecure code' and while I don't doubt that's true, there's a lot of software development where security isn't actually a concern because there's no need for the software to be 'secure'. Maintainability is important I'll grant you.


Oh, finally someone who speaks the truth :D Yeah, security sucks everywhere, true. But when you grow a product over time, you fill the holes one by one as you start taking on water. With AI, your battleship built in 15 days will have so many holes that, good luck putting it to sea, it could sink in the first minute. As Moltbook has shown. I don't need a formal proof (which I don't actually have); I just need a counterexample, and the first big vibe-coded product I've seen was also the first gigantic security failure. Plain to see.

For sure, but I haven't written a single piece of software where security would ever be considered a factor. Not all software runs on the web, not all software deals with accounts etc.

I think this is too broad. If, for example, I get Claude to set up a fine tuning pipeline for rf-detr and it one shots it for me, what have I lost? A learning opportunity to understand the details of how to go about this process, sure. But you could argue the same about relying on PyTorch. Ultimately we all have an overarching goal when engaged in these projects and the learning opportunity might be happening at an entirely different level than worrying about the nuts and bolts of how you build component A of your larger project.


Yeah, I used to enjoy writing code, but after a while I realised what I actually enjoy more is creating tools that I (and other people) like to use. Now I can do that really quickly even with my very limited free time, at a higher level of abstraction, but it's still me designing the tool.

And despite the number of people telling me the code is probably awful, the tools work great and I'm happily using them without worrying about the code any more than I worry about the assembly generated by a compiler.

