A Claude Code setup that implements ML papers from arXiv. Give it a paper, and it orchestrates a team of AI agents to read the paper, plan the implementation, write the code, verify correctness, optimize performance, train, and compare results against the paper's claims.
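A minimal sketch of what that orchestration loop might look like. All names here (`run_agent`, the stage list, the arXiv ID) are hypothetical stand-ins, not the actual setup:

```python
# Hypothetical paper-to-code agent pipeline; stage names and run_agent()
# are illustrative only, not the actual Claude Code configuration.

def run_agent(role: str, context: dict) -> dict:
    # Placeholder: a real setup would call an LLM here with a
    # role-specific system prompt and tool access.
    return {**context, "log": context.get("log", []) + [role]}

STAGES = ["read_paper", "plan", "implement", "verify",
          "optimize", "train", "compare_to_paper"]

def implement_paper(arxiv_id: str) -> dict:
    context = {"paper": arxiv_id}
    for stage in STAGES:          # each stage sees the prior stage's output
        context = run_agent(stage, context)
    return context

result = implement_paper("2403.00000")  # made-up ID for illustration
print(result["log"])
```

The point of the pattern is just that each agent's output becomes the next agent's input, so the stages can be specialized independently.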
> Fast mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.
They failed to grasp the fundamental point of batching, which is sharing model weights between requests. For context, this wasn't just one person's mistake: several AI Twitter personalities proposed this 'Claude Opus fast = small batching' hypothesis. What I find funny is how confident these influencers were, while the people who actually work on LLM serving at frontier labs, the ones who genuinely understand this, said nothing. The rest is simply noise.
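A toy illustration of that weight-sharing point (NumPy stand-in for a model layer; sizes are arbitrary): the same weight matrix serves every request in the batch via a single matmul, so weights are read from memory once per step rather than once per request.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))   # model weights, shared by all requests

def forward_one(x):
    return x @ W                  # one full weight read per request

def forward_batched(X):
    return X @ W                  # one weight read for the whole batch

X = rng.normal(size=(8, 512))     # 8 concurrent requests
single = np.stack([forward_one(x) for x in X])
batched = forward_batched(X)
assert np.allclose(single, batched)  # identical outputs, amortized weight traffic
```

Since decoding is memory-bandwidth bound, amortizing the weight reads across the batch is where the throughput comes from; shrinking the batch gives each request more bandwidth, it doesn't duplicate the weights.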
If you ask someone knowledgeable at r/LocalLLaMA about an inference configuration that can increase TG (token generation speed) by *up to* 2.5x, particularly for a sample prompt like "*Refactor* this module to use dependency injection", then the answer is of course speculative decoding.
You don't have to work for a frontier lab to know that. You just have to be GPU poor.
I do think Claude Code as a tool gave Anthropic some advantages over others. They have plan mode, a todo list, askUserQuestion tools, hooks, etc., which greatly extend Opus's capabilities. I agree that others (Codex, Cursor) also quickly copy these features, but that's the nature of the race, and Anthropic has to keep innovating to maintain its edge.
The biggest advantage by far is the data they collect along the way. That data can be bucketed by real devs, and the signals extracted from it can be top tier. All that data + signals + whatever else they cook up can be added back into the training corpus and the models retrained / version++ on the new set. Rinse and repeat.
(this is also why all the labs, including some Chinese ones, are subsidising / me-too-ing coding agents)
(I work at Cursor) We have all these! Plan mode with a GUI + ability to edit plans inline. Todos. A tool for asking the user questions, which will be automatically called or you can manually ask for it. Hooks. And you can use Opus or any other models with these.
I did something similar in Python, in case people want to see a slightly different perspective (I was aiming for a minimal agent library with built-in tools, similar to the Claude Agent SDK):
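A hedged sketch of what such a minimal agent loop can look like. The fake model and tool registry below are illustrative only, not the commenter's actual library and not the Claude Agent SDK:

```python
import json

# Minimal agent loop: the "model" emits either a tool call or a final
# answer; the loop executes tools and feeds results back as messages.

TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def fake_model(messages):
    # Stand-in for a real LLM API call: request the add tool once, then finish.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The sum is {messages[-1]['content']}"}

def agent_loop(user_prompt, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        reply = model(messages)
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](reply["args"])   # execute the tool
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("agent did not finish")

print(agent_loop("What is 2 + 3?"))
```

Everything else in an agent library (built-in tools, retries, streaming) is layered on top of this one loop.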
For one, these models should be able to understand the physical world via images, audio, and video. I do agree that current models are quite good at coding, but that's mainly because coding is entirely text-based and easily verifiable. It's not obvious that this capability will transfer to other domains that aren't text-based and aren't as easily verifiable.
I'm not familiar with these open-source models. My bias is that they're heavily benchmaxxing and not really helpful in practice. Can someone with a lot of experience using these, as well as Claude Opus 4.5 or Codex 5.2 models, confirm whether they're actually on the same level? Or are they not that useful in practice?
P.S. I realize Qwen3-Max-Thinking isn't actually an open-weight model (only accessible via API), but I'm still curious how it compares.
I don't know where your impression about benchmaxxing comes from. Why would you assume closed models are not benchmaxxing? Being closed and commercial, they have more incentive to fake it than the open models.
You're not familiar with them, yet you claim a bias. Bias based on what? I've used pretty much only open-source models for the last two years. I occasionally give OpenAI and Anthropic a try to see how good they are, but I stopped supporting them when they started calling for regulation of open models. I haven't seen folks with closed models get ahead of me; I'm keeping up just fine with these free open models.
Yeah, I get that there's nuance between all of them. I ranked Minimax higher for its agentic capabilities; in my own usage, Minimax's tool calling is stronger than Deepseek's and GLM's.
My observation is that vibe-coded applications are significantly lower quality than traditional software. Anthropic's software (which they claim is 90% vibe coded) is extremely buggy, especially the UI.
That's a misunderstanding based on a loose definition of "vibe coding". When companies threw around the "90% of code is written by AI" claims, they were counting characters of autocomplete based on users actually typing code (most of which was equivalent to the "AI generated" code of Eclipse tab-completion a decade ago), and sometimes writing hyperlocal prompts for a single method.
We can identify 3 levels of "vibe coding":
1. GenAI Autocomplete
2. Hyperlocal prompting about a specific function (Copilot's original pitch)
3. Developing the app without looking at code.
Level 3 is hardly considered "vibe" coding, and Level 2 is iffy.
"90% of code written by AI" in some non-trivial contexts only very recently reached level 3.
I don't think it ever reached Level 2, because that's just a painfully tedious way of writing code.
They have not said that. They've only said that most of their code is written by Claude, which is different from "vibe coding". If competent engineers review the code, then it's little different from any other coding.
IIRC, the Claude Code creator mentioned that all the PRs are reviewed by humans, just like normal human PRs. So yes, humans still look at the code at the review stage. I still consider this to be level 3, but anyway, this is just a matter of definition.
I mostly work at level 2, and I call it "power coding", like power armor, or power tools. Your will and your hand still guides the process continuously. But now your force is greatly multiplied.