
> "[..] in developing our reasoning models, we’ve optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that better reflect how businesses actually use LLMs."

This is good news. OpenAI seems to be aiming towards "the smartest model," but in practice, LLMs are used primarily as learning aids, data transformers, and code writers.

Balancing "intelligence" with "get shit done" seems to be the sweet spot, and afaict one of the reasons the current crop of developer tools (Cursor, Windsurf, etc.) prefer Claude 3.5 Sonnet over 4o.



Thanks! We all dogfood Claude every day to do our own work here, and solving our own pain points is more exciting to us than abstract benchmarks.

Getting things done requires a lot of book smarts, but also a lot of "street smarts" - knowing when to answer quickly, when to double back, etc.


Just want to say nice job and keep it up. Thrilled to start playing with 3.7.

In general, benchmarks seem to be very misleading in my experience, and I still prefer Sonnet 3.5 for _nearly_ every use case - except massive text tasks, for which I use Gemini 2.0 Pro with the 2M token context window.


An update: "code" is very good. Just did a ~4 hour task in about an hour. It cost $3, which is more than I usually spend in an hour, but very worth it.


I find the webdev arena tends to match my experience with models much more closely than other benchmarks: https://web.lmarena.ai/leaderboard. Excited to see how 3.7 performs!


Could you tell us a bit about the coding tools you use and how you go about interacting with Claude?


We find that Claude is really good at test-driven development, so we often ask Claude to write tests first and then ask Claude to iterate against the tests.


Write tests (plural) first, as in write more than one failing test before making it pass?


Time to look up TDD, my friend.


One of today's lucky 10,000. His mind is about to expand beyond imagination.


I wish I could delete my original comment now that I found out that Kerric wasn't a lucky 10,000, he's just an asshole...


Well, you lucky-10,000'd people who didn't know about the 10,000 thing. That's not nothing.


Time to actually read Test-Driven Development By Example, my friend. Or if you can't stomach reading a whole book, read this: https://tidyfirst.substack.com/p/canon-tdd

TL;DR - If you're writing more than one failing test at a time, you are not doing Test-Driven Development.
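To make the "one failing test at a time" point concrete, here is a minimal sketch of the canon TDD loop in Python, using a hypothetical `add` function (names are illustrative, not from any commenter's codebase):

```python
import unittest


# Step 1: write ONE failing test describing the next bit of behavior.
class TestAdd(unittest.TestCase):
    def test_adds_two_numbers(self):
        self.assertEqual(add(2, 3), 5)


# Step 2: write just enough code to make that single test pass.
def add(a, b):
    return a + b


# Step 3: refactor if needed, then loop back to step 1 with the NEXT
# single test (e.g. one for negative numbers) - never a whole batch
# of failing tests written up front.

if __name__ == "__main__":
    unittest.main()
```

The point of the loop is that at every moment you have at most one red test, so you always know exactly which change broke or fixed what.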


oh my god, your comment was just a setup for you to be pedantic? all discourse on the internet is worthless. i don't know why i keep engaging.


Sometimes I wonder if there is overfitting towards benchmarks (DeepSeek is the worst for this to me).

Claude is pretty consistently the chat I go back to where the responses subjectively seem better to me, regardless of where the model actually lands in benchmarks.


> Sometimes I wonder if there is overfitting towards benchmarks

There absolutely is, even when it isn't intended.

The difference between what the model is fitting to and the reality it is used on is essentially every problem in AI, from paperclipping to hallucination, from unlawful output to simple classification errors.

(Ok, not every problem, there's also sample efficiency, and…)


Ya, Claude crushes the smell test


Claude 3.5 has been fantastic in Windsurf. However, it does cost credits. DeepSeek V3 is now available in Windsurf at zero credit cost, which was a major shift for the company. Great to have multiple options either way.

I’d highly recommend anyone check out Windsurf’s Cascade feature for agentic-like code writing and exploration. It helped save me many hours in understanding new codebases and tracing data flows.


DeepSeek’s models are vastly overhyped (FWIW I have access to them via Kagi, Windsurf, and Cursor - I regularly run the same tests on all three). I don’t think it matters that V3 is free when even R1 with its extra compute budget is inferior to Claude 3.5 by a large margin - at least in my experience in both bog standard React/Svelte frontend code and more complex C++/Qt components. After only half an hour of using Claude 3.7, I find the code output is superior and the thinking output is in a completely different universe (YMMV and caveat emptor).

For example, DeepSeek’s models almost always smash together C++ headers and code files even with Qt, which is an absolutely egregious error due to the meta-object compiler preprocessor step. The MOC has been around for at least 15 years and is all over the training data so there’s no excuse.


I've found DeepSeek's models are within a stone's throw of Claude. Given the massive price difference, I often use DeepSeek.

That being said, when cost isn't a factor Claude remains my winner for coding.


Hey there! I’m a fellow Qt developer and I really like your takes. Would you like to connect? My socials are on my profile.


We’ve already connected! Last year I think, because I was interested in your experience building a block editor (this was before your blog post on the topic). I’ve been meaning to reconnect for a few weeks now but family life keeps getting in the way - just like it keeps getting in the way of my implementing that block editor :)

I especially want to publish and send you the code for that inspector class and selector GUI that dumps the component hierarchy/state, QML source, and screenshot for use with Claude. Sadly I (and Claude) took some dumb shortcuts while implementing the inspector class that both couples it to proprietary code I can’t share and hardcodes some project specific bits, so it’s going to take me a bit of time to extricate the core logic.

I haven’t tried it with 3.7 but based on my tree-sitter QSyntaxHighlighter and Markdown QAbstractListModel tests so far, it is significantly better and I suspect the work Anthropic has done to train it for computer use will reap huge rewards for this use case. I’m still experimenting with the nitty gritty details but I think it will also be a game changer for testing in general, because combining computer use, gammaray-like dumps, and the Spix e2e testing API completes the full circle on app context.


Oh how cool! I'd love to see your block editor. A block editor in Qt C++ and QMLs is a very niche area that wasn't explored much if at all (at least when I first worked on it).

From time to time I'm toying with the idea of open sourcing the core block editor, but I don't really get into it since 1. I'm a little embarrassed by the current lack of modularity in the code and want to refactor it all, and 2. I still want to find a way to monetize my open source projects (so maybe AGPL with a commercial license?)

Dude, that inspector looks so cool. Can't wait to try it. Do you think it can also show how much memory each QML component is taking?

I'm hyped as well about Claude 3.7, haven't had the time to play with it on my Qt C++ projects yet but will do it soon.


The big difference is that DeepSeek R1 has a permissive license, whereas Claude has a nightmare "closed output" customer noncompete license, which makes it unusable for work unless you accept not competing with your intelligence supplier - which sounds dumb.


Do most people have an expectation of competing with Claude?


Some of the people who use Claude for coding work on products involving AI. I don't know what percentage, but I bet it's not trivial.


Seems like that must make it impossible for the Cursor devs to use their own product, given that Claude is the default there.


I've seen people switch from Claude to another model due to cost, notably DeepSeek. TBH I think it still depends on the data the model was trained on.


I'm working on an OSS agent called RA.Aid and 3.7 is anecdotally a huge improvement.

About to push a new release that makes it the default.

It costs money but if you're writing code to make money, it's totally worth it.


How is it possible that DeepSeek V3 would be free? It costs a lot of money to host models.



