
I've been using Devstral2 with great success for a few months now. The hosted version, not running one locally or anything like that. Devstral is open.

Devstral is good, Opus is better. But not by much. For me, "good" is "good enough". The difference, IME, lies in context engineering: skills, agents.md, subagents, tools, prompts. A Devstral with good skills performs far better than a "blank" Claude Code. Claude with good skills performs even better, but hardly noticeably, IME.

I am convinced I've plateaued. Better performance comes from improving skills and other "memory", prompting smarter, better context management and, above all, from the tooling around it and the stability of the services.

I do still run Claude with Opus alongside Mistral with Devstral2. Sometimes just to compare outputs, but mostly to double-check my claim that the difference between Devstral2 and Opus is marginal and easily covered by better context engineering.


Perhaps. I’d like to like Devstral because I’d rather give my money to a European business.

My experience with it in an existing codebase has been that it gets to results much more reliably than Gemini Flash or Haiku, but it will cut corners and write incomprehensible code even with a good Opus plan to boot.

It’s true that the context and tooling might help, but setting everything up and finding the arcane mix of correct MCPs/skills is a job in itself right now. What I do know is that I’ve wasted months trying to get good code out of Gemini and Devstral2, and a good experience out of stuff like OpenCode and everything under the sun.


> is a job in itself right now.

Yes, exactly. I consider this the core of my job now: herding agents.

It reminds me very much of the time when I "herded" juniors, interns and new hires.

And my experience is that OpenCode et al. don't do a "good enough" job. It's better than, e.g., Devstral2, but without guidance still not sufficient. I think that mostly has to do with a combination of my experience and standards, and with my languages and niches.

All of them are good enough for throwing out React spaghetti, the kind you'd expect from Fiverr or from an intern: don't look under the hood, just drive it (launch it and leave it). Claude is far better at such a "benchmark" than, e.g., Devstral2.

But when I need a hexagonally-architected, TDD- and BDD-covered microservice in Python with zero type warnings, all models fail spectacularly out of the box. I presume their training corpus isn't "used to" such patterns: it's statistically unlikely to care about type warnings in Python (wink). Just like it's statistically unlikely to write a few files of TypeScript for a feature instead of pulling in a Node package. It turns out, especially with Claude Code, that it's statistically likely to comment out failing tests when the rule is "ensure all tests pass" and this one is hard to fix¹.

So to get the level we require, I need tons of rules, guidelines, skills and whatnot. On every model. So I might just as well, indeed, pipe my money into an EU company that's cheaper and has the option of self-hosting when the s* starts hitting the fan.

--- ¹ I think I finally found the "context" to fix this, though. What I used to tell my interns/juniors is to take a step back and re-think the shape of things: a difficult or complex test usually means the code it is testing needs re-architecting. That's something most agents will refuse to do, and rightly so, because it would side-track them. My solution is to tell agents to stop, document the problem, and, if obvious, the solution as well, in a dedicated "technical debt" markdown file. Later I'll point another agent at this file and tell it to start fixing the entries one at a time.
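For the curious, here is a minimal Python sketch of the hexagonal port/adapter boundary I mean (all names invented for illustration): the use case depends only on a Protocol, so it stays fully typed and TDD-friendly without any infrastructure.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


class OrderRepository(Protocol):
    """Port: the domain only knows this interface, never the database."""

    def get(self, order_id: str) -> Order | None: ...
    def save(self, order: Order) -> None: ...


class InMemoryOrderRepository:
    """Adapter: a trivial implementation, handy for fast TDD cycles."""

    def __init__(self) -> None:
        self._orders: dict[str, Order] = {}

    def get(self, order_id: str) -> Order | None:
        return self._orders.get(order_id)

    def save(self, order: Order) -> None:
        self._orders[order.order_id] = order


def apply_discount(repo: OrderRepository, order_id: str, percent: int) -> Order:
    """Use case: pure domain logic, testable without any infrastructure."""
    order = repo.get(order_id)
    if order is None:
        raise KeyError(order_id)
    discounted = Order(order.order_id, order.total_cents * (100 - percent) // 100)
    repo.save(discounted)
    return discounted
```

Swapping the in-memory adapter for a Postgres-backed one later doesn't touch the use case; that's exactly the kind of boundary agents won't produce unless told to.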


I agree with all you’ve said.

Gemini loves deleting tests as well, and all of them will relentlessly stub things to make unit tests ‘easy’.

What experience brought me is knowing where to steer them, e.g. scrapping all their shitty glue code and hand-holding Sonnet into implementing classes, DI, and unit tests that aren’t brittle at all. In that way, the agents have been nice to work with: they remind us of why cleaner code and good practices make for maintainable code. I hate their React spaghetti, but most places I’ve worked had tons of React spaghetti anyway…

All of this said: I actually miss steering juniors instead. Humans are frustrating to work with, but they are also adaptable, grow with time, and are… you know, human.

Mentoring Claude isn’t exactly fun or rewarding, in the way mentoring a colleague would be. And thankfully we have memory MCP servers, otherwise it would be like mentoring a brand new intern every time you fire up Claude.


Someone just asked me what I dislike most about Mistral and about Claude Code.

I run both in the Zed editor. Claude Code's integration is subpar: its ACP does not report tasks, doesn't give diffs, and so on.

Mistral has rate limits that I hit far too often; when that happens, the agent just stops with an error. I'm now using Mistral Pro, where this is worse; pay-as-you-go is better, but costs me 10x what Pro does.


Yes. It strikes me as odd how many people will put forward Python with the argument of "simplicity".

It is not. Simple. It may be "easy" but easy != simple (simple is hard, I tend to say).

I'm currently involved in a project that was initially laid out as microservices in Rust and some Go, to slowly replace a monolithic Django monstrosity with 12+ years of tech debt.

But the new hires are pushing back and re-introducing Python, with that argument of simplicity. Sure, Python is much easier than a Rust equivalent, especially in the early phases. But to me, a 25+ year developer/engineer, yet new to Python, it's unbelievably complex. Yes, uv solves some of that. As do ty and ruff. But, my goodness, what a mess to set up simple CI pipelines, or a local development machine (one that doesn't break my OS or other software on it). Hell, even the Dockerfiles are orders of magnitude more complex than most others I've encountered.


I am not following the difficulties you mention. Setting up a local dev environment in Python is trivial with uv.

The only major downside of Python is its somewhat poor module system; there's nothing as seamless as Cargo.

Beyond that the code is a million times easier to understand for a web app.


Again, "easy" is not the same as "simple".

"Trivial" falls in the "easy" category. So it may not be hard to do. But what uv makes "easy" is managing something very complex under the hood.

Better example:

    FROM python:3.9-slim
    WORKDIR /app
    COPY requirements.txt .
    RUN pip install -r requirements.txt
    COPY . .
    CMD ["python", "app.py"]

While "easy", it is nowhere near simple. Aside from the entire complexity of the Docker stack, that `python:3.9-slim` is itself very complex. It installs over 20 "dev" packages (from bluetooth via tk to xz), downloads source files, builds a Python runtime (patches it?), installs pip and setuptools, does some (to Python people probably familiar?) "wheel" stuff, etc¹. Point being: what you end up with, while easy to get, is very complex.

uv manages a runtime, some virtual environment to hot-swap that with other runtimes, it hooks into a package manager, manages additional tools (linter, typechecker, lsp, etc) and so on. What lies under that is very complex.

¹ I am well aware that node, ruby, php are quite similar.


I was wondering this as well: Why did OP look for VC?

In my case, I've used a similar strategy of keeping costs under €100/month. (But have sold, or stopped my ventures before hitting such MRRs as OP reports).

I raised some capital to pay my own bills during development. But mostly to hire freelancers to work on parts that I'm bad at, or didn't have time for: advertising, a specific feature, a library, rewrite-in-rust (wink) or deep research into functional improvements.


I always thought I had to add a swap file to avoid crashing with OOM. I wasn't aware of the cold pages overhead.

Sometimes that crashing is exactly what I want: a dedicated server running one (micro)service in a system that'll spin up new servers on such crashes (e.g. Kubernetes-like). I'd rather have it crash immediately than chug along in a degraded state.

But on a shared setup like OP shows, or the old LAMP-on-a-VPS, I'd prefer the system to start swapping and have a chance to recover. IME it quite often does. It will take a few minutes (of near downtime), but it avoids data corruption or crash-loops much more easily.

Basically, it's letting Linux handle recovery vs. letting a monitoring system handle recovery.


Plotting churn against complexity is far more useful than churn alone.

It shows problematic places much better. High churn, low complexity: fine. It's recognized and optimized, because it's worked on a lot (e.g. some mapping file, a DSL, business rules, etc.). Low churn, high complexity: fine too. It's a mess, but no one needs to go in there. But both? That's probably where most bugs originate, where PRs get blocked, where test coverage is poor, and where everyone knows time is needed to refactor.

In fact, quite often I found that a team's call "to rewrite the app from scratch" was really about those few high-churn, high-complexity modules, files or classes.

Complexity is a deep topic, but even simple checks, like how deeply nested something is or how many statements it has, can do a lot.
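A rough Python sketch of the idea, under assumptions of my own: commit counts from `git log` as churn, maximum indentation depth as a (crude) complexity stand-in, and made-up thresholds. Function names are invented for illustration.

```python
import subprocess
from collections import Counter
from pathlib import Path


def churn_per_file(repo: str) -> Counter:
    """Churn: how often each file appears in the commit history."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line for line in log.splitlines() if line.strip())


def max_nesting(path: Path, indent: int = 4) -> int:
    """Crude complexity stand-in: deepest indentation level in the file."""
    depth = 0
    for line in path.read_text(errors="ignore").splitlines():
        stripped = line.lstrip(" ")
        if stripped:
            depth = max(depth, (len(line) - len(stripped)) // indent)
    return depth


def hotspots(repo: str, min_churn: int = 10, min_depth: int = 4) -> list[str]:
    """Files that are BOTH high-churn and high-complexity: refactor candidates."""
    return sorted(
        f for f, churn in churn_per_file(repo).items()
        if churn >= min_churn
        and (p := Path(repo, f)).is_file()
        and max_nesting(p) >= min_depth
    )
```

Real tools use proper complexity metrics (cyclomatic, cognitive), but even this crude version tends to point at the same few files everyone already complains about.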


I once tutored an intern who thought he was The Best Programmer On Earth (didn't we all, at that age?). He refused to use revision control; it slowed him down.

So we told him to commit at least once every day, with a relevant commit message, or else fail his internship.

He worked 21 more days. There were 21 commits: "17:00, time to go home".


This reads like the intern was left to his own devices and his output not checked at all for three weeks straight. Actual tutoring would have surfaced the issue after 1 or 2 days tops.


Oh, but I had a daily sit-down with him and addressed this and other issues several times.

The problem was, as I mentioned, that he thought he was the best developer ever. Stubborn as hell. He pushed back against anything he wasn't used to, anything that wasn't his usual "SSH or FTP into prod and change stuff until it works", because he thought that was the only method that worked.

This was my first encounter with a self-proclaimed "10x" developer, before that term existed. Someone who, as seen by far-off management, seemingly had high output, but in reality just created a trail of tech debt and work for the rest of the team.


Dismissed.

But the gold price has been rising (on average) a lot over the period from July 2025 to January 2026.


From the annual report, it looks like the headline number (XXB gain) is just a realized capital gain (which, due to their reporting requirements, appears in the annual report, unlike unrealized gains).

They hold roughly the same amount of gold in both years, and it doesn't look like they took on extra market risk.


> As the price of gold continued to rise as they did this,

Seems counterintuitive to me. This would only make gains when they bought the new gold before selling the old, or when there's some arbitrage going on between Gold/USD, Gold/EUR and USD/EUR.

If they first sold the old for USD, then bought the new for USD, with a rising gold price, they'd miss the price-gain during the time between the trades, when they held the USD. It'd be a loss, not a gain.

If there's some arbitrage going on, then I highly doubt that brings $15B gain. The differences would have to be huge.

I think the (AI?) author writing that article is simply mixing things up. This gain is not a cause-and-effect of the conversion; it's merely the gain from rising gold prices on the gold they held over that period.


The source is a press conference where they state the total amount and total value of gold stored hasn't changed. In le figaro they report the profit is due to variation in price between the different transactions. Which seems to be a polite way to say they took exceptional risk.


> In le figaro they report the profit is due to variation in price between the different transactions. Which seems to be a polite way to say they took exceptional risk.

Nah it's just regular realized gain (delta between acquisition price and selling price).

https://www.banque-france.fr/fr/actualites/resultats-2025-de...

(so it's kinda irrelevant, it's just they have to put it in their books)


They repatriated 129 tonnes in total; it was absolutely impossible to make $15B from that, since that's roughly what 129 tonnes are worth in total.


They didn't repatriate the gold in the sense of physically moving it from the US to France. Instead, they sold the gold that was held in the US and used the money raised to buy gold from other sources, which is held in France.

Different gold, and two financial transactions: that accounts for the financial gain.


Yes but the article implies that they somehow made 15B in profit by selling the gold in US and buying an equivalent amount which can’t be the case.


What happened was that

a) they bought the gold a long time ago for basically nothing and had it on their books valued at basically nothing

b) they sold it now (in the US) for around $15b and thus for accounting purposes realised a $15b gain

c) they bought it back (in France) for around $15b and will have it on the books now valued at $15b.

The fact that the gold price rose over the course of b) selling and c) buying doesn't matter (despite what the article implies). That the gold price rose between a) the original purchase and now b)c), that's what resulted in the profit.
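A back-of-the-envelope sketch of that accounting; all numbers are invented for illustration, not BdF's actual figures:

```python
# Illustrative only: book values and prices are made up, not BdF's actual figures.
book_value = 0.5e9         # gold acquired decades ago, carried at historical cost
sale_price = 15.5e9        # proceeds from selling the US-held bars today
repurchase_price = 15.6e9  # cost of equivalent bars bought in Europe shortly after

# The realized gain that hits the annual report comes almost entirely from
# decades of price appreciation since the original purchase...
realized_gain = sale_price - book_value          # 15.0e9

# ...while the price drift between selling and re-buying is tiny by comparison.
round_trip_cost = repurchase_price - sale_price  # 0.1e9
```

So the headline number measures a), not the b)/c) round trip, which matches the point above.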


Well, they had 129 tonnes in the US, which happens to be worth around $15B or so. Probably the author has no clue what they are talking about and grossly misinterpreted it.


> BdF Governor Francois Villeroy de Galhau said the decision to keep the new bars in Paris is “not politically motivated,” as the higher-standard gold bars it bought were traded on a European market.


Well they are probably just being diplomatic, there is no point in accidentally triggering the ape.


To be fair, it's an ongoing process, started in 2005 and due to finish in 2028. I doubt there was much political motivation (though the whole tariffs business probably made their job/decision easier when the gold price started diverging between the NY and European markets). At this point it was cheaper than flying the gold to CH for recasting.

(1784 tons moved to standardized holding over the years, 134 tons are now left to convert -- all stored in Paris)


"We do not do this as a political statement —we simply want our gold ingots to exist next week."

Still, a win does suggest a dumb process behind the trade, as the smart move would have been to hedge with options and/or futures.

But then again, maybe they did hedge the trade and it's just not the right time or place to report it.


> Workdays!

This is javascript, not Java.

In JavaScript something entirely new would be invented, to solve a problem that has long been solved and is documented in 20+ year old books on common design patterns. So we can all copy-paste `{ or: [{ days: 42, months: 2, hours: "DEFAULT", minutes: "IGNORE", seconds: null, timezone: "defer-by-ip" }, { timestamp: 17749453211*1000, unit: "ms"}]` without any clue as to what we are defining.

In Java, a 6000+ LoC ecosystem of classes, abstractions, dependency-injectables and probably a new DSL would be invented, so we can all say "over 4 Malaysian workdays".


But you know that the Java solution will continue working even after we no longer use the Gregorian calendar, after the collapse and annexation of Malaysia by some foreign power, and after we finally switch to a 4-day work week; so it'd be worth it.


It probably won’t work correctly from the get go. But it can be debugged everywhere so that’s good.


... and since it was architected to allow runtime injection-patching of events before they hit the enterprise service bus, everyone using this library must first set fourteen ENV vars in their profile, and provide a /etc/java/springtime/enterprise-workday-handling/parse-event-mismatch.jar.patch. Which should fix the bug for you.

You can find the patch files for your OSs by registering at Oracle with a J3EE8.4-PatchLibID (note, the older J3EE16-PatchLib-ids aren't compatible), attainable from your regional Oracle account-manager.


And at least one of those environment variables can contain template strings that are expanded with arguments from request headers when run under popular enterprise Java frameworks, and by way of the injection patching they could hot-load arbitrary code at runtime.

A joke should be funny though, not just a dry description of real life, so let's leave it at that. We've already taken it too far.


This isn’t even remotely funny.


I am laughing. I'm not even near the end of this thread.


In before someone thinks it's a joke: the most commonly used logging library in Java had LDAP lookups in format strings enabled by default (which, of course, resulted in a CVE).


JavaScript has Temporal. Not sure whether knowing what a "workday" is in each timezone is in its scope, but it's the much-needed, improved JS date API (granted, with limited support to date).

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


There's an extra digit in your timestamp.


Ah but you don't know how far in the future I intended it to be :)

