Hacker News | supermdguy's comments

I’ve tried having one “big” task that I’m focusing on with active back-and-forth, while letting other Claude instances handle easier back-burner tasks that they can effectively one-shot. But I’ve noticed that often turns into me spending more time and focus than I’d like on tasks that aren’t actually that impactful. I think I still get more done than I would otherwise, but I haven’t found the best management strategy yet.

Yeah, that confused me. But the compression paper also doesn’t make a ton of sense as an explanation, since I doubt Google would have released it if it were actually such a competitive advantage over what other labs are doing. So I wonder what’s actually causing the price decrease.


Any other sources on the OpenAI claim? Regardless, it’ll be nice to have cheaper RAM.


Okay, this is really fun and mathematically satisfying. It could even be useful for tough bugs that are technically deterministic but where you don’t have precise reproduction steps.

Does it support running a test multiple times to get a pass probability for a single commit, instead of just pass/fail? I guess you’d also need to take the number of trials into account to update the Beta properly.


Yay, I had fun with it too!

IIUC, the way you'd do that right now is to repeatedly record individual observations on a single commit, which effectively gives it a probability plus the number of trials needed for the Beta update. I don't yet have a CLI entrypoint to record a batch observation of (probability, num_trials), but it would be easy to add one.
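For reference, the batch record reduces to the standard conjugate update: a Beta(α, β) prior on the test's pass probability, observed with k passes out of n trials, becomes Beta(α + k, β + n − k). A minimal sketch (function name is mine, not git_bayesect's actual API):

```python
def update_beta(alpha: float, beta: float, passes: int, trials: int) -> tuple[float, float]:
    """Conjugate update of a Beta(alpha, beta) prior on a test's
    pass probability, given `passes` successes out of `trials` runs."""
    return alpha + passes, beta + (trials - passes)

# Recording 10 trials at once is equivalent to 10 individual observations:
a, b = 1.0, 1.0  # uniform prior
a, b = update_beta(a, b, passes=7, trials=10)
assert (a, b) == (8.0, 4.0)
```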

But ofc part of the magic is that git_bayesect's commit selection tells you how to be maximally sample-efficient, so you'd only want to do a batch record if your test has high constant overhead.


Recompiling can be a high constant overhead.

In theory, the algorithm could deal with that by choosing, at each step, the commit that gives the best expected information gain divided by expected test time. In most cases, though, it would be more efficient just to cache the compiled output.
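That selection rule can be sketched as follows, using the entropy of the predicted test outcome as a rough proxy for expected information gain (and treating the test as deterministic for simplicity); `p_bad` and `test_time` are hypothetical stand-ins for whatever estimates the tool maintains:

```python
import math

def entropy(p: float) -> float:
    """Shannon entropy of a Bernoulli(p) outcome, in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def pick_commit(commits, p_bad, test_time):
    """Choose the commit maximizing expected information gain per unit
    of expected test time. p_bad[c] is the current posterior probability
    that commit c is past the breakage; test_time[c] would include
    constant overhead like recompilation."""
    return max(commits, key=lambda c: entropy(p_bad[c]) / test_time[c])

# A maximally uncertain commit (p = 0.5) with low test time wins out
# over an equally uncertain one that is 10x slower to test:
commits = ["a", "b", "c"]
choice = pick_commit(commits,
                     p_bad={"a": 0.5, "b": 0.9, "c": 0.5},
                     test_time={"a": 1.0, "b": 1.0, "c": 10.0})
assert choice == "a"
```

With uniform test times this reduces to plain max-entropy selection, which is why caching compiled outputs (flattening the time term) is usually the simpler fix.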

It's surprising that this works so well, considering that AI-generated AGENTS.md files have been shown to be not very useful. I think the key difference here is that real-world experience helps the agent reach regions of its latent space that wouldn't occur naturally through autoregression.

I wonder how much of the improvement is due to the agent actually learning new things vs. reaching parts of its latent space that enable it to recall things it already knows. Did the agent come up with novel RL reward design protocols based on trial and error? Or did the tokens in the environment cause it to "act smarter"?


> Nobody knows yet the true capabilities of the missile, but it doesn’t matter. The accuracy doesn’t matter very much, the payload doesn’t matter very much. If it’s launched at a certain target in Tel Aviv, it still is going to hit something in Tel Aviv. The Israelis have no choice but to attempt an intercept, and will spend millions to do so

Sounds like the massive price disparity more than makes up for any accuracy issues.


Clearly accuracy does matter. I just tried to throw a rock from my backyard to Tel Aviv; I missed terribly.

Iron Dome is about $100k per interception according to Wikipedia, so "millions" is off by an order of magnitude. I suspect they can make it cheaper with scale as well.

$100k is the cost of the low-speed ~Mach 2.2 Tamir interceptor, which is effective against shells and rockets but not going to intercept a maneuvering Mach 7 glide body.

There is absolutely no way anyone is producing a manoeuvring Mach 7 missile for $100,000, though.

The term "hypersonic" is incredibly overloaded.


Sure, I agree with that. I've not seen a booster stack at that price hit Mach 4-5 reliably; Mach 7 would be unprecedented.

Still important to clarify that HGVs are intended to defeat these cheaper intercept layers.


> Experimenting with ideas/refactors to see how they'll play out (often the agent can just tell you how it's going to play out)

This has helped me a lot. Normally I'd feel really attached to big refactors because of sunk costs, but when AI does a huge refactor it's easier to honestly decide that it wasn't worth it and just added unnecessary complexity.


> One might note that MCTS uses more inference compute on a per-sample basis than GRPO: of course it performs better

This part confused me: it sounded like they were only doing the MCTS at train time and then using GRPO to distill the MCTS policy into the model weights. So wouldn’t the model still have the same inference cost?


Ah, I meant that MCTS uses more inference-time compute (than GRPO) to produce a training sample.


I’ve actually written a crawler like that before, and I still ended up going with Firecrawl for a more recent project. There are just so many headaches at scale: OOMs from heavy pages, proxies for sites that block cloud IPs, handling nested iframes, etc.


I’ve been trying to learn a lot about domain-driven design; I think knowledge crunching will be a huge part of the new software development role.


