jasondigitized · 2026-06-11T21:15:06 1781212506

A single 8h task? I'm sorry, but that's just asking for trouble.

queuebert · 2026-06-11T22:03:31 1781215411

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

whstl · 2026-06-12T00:19:57 1781223597

Different people just have different concepts of what's garbage and what's not.

There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima facie.

For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple of years ago and now half of each of their PRs is unused dead code.

gessha · 2026-06-12T17:12:31 1781284351

“One man’s trash is another man’s treasure” takes on a new meaning in today’s agentic coding world.

smoe · 2026-06-12T01:07:00 1781226420

I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.

Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.

And naturally I spend more time on manual verification in the end as much less of it is happening during the coding process.

queuebert · 2026-06-12T14:28:08 1781274488

> ... applying the same steps and rules n times

I do this too, with a document written for this purpose.

> ... a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals.

That is a good point. I'm mostly using C, which seemingly compiles in O(1) time, so I could imagine a large C++ or Rust codebase taking much longer to iterate simply due to compilation times.

okamiueru · 2026-06-12T15:50:48 1781279448

What do you mean by C compiling in O(1)? Is that what the LLM told you?

queuebert · 2026-06-12T18:27:45 1781288865

It's a joke about how fast it compiles. whoosh

culi · 2026-06-12T01:40:38 1781228438

> that basically consisted of applying the same steps and rules n times.

Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?

smoe · 2026-06-12T02:31:09 1781231469

In this case, handling all the edge cases and variants, and testing a codemod, would have taken significantly more of my time, which costs quite a bit more than the LLM.

Obviously, a deterministic tool is preferable in general, but it is not always worth bothering with for a one off task.

mashlol · 2026-06-12T04:37:44 1781239064

I usually make the LLMs do that part for me. Instead of asking the LLM to refactor, ask it to write the codemod script that'll refactor, have it test that script, and even have it run it on its own. It's definitely faster and less error-prone that way for me.
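
To make the codemod idea concrete, here is a minimal sketch of the kind of script one might ask for. The function names (`fetch_user`, `load_user`) and the helper `rename_call` are hypothetical, and a regex is a deliberate simplification: it also rewrites matches inside strings and comments, which a syntax-aware tool would avoid.

```python
import re


def rename_call(source: str, old: str, new: str) -> tuple[str, int]:
    """Rename `old(...)` to `new(...)`; return the new source and the edit count.

    Hypothetical helper for illustration only. It works on raw text, so it
    cannot tell code apart from strings or comments.
    """
    # Match `old` only as a whole identifier that is followed by "(" and is
    # not an attribute access such as obj.old(...).
    pattern = re.compile(rf"(?<![\w.]){re.escape(old)}(?=\s*\()")
    # A function replacement avoids backslash interpretation in `new`.
    return pattern.subn(lambda _match: new, source)


if __name__ == "__main__":
    before = "x = fetch_user(1)\ny = api.fetch_user(2)\nz = prefetch_user(3)\n"
    after, edits = rename_call(before, "fetch_user", "load_user")
    print(edits)   # 1
    print(after)
```

Because the script is deterministic, it can be tested once on a few representative files and then run over the whole codebase, which is the property the comment above is after.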

culi · 2026-06-12T18:27:19 1781288839

In that case, your original description of "basically consisted of applying the same steps and rules n times" was misleading.

beepbooptheory · 2026-06-12T01:56:18 1781229378

The money people spend on things I could probably do with an emacs macro...

eru · 2026-06-12T02:53:55 1781232835

Your time to create that macro ain't free.

ardacinar · 2026-06-12T08:01:33 1781251293

Neither is your time writing that prompt. When people are talking about elaborate prompts with a lot of detailed instructions, guardrails, etc., I'm kind of assuming it takes time.

jon_adler · 2026-06-12T04:45:04 1781239504

How about coding an emacs macro with your agent?

beepbooptheory · 2026-06-12T13:21:45 1781270505

I actually don't have any representation at the moment..

sunir · 2026-06-12T13:46:43 1781272003

Clear winner's circle. Clear objective. Clear scope.

Clear evaluation function for an objective metric of whether they are making progress or regressing.

Evaluation function is computed, not llmed.

Ontology of potential actions clearly specified.

Accurate inventory of the current status quo.

Clear enumeration of options from status quo towards the winner's circle.

Waypoint objectives with similarly concrete evaluations of pass/fail, or on target/off target.

It's the same thing when leading a large organization to actually hit a goal. There's randomness every turn away from your mind, so the more constrained the options, the more likely you are to hit the target. The consequence is if you're wrong about the plan then with people you're fucked. Morale will plummet. With AIs, they are so nerfed emotionally now, you clear context and start again.

I did enjoy Sonnet 4 when they would swear randomly and become sullen or wax desperately. That would at least cause pushback against a bad plan.

j16sdiz · 2026-06-12T05:00:13 1781240413

Fable promised to be better at long-running tasks.

The parent post has a goal of "..see how it will perform.."

There is nothing wrong with experimenting with something new.

standardUser · 2026-06-11T22:36:59 1781217419

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.

CuriouslyC · 2026-06-12T11:38:13 1781264293

If you're giving it 8 hours of stuff to create with a template (e.g. slop forking) that's not a big deal. Letting it run for 8 hours to debug a weird failure also tends to work out.

viccis · 2026-06-12T05:10:13 1781241013

This is my fucking life at work right now. I look forward to the weekends. I've never been truly inconvenienced by shitty devs because they're often too lazy to really spam me with bad code, but now they are all free to do so. I spent so much time today writing guardrail markdown files when these people SHOULD HAVE BEEN ABLE TO REVIEW THE OUTPUT AND KNOW THAT IT WAS BAD.

It truly is the age of the 90 IQ software engineer. They've never had it better.

duskdozer · 2026-06-12T10:30:38 1781260238

As if meetings weren't bad enough already, I now have to sit through an informal introduction to the model of the week and its personality characteristics and how quickly it burnt through one subscription's token allotment or whatever and the latest tweaks on the magic markdown files. Luckily I've only had a couple changes sent my way so far, which weren't much different than just getting a bug report to debug and fix myself. I will need to get into risky options gambling or something so I can go start my farm early, if it keeps going this way. Even supposing it all works correctly, I don't see how it is in any way enjoyable, satisfying, or fulfilling.

maxall4 · 2026-06-11T22:09:11 1781215751

Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/

nl · 2026-06-12T01:22:11 1781227331

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours"

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week it might be a day of human work, but it's still 5 times better than doing it by hand.

mordymoop · 2026-06-12T05:14:17 1781241257

This seems like the obviously correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
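
The "14 days and 2 hours" figure in the comment above can be reproduced with a little arithmetic, assuming an 8-hour work day, a 5-day work week, and two 3-hour runs (one failed, one successful). Those three numbers are assumptions, not stated in the comment. The second helper is a further simplification: it treats runs as independent with a fixed failure rate, which is pessimistic if adjusted instructions make the retry more likely to succeed.

```python
HOURS_PER_DAY = 8   # assumption: 8-hour work day
DAYS_PER_WEEK = 5   # assumption: 5-day work week


def time_saved(manual_hours: int, run_hours: int, runs: int) -> tuple[int, int]:
    """Return (work days, hours) saved versus doing the task by hand."""
    saved_hours = manual_hours - run_hours * runs
    return divmod(saved_hours, HOURS_PER_DAY)


def expected_runs(failure_rate: float) -> float:
    """Mean number of runs until the first success, assuming independent runs."""
    return 1 / (1 - failure_rate)


if __name__ == "__main__":
    manual = 3 * DAYS_PER_WEEK * HOURS_PER_DAY        # three work weeks = 120 hours
    print(time_saved(manual, run_hours=3, runs=2))    # (14, 2)
    print(expected_runs(0.20))                        # average runs at a 20% failure rate
```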

baq · 2026-06-12T09:17:56 1781255876

Or in two words, managing variance.

Play some hold'em, folks, and keep track of how many times you lost with pocket aces.

jwood27 · 2026-06-11T22:15:28 1781216128

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

jadar · 2026-06-11T23:09:39 1781219379

That’s even smaller then!

notnullorvoid · 2026-06-12T01:15:02 1781226902

This sounds like classic "you're using it wrong", if they had said it was done in smaller tasks you would very likely have people here saying that was wrong too.

int_19h · 2026-06-11T23:28:20 1781220500

My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.

danmaz74 · 2026-06-12T12:54:59 1781268899

So I guess that a lot of those 80 hours were spent running the test suite between changes?

yalok · 2026-06-11T22:54:38 1781218478

If there are some specific tests/evals to satisfy that an agent can run by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.

jasondigitized · 2026-06-11T21:12:26 1781212346

The invisible hand of the market will always win

jasondigitized · 2026-06-11T15:37:13 1781192233

My old CTO has a spiritual metric that always resonated with me: Revenue / Lines of Code. The higher the number the better.

jasondigitized · 2026-06-11T15:34:16 1781192056

There are plenty of projects that are greenlit that have good intent but are boneheaded when it comes to solution and implementation. Good engineers hate these types of projects. Good PMs try to avoid these at all costs, but sometimes your hand gets forced because some VIP, either internal or external, volun-tells you to do it.

jasondigitized · 2026-06-10T17:25:38 1781112338

Against what other metric? Income? Stock Market? Debt?

jasondigitized · 2026-06-10T17:24:26 1781112266

Now overlay average income on top

jasondigitized · 2026-06-10T00:34:42 1781051682

This. Today's models easily jump over the bar you need for basic usability and intuitive UX. If it's doing weird things, you are holding it wrong.

8n4vidtmkvmk · 2026-06-10T01:01:13 1781053273

Might need some additional prompting? I haven't tried Fable, but GPT 5.5 and Gemini 3.5 Flash are... OK on first pass, and if you're specific about what you want they can usually get it.

jasondigitized · 2026-06-10T00:33:49 1781051629

By what measure?

jasondigitized · 2026-06-09T03:47:37 1780976857

Tell me more about the grocery app. Sounds awesome!

spaceships · 2026-06-09T15:34:03 1781019243

https://gist.github.com/ryanlanciaux/895ce19d94db765be4ebf53...

jasondigitized · 2026-06-09T02:42:50 1780972970

What type of Alfred functionality did you use? I entertained doing the same thing.

jkubicek · 2026-06-09T03:36:02 1780976162

URL-opener, bookmark opener, URL expander (type JIRA-1234 and it’ll automatically open the correct ticket), basic text replacement stuff.
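
As a rough sketch of the URL-expander idea, the core is a pattern match plus string concatenation. The base URL below is a placeholder, the ticket-key pattern is an assumption about how keys look, and `expand_ticket` is a hypothetical helper rather than anything from the tool described above.

```python
import re
from typing import Optional

BASE_URL = "https://jira.example.com/browse/"   # placeholder, not a real instance
# Assumed key shape: uppercase project key, a dash, then a number (e.g. JIRA-1234).
TICKET_RE = re.compile(r"[A-Z][A-Z0-9]+-\d+")


def expand_ticket(text: str) -> Optional[str]:
    """Return the ticket URL if `text` is exactly a ticket key, else None."""
    match = TICKET_RE.fullmatch(text.strip())
    return BASE_URL + match.group(0) if match else None


if __name__ == "__main__":
    print(expand_ticket("JIRA-1234"))   # https://jira.example.com/browse/JIRA-1234
    print(expand_ticket("hello"))       # None
```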