
It's obviously not as easy as you make it sound; it was reverted since it broke some existing apps.

I can see it being useful for stuff like renaming files, splitting hpp/cpp files, doing renames, etc.

Yeah but using a different model for that? Ideally a skilled model that works with the IDE functionality/API would “solve” this issue.

I find this absolutely wild. From my experience, Codex code quality is still not as good as a human's, so letting Codex do something and not verifying / cleaning up behind it will most likely result in lower code quality and possibly subtle bugs.

For upgrading frameworks and such there are usually not that many architectural decisions to be made where you care about how exactly something is implemented. Here the OP could probably verify quite easily that the build works, with all the expected artifacts.

What Codex often does for this is write a small Python script and execute it to bulk-rename files, for example.

I agree that there is a use for fast, "simpler" models; there are many tasks where regular codex-5.3 is not necessary, but I think it's rarely worth the extra friction of switching from regular 5.3 to 5.3-spark.


According to the last comment in the issue, it is already available for self-hosted clients.

It's not a meme.

For example, C has pointer provenance, so pointers aren't just addresses. That's why type punning is such a mess. For a language that claims to be super close to the hardware, this seems like a very weird thing.
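
A rough sketch of what that means in practice (my own example, not from the thread); whether y actually ends up right after x in memory is entirely up to the implementation:

    #include <stdio.h>

    int main(void) {
        int x = 1, y = 2;
        int *p = &x + 1;   /* one past the end of x: a valid pointer value */
        int *q = &y;
        if (p == q) {      /* the raw addresses may well compare equal... */
            *p = 10;       /* ...but writing through p is UB: its provenance
                              is x, so it may not be used to access y */
            printf("%d\n", *q);   /* a compiler is allowed to assume y is still 2 */
        }
        return 0;
    }

Two pointers that hold the same address are still not interchangeable, which is exactly the "not just addresses" point.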


C is super close to the hardware in that it works exactly like the abstract C machine, which is kind of a generalization of the common subset of a lot of machines, invented to make it portable, i.e. viable to implement straightforwardly on various architectures. For example, pointer provenance makes it work on machines with segmented storage: segments can be located anywhere, so there is no guarantee that addresses beyond a single allocation are expressible or meaningful.

What makes C feel so free for programming is that instead of prescribing an implementation paradigm, it exposes a computing model and then lets the programmer write whatever is possible with that (and also what is not -- UB). A lot of higher-level abstractions are quickly implemented in C, e.g. inheritance and polymorphism, but they can still be used in ways you like: you don't have to do pure class inheritance, you can get creative with a vtable, or just use another vtable with the same object. These are things you can't do when classes are a language construct.
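
As a hypothetical sketch (the names are made up), a hand-rolled vtable in C is only a handful of lines, and nothing stops you from swapping the table per object at runtime:

    #include <stdio.h>

    struct shape;
    struct shape_vtable {
        double (*area)(const struct shape *);
    };
    struct shape {
        const struct shape_vtable *vt;   /* explicit, swappable vtable pointer */
        double w, h;
    };

    static double rect_area(const struct shape *s)     { return s->w * s->h; }
    static double triangle_area(const struct shape *s) { return s->w * s->h / 2.0; }

    static const struct shape_vtable rect_vt     = { rect_area };
    static const struct shape_vtable triangle_vt = { triangle_area };

    int main(void) {
        struct shape s = { &rect_vt, 3.0, 4.0 };
        printf("%f\n", s.vt->area(&s));   /* 12.0, dispatched through the vtable */
        s.vt = &triangle_vt;              /* same object, different behaviour */
        printf("%f\n", s.vt->area(&s));   /* 6.0 */
        return 0;
    }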


The C abstract machine is exactly the important part. There is a difference between saying C is close to "the hardware" and saying C is close to the C abstract machine. The latter, as you described, has a few concepts that allow for abstraction and thus portability, but obviously they lead to situations where "maps to the hardware" doesn't seem to hold true.

My gripe is only with people acting like the C abstract machine doesn't exist and C is just syntax sugar for a bit of assembly. It's a bit more involved than that.


> The C abstract machine is exactly the important part. ... My gripe is only with people acting like the C abstract machine doesn't exist and C is just syntax sugar for a bit of assembly. It's a bit more involved than that.

Most people have no understanding of an abstract machine, even though the very idea of a high-level programming language is based on it.

The C Language Standard itself specifies "Program Execution" only on an "Abstract Machine". Mapping that abstract machine to an ISA/memory on real hardware is the task of the C compiler. It can do this in any manner as long as the observable behaviour of the program is "as-if" it ran on the abstract machine.

Relevant quote:

A conforming implementation executing a well-formed program shall produce the same observable behavior as one of the possible executions of the corresponding instance of the abstract machine with the same program and the same input.
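
A small, hypothetical illustration of what that buys the implementation: only the printf below is observable behaviour, so an optimizing compiler may drop the loop and emit the constant directly.

    #include <stdio.h>

    int main(void) {
        long sum = 0;
        /* The loop itself is not observable behaviour on the abstract machine... */
        for (int i = 0; i < 1000; i++)
            sum += i;
        /* ...only this output is, so the compiler may behave "as-if" it ran the
           loop and simply print 499500. */
        printf("%ld\n", sum);
        return 0;
    }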

Further Resources:

Wikipedia Abstract machine - https://en.wikipedia.org/wiki/Abstract_machine

Abstract machines for programming language implementation (pdf) - https://www.rw.cdl.uni-saarland.de/people/diehl/private/pubs...

The Abstract Machine: A Pattern for Designing Abstract Machines (pdf) - https://www.plopcon.org/pastplops/plop99/proceedings/garcia/...


It remains close to computer hardware + optimiser. This provenance thing was introduced to aid optimisation.
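
One hypothetical example of the kind of optimisation provenance justifies: a pointer parameter cannot carry the provenance of a local whose address never escapes, so the compiler may keep that local in a register across the store.

    #include <stdio.h>

    void f(int *p) {
        int local = 1;   /* its address never escapes this function */
        *p = 2;          /* p cannot legally point at "local", whatever its raw
                            address is, because it cannot have local's provenance */
        if (local == 1)  /* so this test may be folded to "true" without a reload */
            puts("still 1");
    }

    int main(void) {
        int x = 0;
        f(&x);
        return 0;
    }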

This. I think software development is the best use case for AI yet. I use it almost daily at work and it's a huge help.

Enterprise customers will happily pay even $100/mo subscriptions, and it has a clear value proposition that can be decently verified.


Revenue should not be confused with profit. The large AI companies must easily be spending more on compute than they're making from a $20-200/mo subscription. In the best case it might break even for the AI companies. There is no way that they're actually earning a profit from these subscriptions at this time.

I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.

Having a human in the loop eliminates all the problems that LLMs have, and continuously reviewing smallish chunks of code works really well in my experience.

It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.

LLMs still can't think and generalize (and I doubt that changes). If I tell Codex to implement 3 features, it won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach, since effectively code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts, but for bigger codebases where longevity matters it's a dealbreaker.


I'm personally 100% convinced of the opposite, that it's a waste of time to steer them. We know now that agentic loops can converge given the proper framing and self-reflection tools.

Converge towards what though... I think the level of testing/verification you need to have an LLM output a non-trivial feature (e.g. Paxos/anything with concurrency, business logic that isn't just "fetch value from spreadsheet, add to another number and save to the database") is pretty high.

in the new world, engineers have to actually be good at capturing and interpreting requirements

But we’ve been here before. The agile movement originated as a response to the multifarious problems of big design up front.

In this new world, why stop there? It would be even better if engineers were also medical doctors and held multiple doctorate degrees in mathematics and physics and also were rockstar sales people.

Sounds like the kind of hyperbole from someone who's just been forced to set up a linter for the first time.

As a doctor, this sounds like an engineer's job.

> it's a waste of time to steer them

It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.

There's a balance to strike between micro-management and no steering at all.


The prompt is decreasingly relevant. The verification environment you have is what actually matters.

I think this all comes down to information.

Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.

The same applies to verification: it's fundamentally an information problem.

You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.


Having prompts be information deficient is the whole point of LLMs. The only complete description of a typical programming problem is the final code or an equivalent formal specification.

Exactly the point. But LLMs miss that human intuition part.

Does the AI agent know what your company is doing right now, what every coworker is working on, how they are doing it, and how your boss will change priorities next month without being told?

If it really knows better, then fire everyone and let the agent take charge. lol


No, but Codex wouldn’t have asked you those questions either

For me, it still asks for confirmation at every decision when using plans. And when multiple unforeseen options appear, it asks again. I don’t think you’ve used Codex in a while.

It asks you what your coworkers are working on and whether the thing you are working on is your boss's number one priority?

skill issue

A significant portion of engineering time is now spent ensuring that yes, the LLM does know about all of that. This context can be surfaced through skills, MCP, connectors, RAG over your tools, etc. Companies are also starting to reshape their entire processes to ensure this information can be properly and accurately surfaced. Most are still far from completing that transformation, but progress tends to happen slowly, then all at once.

[flagged]


All we can do is try our best to look at the world with clear eyes, and think about where the industry's going over the next couple years

Not how we want things to be, but how they actually are and will be

I don't think AI for programming is a passing fad


Who hurt you?

Also what are you even proposing/advocating for here?

This meta-state-of-company context is just as capturable as anything else with the right lines of questioning and spyware and UI/UX to elicit it.


> given the proper framing

This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.

You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.

If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.

If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.


> If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.

Trying to get my company to realize this right now.

Probably the most efficient way to work would be on a video call including the product person/stakeholder, the designer, and me, the one responsible for the actual code, so that we can churn through the now incredibly fast and cheap implementation step together in pure alignment.

You could probably do it async but it’s so much faster to not have to keep waiting for one another.


Maybe some day, but as a Claude Code user, it makes enough pretty serious screw-ups, even with a very clearly defined plan, that I review everything it produces.

You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.


I use that to feed back into my spec development and prompting and CI harnesses, not steering in real time.

Every mistake is a chance to fix the system so that mistake is less likely or impossible.

I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.

This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.


> You're no longer developing software, you're doing therapy for robots.

Or, really, hacking in "learning", building your knowhow-base.

> But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.

Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:

https://github.com/anthropics/knowledge-work-plugins

Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.

For those following along at home:

This is the return of the "expert system", now running on a generalized "expert system machine".


I assumed you'd build such a massive set of rules (that Claude often does not obey) that you'd eat up your context very quickly. I've actually removed all plugins / MCPs because they chewed up way too much context.

It's as much about what to remove as what to add. Curation is the key. Skills also give you some levers to get the kind of context-sensitive instruction you need, though I haven't delved too deeply into them. My current total instruction set is around 2,500 tokens at the moment.

Reviewing what it produces once it thinks it has met the acceptance criteria and the test suite passes is very different from wasting time babysitting every tiny change.

True, and that's usually what I'm doing now, but to be honest I'm also giving all of its code at least a cursory glance.

Some of the things it occasionally does:

- Ignores conventions (even when emphasized in the CLAUDE.md)

- Decides to just not implement tests if it spins out on them too much (it tells you, but only as it happens, and that scrolls by pretty quickly)

- Writes badly performing code (N+1)

- Does more than you asked (in a bad way, changing UIs or adding cruft)

- Makes generally bad assumptions

I'm not trying to be overly negative, but in my experience to date, you still need to babysit it. I'm interested though in the idea of using multiple models to have them perform independent reviews to at least flag spots that could use human intervention / review.


Sure, but none of those things requires you to watch it work. They're all easy to pick up on when reviewing a finished change, which ideally should come after its instructions have had it run linters, run sub-agents that verify it has added tests, and run sub-agents doing a code review.

I don't want to waste my time reviewing a change the model can still significantly improve all by itself. My time costs far more than the models.


then you're using it wrong, to be frank with you.

You give it tools so it can compile and run the code. Then you give it more tools so it can decide between iterations whether it got closer to the goal or not. Let it evaluate itself. If it can't evaluate something, let it write tests and benchmark itself.

I guarantee that if the criteria is very well defined and benchmarkable, it will do the right thing in X iterations.

(I don't do UI development. I do end-to-end system performance on two very large code bases. My tests can be measured. The measure is simply binary: better or not. It works.)


That’s what oh-my-open-code does.

good luck.

I've been working on very complex problems with this model and the results I have have surprised people over and over again.

I've been using Codex for one week and I have been the most productive I have ever been. Small PRs, tight rules, and I get almost exactly what I want. Things tend to go sideways when scope creeps into my request, but I just close the PR instead of fighting with the agent. In one week: 28 PRs, 26 merged. Absolutely unreal.

I will personally never consider using an agent that can't be easily pushed toward working on its own for long periods (hours) at a time. It's a total waste of time for me to babysit the LLM.

> If I tell Codex to implement 3 features, it won't stop and find a general solution that unifies them unless explicitly told to

That could easily be automated.


But tokens are way cheaper than human labor

Aider was doing this a long time ago

Yes, they are; the fact that the agents have full access to your local project files makes a gigantic difference.

They do *very* well at things like: "Explain what this class does" or "Find the biggest pain points of the project architecture".

No comparison to regular ChatGPT when it comes to software development. I suggest trying it out, and not by saying "implement game": rather, try giving it clearly scoped tasks where the AI doesn't have to think or abstract/generalize. So, as some kind of code monkey.


I think a lot of AI talk doesn't explain where it shines the brightest (imo): Write the code you don't want to write.

I recently had an issue, "add VNC authentication", which covers adding VNC password auth to our in-house VNC server at work.

This is not hard, just a bit of tedious work: getting the plumbing done, adding some UI for the settings, fiddling with some bits according to the spec.

But it's (at least to me) not very enjoyable: there is nothing to learn, nothing new to discover, not much creativity necessary, etc., and this is where Codex comes in. As long as you give it clearly scoped tasks in an environment where it can use existing structures and conventions, it will deliver. In this case it implemented 85% of the feature perfectly, and I only had to tweak minor things like refactoring 1-2 functions. Obviously I read, understood, and checked everything it wrote; that is an absolute must for serious work.

So my point is, use AI as the "code monkey". I believe most developers enjoy the creative aspects of the job, but not the "type C++ on your keyboard" part. AI can help with the latter: it will type what you tell it, and you can focus on the architecture and creative parts of the whole thing.

You don't have to trust AI in that sense; use it like autocompletion: you can program perfectly fine without it, but it makes your fingers hurt more.

