jorl17's comments | Hacker News

Out of curiosity, what’s your flow? Do you have codex write plans to markdown files? Just chat? What languages or frameworks do you use?

I’m an avid cursor user (with opus), and have been trying alternatives recently. Codex has been an immense letdown. I think I was too spoiled by cursor’s UX and internal planning prompt.

It’s incredibly slow, produces terribly verbose and over-complicated code (unless I use high or xhigh, which are even slower), and misses a lot of details. Python/django and react frontend.

For the first time I felt like I could relate to those people who say "it doesn’t make them faster", because they have to keep fixing the agent’s slop. I never felt that with opus 4.5 and 4.6 and cursor.


Codex cli is a very performant cli though, better than any other cli code assistant I've used.

I mean, does it matter what code it's producing? If it renders and functions, just use it. I think it's better to take the L on verbose code and optimize the really ugly bits by hand in a few minutes than be kneecapped every 5 hours by limits and constant pleas to shift to Sonnet.


Out of curiosity, what do you feel are the key differences between cursor + models versus something like Claude Code/Codex?

Are you feeling the benefits of the switch? What prompted you to change?

I've been running cursor with my own workflows (where planning is definitely a key step) and it's been great. However, the feeling of missing out, coupled with the fact I am a paying ChatGPT customer, got me to try codex. It hasn't really clicked in what way this is better, as so far it really hasn't been.

I have this feeling that you're supposed to be able to take a more hands-off approach with these tools, so maybe I just haven't really done that yet. Haven't fiddled with worktrees or anything else yet either.


AFAICT it really is just a preference for terminal vs IDE. The terminal folks often believe terminal is intrinsically better and say things like “you’re still using an IDE.” Yegge makes this quite explicit in his gastown manifesto.

I've been using Unix command lines since before most people here were born. And I actively prefer cursor to the text-only coding agents. I like being able to browse the code next to the chat and easily switch between sessions, see markdown rendered properly, etc.

On fundamentals I think the differences are vanishing. They have converged on the same skills format standards. Cursor uses RAG for file lookups but Claude reads the whole file - token efficiency vs completeness. They both seem to periodically innovate some orchestration function which the other copies a few weeks later.

I think it really is just a stylistic preference. But the Claude people seem convinced Claude is better. Having spent a bunch of time analyzing both I just don’t see it.


Claude is quite good at European Portuguese in my limited tests. Gemini 3 is also very good. ChatGPT is just OK and keeps code-switching all the time; it's very bizarre.

I used to think of Gemini as the lead in terms of Portuguese, but recently subjectively started enjoying Claude more (even before Opus 4.5).

In spite of this, ChatGPT is what I use for everyday conversational chat because it has loads of memories there, because of the top of the line voice AI, and, mostly, because I just brainstorm or do 1-off searches with it. I think effectively ChatGPT is my new Google and first scratchpad for ideas.


When I first started coding with LLMs, I could show a bug to an LLM and it would start to bugfix it, and very quickly would fall down a path of "I've got it! This is it! No wait, the print command here isn't working because an electron beam was pointed at the computer".

Nowadays, I have often seen LLMs (Opus 4.5) give up on their original ideas and assumptions. Sometimes I tell them what I think the problem is, and they look at it, test it out, and decide I was wrong (and I was).

There are still times where they get stuck on an idea, but they are becoming increasingly rare.

Therefore, I think that modern LLMs clearly are already able to question their assumptions and notice when framing is wrong. In fact, they've been invaluable to me in fixing complicated bugs in minutes instead of hours because of how much they tend to question many assumptions and throw out hypotheses. They've helped _me_ question some of my assumptions.

They're inconsistent, but they have been doing this. Even to my surprise.


Agreed on that, and the speed is fantastic with them; the dynamics of questioning the current session's assumptions have also gotten way better.

Yet, given an existing codebase (even one that isn't huge), they often won't suggest "we need to restructure this part differently to solve this bug". Instead they tend to push forward.


You are right, agreed.

Having realized that, perhaps you are right that we may need a different architecture. Time will tell!


This is the first model to which I've sent my collection of nearly 900 poems, spanning 15 years, together with an extremely simple prompt (in Portuguese), and it manages to produce an impeccable analysis of the poems as a (barely) cohesive whole.

It does not make a single mistake, it identifies neologisms, hidden meaning, 7 distinct poetic phases, recurring themes, fragments/heteronyms, related authors. It has left me completely speechless.

Speechless. I am speechless.

Perhaps Opus 4.5 could do it too — I don't know because I needed the 1M context window for this.

I cannot put into words how shocked I am at this. I use LLMs daily, I code with agents, I am extremely bullish on AI and, still, I am shocked.

I have used my poetry and an analysis of it as a personal metric for how good models are. Gemini 2.5 pro was the first time a model could keep track of the breadth of the work without getting lost, but Opus 4.6 straight up does not get anything wrong and goes beyond that to identify things (key poems, key motifs, and many other things) that I would always have to kind of trick the models into producing. I would always feel like I was leading the models on. But this — this — this is unbelievable. Unbelievable. Insane.

This "key poem" thing is particularly surreal to me. Out of 900 poems, while analyzing the collection, it picked 12 "key poems, and I do agree that 11 of those would be on my 30-or-so "key poem list". What's amazing is that whenever I explicitly asked any model, to this date, to do it, they would get maybe 2 or 3, but mostly fail completely.

What is this sorcery?


This sounds wayyyy over the top for a model that released 10 mins ago. At least wait an hour or so before spewing breathless hype.

He just gave a specific personal example of why he is hyped up; did you read a word of it?

Yeah, I read it.

“Speechless, shocked, unbelievable, insane, speechless”, etc.

Not a lot of real substance there.


Give the guy a chance.

Me too: I was "Speechless, shocked, unbelievable, insane, speechless" the first time I set Claude Code on a complicated 10-year-old code base which used outdated cross-toolchains and APIs. It obviously did not work anymore and hadn't for a long time.

I saw the AI research the web and update the embedded toolchain, the APIs to external weather services, etc., turning it into a complete new (WORKING!) code base in about 30 minutes.

Speechless, I was ...


Could you please post the key poems? Would love to read them.

I am way too self-conscious to do that :) Plus they are almost all in Portuguese!

> What is this sorcery?

The one you'll be seeking counter-spells against pretty soon.


Can you compare the result to using 5.2 thinking and gemini 3 pro?

I can run the comparison again, and also include OpenAI's new release (if the context is long enough), but, last time I did it, they weren't even in the same league.

When I last did it, 5.X thinking (can't remember which it was) had this terrible habit of code-switching between English and Portuguese that made it sound like a robot (an agent meant to do things, rather than a human writing an essay), and it just didn't really "reason" effectively over the poems.

I can't explain it in any other way other than: "5.X thinking interprets this body of work in a way that is plausible, but I know, as the author, to be wrong; and I expect most people would also eventually find it to be wrong, as if it is being only very superficially looked at, or looked at by a high-schooler".

Gemini 3, at the time, was the worst of them, with some hallucinations, date mix-ups (mixing poems from 2023 with poems from 2019), and overall just feeling quite lost and making very outlandish interpretations of the work. To be honest, it sort of feels like Gemini hasn't been able to progress on this task since 2.5 pro (it has definitely improved on other things — I've recently switched to Gemini 3 on a product that was using 2.5 before).

Last time I did this test, Sonnet 4.5 was better than 5.X Thinking and Gemini 3 pro, but not exceedingly so. It's all so subjective, but the best I can say is it "felt like the analysis of the work I could agree with the most". I felt more seen and understood, if that makes sense (it is poetry, after all). Plus when I got each LLM to try to tell me everything it "knew" about me from the poems, Sonnet 4.5 got the most things right (though they were all very close).

Will bring back results soon.

Edit:

I (re-)tested:

- Gemini 3 (Pro)

- Gemini 3 (Flash)

- GPT 5.2

- Sonnet 4.5

Having seen Opus 4.5, I find they all seem very similar, and I can't really distinguish them in terms of depth and accuracy of analysis. They obviously have differences, especially stylistic ones, but, when compared with Opus 4.5, they're all in the same ballpark.

These models produce rather superficial analyses (when compared with Opus 4.5), missing out on several key things that Opus 4.5 got, such as specific and recurring neologisms and expressions, accurate connections to authors that serve as inspiration (Claude 4.5 gets them right, the other models get _close_, but not quite), and the meaning of some specific symbols in my poetry (Opus 4.5 identifies the symbols and the meaning; the other models identify most of the symbols, but fail to grasp the meaning sometimes).

Most of what these models say is true, but it really feels incomplete. Like half-truths or only a surface-level inquiry into truth.

As another example, Opus 4.5 identifies 7 distinct poetic phases, whereas Gemini 3 (Pro) identifies 4 which are technically correct, but miss out on key form and content transitions. When I look back, I personally agree with the 7 (maybe 6), but definitely not 4.

These models also clearly get some facts mixed up which Opus 4.5 did not (such as inferred timelines for some personal events). After having posted my comment to HN, I've been engaging with Opus 4.5 and have managed to get it to also slip up on some dates, but not nearly as much as other models.

The other models also seem to produce shorter analyses, with a tendency to hyperfocus on some specific aspects of my poetry, missing a bunch of them.

--

To be fair, all of these models produce very good analyses which would take someone a lot of patience and probably weeks or months of work (which of course will never happen, it's a thought experiment).

It is entirely possible that the extremely simple prompt I used is just better with Claude Opus 4.5/4.6. But I will note that I have used very long and detailed prompts in the past with the other models and they've never really given me this level of... fidelity... about how I view my own work.


Disgusting.

An agent can always be told what to do by a human.

However, a human can't do what a human can't do. For example, a human can't answer at superhuman speed. A way to be somewhat certain that an agent is the one responding is to send it a barrage of questions/challenges that could only be answered correctly and quickly without a human in the loop, and for which a human could not write a computer program to simulate an agent (at least not fast enough).

I think this is very achievable, and I can think of many plausible ways to explore "speed of response/action" as a way of identifying that an agent is operating. I'm sure there are other signals in addition to speed which could be explored.
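
To make the speed idea concrete, here's a minimal sketch of a latency-gated challenge. It is purely illustrative: the reversal task, the 3-second budget, and all the names are my own assumptions, and a real scheme would want challenges that actually require LLM-class capability rather than a trivial string transform.

    import secrets
    import time

    # Assumed threshold: generous for an agent, far too short for a human
    # to read the challenge, solve it, and type the answer back.
    MAX_RESPONSE_SECONDS = 3.0

    def make_challenge():
        # A trivial but unpredictable task: reverse a random hex nonce.
        nonce = secrets.token_hex(16)
        return f"Reply with this string reversed: {nonce}", nonce[::-1]

    def verify(answer, expected, sent_at):
        # Correctness alone proves nothing; correctness *within the time
        # budget* is what a human in the loop can't fake.
        in_time = (time.monotonic() - sent_at) <= MAX_RESPONSE_SECONDS
        return answer.strip() == expected and in_time

    # Usage: record sent_at = time.monotonic() when the challenge goes out,
    # then call verify() with the reply once it arrives.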

Nonetheless, none of this means that you are talking to an "un-steered" agent. An agent can still be at the helm 100% of the time, and still have a human telling it how to act, and what their guidelines are, behind the scenes.

I find this all so fascinating.


Someone can tell an agent to post their text verbatim but still respond to all questions/challenges itself.

QoS?

Does sound like a QoS thing, but I would think QoS still applies over WiFi, so I'm not sure.

Good guess! I did enable some QoS algorithm in openwrt so it's certainly possible. But yeah it doesn't make sense that wifi traffic was non-impacting. But it's something I didn't try turning off.

When you're desperate you'll troubleshoot anything, hey.

This is only very tangentially related, but I got flashbacks to a time when we had dozens of edge/IoT raspberry pi devices with completely unupgradeable kernels and a bug that would make the whole USB stack shut down after "roughly a week" (7-9 days) of uptime. Once it shut down, the only way to fix it was to do a full restart, and, at the time, we couldn't really be restarting those devices (not even at night).

This means that every single device would seemingly randomly completely break: touchscreen, keyboard, modems, you name it. Everything broke. And since the modem was part of it, we would lose access to the device — very hard to solve because maintenance teams were sometimes hours (& flights!) away.

It seemed to happen at random, and it was very hard to trace it down because we were also gearing up for an absolutely massive (hundreds of devices, and then a couple of months later, thousands) launch, and had pretty much every conceivable issue thrown at us, from faulty USB hubs, broken modems (which would also kill the USB hub if they pulled too much power), and I'm sure I've forgotten a bunch of other issues.

Plus, since the problem took a week to manifest, we couldn't really iterate on fixes quickly - after deploying a "potential fix", we'd have to wait a whole week to actually see if it worked. I can vividly remember the joy I had when I managed to get the issue to consistently happen only in the span of 2 hours instead of a week. I had no idea _why_, but at least I could now get serviceable feedback loops.

Eventually, after trying to mess with every variable we could, and isolating this specific issue from the other ones, we somehow figured out that the issue was indeed a bug in the kernel, or at least in one of its drivers: https://github.com/raspberrypi/linux/issues/5088 . We had many serial ports and a pattern of opening and closing them which triggered the issue. Upgrading the kernel was impossible due to a specific vendor lock-in, and we had to fix live devices and ship hundreds of them in less than a month.

In the end, we managed to build several layers on top of this unpatchable ever-growing USB-incapacitating bug: (i) we changed our serial port access patterns to significantly reduce the frequency of crashes; (ii) we adjusted boot parameters to make it much harder to trigger (aka "throw more memory at the memory leak"); (iii) we built a system that proactively detected the issue and triggered a USB reset in a very controlled fashion (this would sometimes kill the network of the device for a while, but we had no choice!); (iv) if, for some reason, all else failed, a watchdog would still reboot the system (but we really _really_ _reaaaally_ didn't want this to happen).
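
For the curious, the detection-plus-controlled-reset in (iii) amounted conceptually to something like the sketch below. This is not our actual code; the expected USB ID, the sysfs paths, and the controller/driver names are placeholders that depend on the Pi model and kernel in use.

    import subprocess
    import time
    from pathlib import Path

    # Placeholder values: a vendor:product ID for a device we know must
    # always be present (e.g. the modem), and the platform device/driver
    # pair for the Pi's USB host controller.
    EXPECTED_USB_ID = "1234:5678"
    CONTROLLER_DRIVER = Path("/sys/bus/platform/drivers/dwc_otg")
    CONTROLLER_DEVICE = "3f980000.usb"

    def usb_stack_alive():
        # When the stack wedges, lsusb hangs, errors out, or stops
        # listing devices that are physically attached.
        try:
            out = subprocess.run(["lsusb"], capture_output=True, text=True, timeout=10)
        except subprocess.TimeoutExpired:
            return False
        return out.returncode == 0 and EXPECTED_USB_ID in out.stdout

    def reset_usb_controller():
        # Unbind and rebind the host controller driver via sysfs. This
        # drops every downstream device (including the modem) for a few
        # seconds, so only do it once we're sure the stack is dead.
        (CONTROLLER_DRIVER / "unbind").write_text(CONTROLLER_DEVICE)
        time.sleep(2)
        (CONTROLLER_DRIVER / "bind").write_text(CONTROLLER_DEVICE)

    if __name__ == "__main__":
        while True:
            if not usb_stack_alive():
                reset_usb_controller()
            time.sleep(60)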

In a way, even though these issues suck, it's when we are faced with them that we really grow. We need to grab our whole troubleshooting arsenal, do things that would otherwise feel "wrong" or "inelegant", and push through the issues. Just thinking back to that period, I'm engulfed by a mix of gratitude for how much I learned and an uneasy sense of dread (what if next time I won't be able to figure it out?).


Even National Instruments had this type of bug in their NI-VISA driver, which powers a good portion of the world's lab and test equipment. Every 31 days our test equipment would stop working, which happened to line up with the overflow of one of the Windows timers. It was also one of the fastest bug-fix updates I ever saw after reporting it!


I've always been sceptical of the modern tendency to throw powerful hardware at every embedded project. In most cases a good old Atmel AVR or even an 8051 would suffice.


I think I used to have that view as well, and in a way still do, but this particular project proved otherwise.

The first version was built pretty much that way, with a tiny microcontroller and extremely optimized code. The problem then became that it was very hard to iterate quickly on it and prototype new features. Every new piece of hardware that was added (or just evaluated) would have to be carefully integrated and it really added to the mess. Maybe it would have been different if the code had been structured with more care from the get-go, who knows (I entered the project already in version 2).

For version 2, the micro-controller was thrown out, and raspberry-pi based solutions were brought in. Sure, it felt like carrying a shotgun to fire at a couple of flies, but having a linux machine with such a vast ecosystem was amazing. On top of that, it was much easier to hire people to work on the project because now they could get by with higher level languages like python and javascript. And it was much, much, much faster to develop on.

The usage of the raspberry pi was, in my view, one of the key details that allowed for what ultimately became an extremely successful product. It was much less energy-efficient, but it was very simple to develop and iterate on. In the span of months we experimented with many hardware addons, as product-market-fit was still being found out, and the plethora of online resources for everything else was a boon.

I'm pretty sure this was _the_ project that really made me realize that more often than not the right solution is the one that lets the right people make the right decisions. And for that particular team, this was, without a doubt, a remarkably successful decision. Most of the problems that typically come with it (such as bloat, and inefficiency) were eventually solved, something which would not have been possible by going slowly at first.


A week? I've had some Pis lose USB in 1-2 days. Fortunately we could afford to make them self-restart every couple of hours.


I also had the same experience, but I could only make them restart during the night. So I wrote a monitor to check if any of the Pis lost USB before restarting.

When our business grew, even restarting every night, we would get one or two lost-USB warnings every day. One day I didn't receive any warnings. I was really happy: I had fixed the issue! Three days later a client called, screaming that the service hadn't been working for two whole days and we had done nothing. After getting every Pi restarted, I went to check the monitor. Shut down. I asked my business partner about it. "The alarms made me anxious, so I decided to shut down the monitor."

Obviously I sold my shares and never looked back.


Ah well. Our project was Pis in a crappy mesh network, so we lost data occasionally even if they stayed on, and it was not so important to have continuous data anyway. We rebooted them every 3 or 6 hours or so.


When I was 19, an ex-student of my Alma Mater came to give a talk about TDD. While I found the lecture interesting, I vividly remember that a portion of our community rallied against him, attempting to boycott his presence because he worked for Palantir.

At the time, I remember thinking how extreme that seemed, and how I was "sure" nothing is black-and-white and that, certainly, while Palantir had shady connections, for sure it must bring some good to the world and, so, why boycott this poor man? It felt genuinely baffling to me.

While in many ways I consider myself a more balanced person today (precisely thinking less in black-and-white terms), this is a topic where I do not agree. I would not work for Palantir and, were I to travel back in time, I would join the boycott. Heck, given how I was when I was younger, I'd expand on it greatly and try to rally some form of physical protest.

A friend of mine once threw me the argument of "well, the enemy [presumably China] is doing this kind of stuff, so we have to do it, too". This may seem like a compelling argument at first — and it may be so for many — but it can't be, to me. It's ethically disgusting. The solution to a world with decaying ethics is not to continue contributing to its decay. It erases accountability, it normalizes atrocity, it strips humanity from our very own flesh and blood — it escalates conflict! It. Just. Can't. Be.

We must fight this filth.


Welcome to the downvote club. Anyone who criticizes tech oligarchs on here gets downvoted by bots.

