Hacker News | richardw's comments

> I view LLMs akin to a dictionary

…If every time you looked something up, the dictionary gave you a slightly different definition, and sometimes the wrong one!


Go look up the same word across various dictionaries: they do not define terms identically.

Reproducibility is a separate issue.


Dictionaries are not a great analogy, because the standout feature of LLMs is that their output can change based on the context provided by individual users.

Differences between dictionaries are decided by the authors and publishers of the dictionaries without taking individual user queries into account.


Totally. Surely IDEs like Antigravity are meant to give the LLM more tools to use, e.g. for refactoring or dependency management? I haven’t used it, but it seems a quick win to move from token generation to deterministic tool use.

As if. I’ve had Gemini stuck on AG because it couldn’t figure out how to use only one version of React. I managed to work out that the build failed because two versions of React were being used, but it kept saying “I’ll remove React version N” and then proceeding to add a new dependency on the latest version. Loops and loops of this. On a similar note, AG really wants to parse code with weird grep commands that don’t make any sense given the directory context.
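
For the record, that diagnosis is completely deterministic, which is exactly the tool use I mean. A minimal sketch, assuming an npm-based project (the version number is made up; `npm ls` and the `overrides` field are standard npm features):

    # Is more than one copy of React in the tree? Two distinct version
    # numbers in this output (as opposed to "deduped" entries) is the bug.
    npm ls react

    # Fix by pinning the whole tree to one version in package.json
    # ("overrides" needs npm 8.3+; Yarn's equivalent is "resolutions"):
    #   "overrides": { "react": "18.3.1" }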

It’s a $90k engineer that sometimes acts like a vandal, one who never has thoughts like “this seems to be a bad way to go; let me ask the boss” or “you know, I was thinking: shouldn’t we try to extract this code into a reusable component?” The worst developers I’ve worked with have better instincts for what’s valuable. I wish it would stop with “the simplest way to resolve this is X little shortcut” -> boom.

It basically stumbles around generating tokens within the bounds (usually) of your prompt, and rarely stops to think. The goal is token generation, baby. Not careful evaluation. I have to keep forcing it to stop creating magic inline strings and use constants or config instead, even though those instructions are all over my Claude.md and I’m using the top model. It loves to take shortcuts that save GPU but cost me time and money to wrestle back to something rational. “These issues weren’t created by me in this chat right now, so I’ll ignore them and ship it.” No, fix all the bugs. That’s the job.
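
To make the magic-strings point concrete, this is the kind of cleanup I keep asking for (a sketch only; the endpoint name, URL, and Upload function are made up to keep it runnable):

    #include <iostream>
    #include <string>
    #include <string_view>

    // Before: the model repeats "https://api.example.com/v2/docs" inline at
    // every call site. After: one named constant, one place to change it.
    namespace config {
    constexpr std::string_view kDocsEndpoint = "https://api.example.com/v2/docs";
    }

    // Hypothetical call site, just so the sketch compiles and runs.
    void Upload(std::string_view endpoint, const std::string& body) {
        std::cout << "POST " << endpoint << " (" << body.size() << " bytes)\n";
    }

    int main() {
        Upload(config::kDocsEndpoint, "{\"title\":\"hello\"}");
    }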

Still, I love it. I can hand code the bits I want to, let it fly with the bits I don’t. I can try something new in a separate CLI tab while others are spinning. Cost to experiment drops massively.


Claude Code has those "thoughts" you say it never has. In plan mode, it isn't uncommon for it to ask: do you want to do this the quick and simple way, or would you prefer to "extract this code into a reusable component"? It will also back out and say, "Actually, this is getting messy, 'boss', what do you think?"

I could just be lucky that I work in a field with a thorough specification and numerous reference implementations.


I agree that Claude does this stuff. I also think the Chinese menus of options it provides are weak in imagination. For thoroughly specified problem spaces with reference implementations you're in good shape, but if you want to come up with a novel system, experience is required; otherwise you will end up in design hell. The danger is in juniors thinking the Chinese menu of options is "good" in the first place. Just because the options are coherent does not mean they are good, and the "a little of this, a little of that" game of tradeoffs during design is lost.

This has happened to me too. Claude has stopped on occasion and said, "this is a big refactor, and will affect the UI as well. Do you want me to do it?"

I recently asked Claude to make some kind of simple data structure and it responded with something like "You already have an abstraction very similar to this in SourceCodeAbc.cpp line 123. It would be trivial to refactor this class to be more generic. Should I?" I was pretty blown away. It was like a first glimpse of an LLM play-acting as someone more senior and thoughtful than the usual "cocaine-fueled intern."

> sometimes acts like a vandal

I see you don't have experience working with a large number of real-life humans.


Great. There’s no reason all countries shouldn’t start preferring locally or regionally developed software. Of course interoperability is always a concern, but there needs to be another option between “one company” and “everyone hosts their own instance”.

I’m sad about Groq going to them, because the market needs the competition. But ASIC inference seems to require a simpler design than training does, so it’s easier for multiple companies to enter. It seems inevitable that competition emerges. And a Chinese company, for example, will not be sold to Nvidia.

What’s wrong with this logic? Any insiders willing to weigh in?


I'm not an insider, but ASICs come with their own suite of issues and might be obsolete if a different architecture becomes popular. They'll have a much shorter lifespan than Nvidia hardware in all likelihood, and will probably struggle to find fab capacity that puts them on equal footing in performance. For example, look at the GPU shortage that hit crypto despite hundreds of ASIC designs existing.

The industry badly needs to cooperate on an actual competitor to CUDA, and unfortunately they're more hostile to each other today than they were 10 years ago.


You can build ASICs to be a lot more energy efficient than current GPUs, especially if your power budget is heavily bound by raw compute as opposed to data movement bandwidth. The tradeoff is much higher latency for any given compute throughput, but for workloads such as training or even some kinds of "deep thinking inference" you don't care much about that.


Debian. Linux. The HTTP protocol.


It has no sense of truth or value. You need to check what it wrote, and you need to tell it what’s important to a human. It’ll give you the average but miss the insight.


I’ve recently created many Claude skills to do repeatable tasks (architecture review, performance, magic strings, privacy, SOLID review, documentation review, etc.). The pattern is: when I’ve prompted it into the right state and it’s done what I want, I ask it to create a skill. I get Codex to check the skill. I could then run it independently in another window and feed the results back to adjust… but you get the idea.
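
For anyone who hasn’t seen the format: a skill is just a folder under .claude/skills/ with a SKILL.md whose frontmatter tells Claude when to pull it in. A rough sketch of the kind of skill I mean (the name and checklist items here are illustrative, not my actual skill):

    ---
    name: magic-strings-review
    description: Review changed code for inline literals that should be
      named constants or configuration. Use after any feature work.
    ---

    1. List every string or numeric literal added in the current diff.
    2. For each one, decide: named constant, config entry, or fine inline
       (e.g. log messages).
    3. Propose the constant name and location; don't apply changes without
       confirmation.
    4. Flag anything that duplicates an existing constant.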

And almost every time it screws up, we create a test, often for the whole class of problem. More recently it’s been far better behaved. Between Opus, skills, docs, generating Mermaid diagrams, and tests, it’s been a lot better. I’ve also cleaned up much of the architecture so there’s only one way to do things. This keeps it more aligned and helps with entropy. And these will work better as models improve. Having a match between code, documents, and tests means it’s not relying on just one source.

Prompts like this seem to work: “what’s the ideal way to do this? Don’t be pragmatic. Tokens are cheaper than me hunting bugs down years later”


Can you tell me more about how you do tests? What do they look like? What testing tools or frameworks do you use?


This is smart as hell. I’ve long wondered how they’d combat ASICs without diluting their own benefits. This gives them a bit more time to figure out the moats, which is useful because Groq was going places. This juices Groq’s distribution, production, and ability to access a wider range of skills where necessary.

I expect China to want to compete with this. Simpler than full-blown Nvidia chips. Cue much cheaper and faster inference for all.


Not terribly niche. All config that isn’t environment-specific and is used in inner loops or at startup. It’s even got a test for serialised values, so it can be used to speed your case up:

https://github.com/sebastienros/comptime/blob/main/test/Comp...

But you need to be sure you won’t want to change it without recompiling.


Well, it also needs to be something you need to generate or calculate; otherwise you would just write by hand the code that comptime outputs.
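
Right. The classic illustration of that (shown here in C++ constexpr terms rather than guessing at comptime's actual API) is a lookup table: derived data nobody would maintain by hand, baked into the binary at compile time:

    #include <array>
    #include <cstdint>
    #include <cstdio>

    // Build the CRC-32 lookup table during compilation (C++17). The binary
    // ships 256 precomputed constants; runtime code only indexes into them.
    constexpr std::array<std::uint32_t, 256> MakeCrcTable() {
        std::array<std::uint32_t, 256> table{};
        for (std::uint32_t i = 0; i < 256; ++i) {
            std::uint32_t c = i;
            for (int k = 0; k < 8; ++k)
                c = (c & 1u) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            table[i] = c;
        }
        return table;
    }

    constexpr auto kCrcTable = MakeCrcTable();  // evaluated at compile time

    int main() {
        std::printf("table[1] = 0x%08X\n", static_cast<unsigned>(kCrcTable[1]));
    }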

