> You are doing embedded development or anything else not as mainstream as web dev? LLMs are still useful but no longer mind blowing and often produce hallucinations.
I experienced this with Claude 4 Sonnet and, to some extent, gpt-5-mini-high.
When able to run tests against its output, Claude produces pretty good Rust backend and TypeScript frontend code. However, Claude became borderline unproductive once I started experimenting with uefi-rs. Other LLMs, like gpt-5-mini-high, did not fare much better, but they were at least capable of admitting lack of knowledge. In particular, GPT-5 would provide output akin to "here is some pseudocode that you may be able to adapt to your choice of UEFI bindings".
Testing in a UEFI environment is quite difficult; the LLM can't just run `cargo test` and verify its output. Things get worse in embedded, because crates like embedded_hal made massive API changes between 0.2 and 1.0 (the latest version), and each LLM I've tried seems to only have knowledge of 0.2 releases. Also, for embedded, forget even thinking about testing harnesses (which at least exist in some form with UEFI, it's just difficult to automate the execution and output for an LLM). In this case, you cannot really trust the output of the LLM. To minimize risk of hallucination, I would try maintaining data sheets and library code in context, but at that point, it took more time to prompt an LLM than handwrite code.
I've been writing a lot of embedded Rust over the past two weeks, and my usage of LLMs in general decreased because of that. Currently planning to resume development on some of my "easier" projects, since I have about 300 Claude prompts remaining in my Zed subscription, and I don't want them to go to waste.
This is where Rust's "if it compiles, it's probably correct" philosophy may come in handy.
"Shifting bugs left" is even more important for LLMs than it is for humans. There are certain tests LLMs can't run, so if we can detect bugs at compile time and run the LLM in a loop until things compile, that's a significant benefit.
My recent experience is that llms are dogshit at rust, though, unable to correct bugs without inserting new ones, going back and forth fixing and breaking the same thing, etc.
Sounds like the general "LLMs are net useful or not" sentiment here too. Personally Rust+LLMs work great, and workflow is rapid for as long as you can get the LLM to run one command to say "good or bad" without too much manually work, then it can iterate until it all works. Standard advice for prompting like "Don't make tests pass by changing assertions" tends to make the experience better too, but that's not Rust specific either.
> Also, for embedded, forget even thinking about testing harnesses (which at least exist in some form with UEFI, it's just difficult to automate the execution and output for an LLM).
I think this doesn't have to be like this and we can do better for this. If LLMs keep this up, good testing infrastructure might become more important.
One of my expectations for the future is the development of testing tools whose output is "optimized" in some way for LLM consumption. This is already occurring with Bun's test runner, for instance.[0] They are implementing a flag in the test runner so that the output is structured and optimized for token count.
Overall, I agree with your point. LLMs feel a lot more reliable when a codebase has thorough, easy-to-run tests. For a similar reason, I have been drifting towards strong, statically-typed languages. Both Rust and TypeScript have rich type systems that can express many kinds of runtime behavior with just types. When a compiler can make strong guarantees about a program's behavior, I assume that helps nudge the quality of LLM output a bit higher. Tests then help prevent silly regressions from occurring. I have no evidence for this besides my anecdotal experience using LLMs across several programming languages.
In general, I've had the best experience with LLMs when there's plenty of static analysis (and tests) on the codebase. When a codebase can't be easily tested, then I get much less productivity gains from LLMs. So yeah, I'm all for improving testing infrastructure.
I experienced this with Claude 4 Sonnet and, to some extent, gpt-5-mini-high.
When able to run tests against its output, Claude produces pretty good Rust backend and TypeScript frontend code. However, Claude became borderline unproductive once I started experimenting with uefi-rs. Other LLMs, like gpt-5-mini-high, did not fare much better, but they were at least capable of admitting lack of knowledge. In particular, GPT-5 would provide output akin to "here is some pseudocode that you may be able to adapt to your choice of UEFI bindings".
Testing in a UEFI environment is quite difficult; the LLM can't just run `cargo test` and verify its output. Things get worse in embedded, because crates like embedded_hal made massive API changes between 0.2 and 1.0 (the latest version), and each LLM I've tried seems to only have knowledge of 0.2 releases. Also, for embedded, forget even thinking about testing harnesses (which at least exist in some form with UEFI, it's just difficult to automate the execution and output for an LLM). In this case, you cannot really trust the output of the LLM. To minimize risk of hallucination, I would try maintaining data sheets and library code in context, but at that point, it took more time to prompt an LLM than handwrite code.
I've been writing a lot of embedded Rust over the past two weeks, and my usage of LLMs in general decreased because of that. Currently planning to resume development on some of my "easier" projects, since I have about 300 Claude prompts remaining in my Zed subscription, and I don't want them to go to waste.