I don't understand the productivity that people get out of these AI tools. I've tried it and I just can't get anything remotely worthwhile unless it's something very simple or something completely new being built from the ground up.
Like sure, I can ask claude to give me the barebones of a web service that does some simple task. Or a webpage with some information on it.
But any time I've tried to get AI services to help with bugfixing/feature development on a large, complex, potentially multi-language codebase, it's useless.
And those tasks are the ones that actually take up the majority of my time. On the occasion that I'm spinning a new thing up quickly, I don't really need an AI to do it for me -- I mean, that's the easy part!
Is there something I'm missing? Am I just not using it right? I keep seeing people talk about how addictive it is, how the productivity boost is insane, how all their code is now written by AI and then audited, and I just don't see how that's possible outside of really simple rote programming.
The first and most important question to ask here is: are you using a coding agent? A lot of times, people who aren't getting much out of LLM-assisted coding are just asking Claude or GPT for code snippets, and pasting and building them themselves (or, equivalently, they're using LLM-augmented autocomplete in their editor).
Almost everybody doing serious work with LLMs is using an agent, which means that the LLM is authoring files, linting them, compiling them, and iterating when it spots problems.
There's more to using LLMs well than this, but this is the high-order bit.
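To make "agent" concrete: at its core it's just a loop in which the model writes a file, the toolchain runs, and the diagnostics get fed back into the next prompt. A rough sketch of the shape of it (ask_llm here is a hypothetical stand-in for whatever model API or CLI you actually drive, and py_compile stands in for your real linter/compiler/test runner):

    # Illustrative sketch only; ask_llm is a hypothetical helper, not a real API.
    import subprocess

    def ask_llm(prompt: str) -> str:
        """Send a prompt to the model and return its reply (not implemented here)."""
        raise NotImplementedError

    def agent_loop(task: str, path: str, max_iters: int = 5) -> bool:
        prompt = f"Write the full contents of {path} to accomplish: {task}"
        for _ in range(max_iters):
            code = ask_llm(prompt)
            with open(path, "w") as f:
                f.write(code)
            # Lint/compile and capture diagnostics; swap in your real toolchain.
            check = subprocess.run(["python", "-m", "py_compile", path],
                                   capture_output=True, text=True)
            if check.returncode == 0:
                return True  # clean compile; hand off to tests and review
            # Feed the errors back so the model iterates on its own output.
            prompt = (f"{path} failed to compile:\n{check.stderr}\n"
                      f"Current content:\n{code}\n"
                      f"Return the full corrected file.")
        return False

Real agents layer tool-calling, repo search, and test runs on top of this, but that feedback loop is why the hallucination complaints mostly go away: the compiler catches the made-up interface and the model gets another shot.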
Funny, I would give the absolute opposite advice. In my experience, the use of agents (mainly Cursor) is a sure-fire way to have a really painful experience with LLM-assisted coding. I much prefer to use AI as a pair programmer that I talk to and sometimes let write entire files, but I'm always the one doing the driving, and mostly the one writing the code.
If you aren't building up mental models of the problem as you go, you end up in a situation where the LLM gets stuck at the edges of its capability, and you have no idea even how to help it overcome the hurdle. Then you spend hours backtracking through what it's done, building up the mental model you need, before you can move on. The process is slower and more frustrating than not using AI in the first place.
I guess the reality is, your luck with AI-assisted coding really comes down to the problem you're working on, and how much of it is prior art the LLM has seen in training.
I mean, it might depend, but many of the most common complaints about LLM coding (most notably hallucination) are essentially solved problems if you're using agents. Whatever works for you! I don't even like autocomplete, so I sympathize with not liking agents.
If it helps, for context: I'll go round and round with an agent until I've got roughly what I want, and then I go through and beat everything into my own idiom. I don't push code I don't understand and most of the code gets moved or reworked a bit. I don't expect good structure from LLMs (but I also don't invest the time to improve structure until I've done a bunch of edit/compile/test cycles).
I think of LLMs mostly as a way of unsticking and overcoming inertia (and writing tests). "Writing code", once I'm in flow, has always been pleasant and fast; the LLMs just get me to that state much faster.
I'm sure training data matters, but I think static typing and language tooling matters much more. By way of example: I routinely use LLMs to extend intensely domain-specific code internal to our project.
> but many of the most common complaints about LLM coding (most notably hallucination) are essentially solved problems if you're using agents.
I have definitely not seen this in my experience (with Aider, Claude and Gemini). While helping me debug an issue, Gemini added a !/bin/sh line to the middle of the file (which appeared to break things), and despite having that code in the context didn't realise it was the issue.
OTOH, when asking for debugging advice in a chat window, I tend to get more useful answers, as opposed to a half-baked implementation that breaks other things. YMMV, as always.
> ...many of the most common complaints about LLM coding (most notably hallucination) are essentially solved problems if you're using agents
Inconsistency and crap code quality aren't solved yet, and these make the agent workflow worse because the human only gets to nudge the AI in the right direction very late in the game. The alternative, interactive, non-agentic workflows allow for more AI-hand-holding early, and better code quality, IMO.
Agents are fine if no human is going to work on the (sub)system going forward, and you only care about the shiny exterior without opening the hood to witness the horrors within.
Cursor is pretty bad in my experience. I don't know why because I find Windsurf better and they both use Claude.
Regardless, Gemini 2.5 Pro is far far better and I use that with open-source free Roo Code. You can use the Gemini 2.5 Pro experimental model for free (rate limited) to get a completely free experience and taste for it.
Cursor was great and started it off, but others took notice and now they're all more or less the same. It comes down to UX and preference, but I think Windsurf and Roo Code just did a better job here than Cursor, personally.
Agree. My favorite workflow has been chatting with the LLM in the assistant panel of Zed, then making inline edits by prompting the AI with the context of that chat. That way, I can align with the AI on how the problem should be solved before letting it loose. What's great about this is that, depending on how easy or hard the problem is for the LLM, I can shift between handholding / manual coding and vibe coding.
Agents make it easier for you to give context to the LLM, or for it to grab some by itself like Cline/Claude/Cursor/Windsurf can do.
With a web-based system you need repomix or something similar to give the whole project (or parts of it, if you can be bothered to filter) as context, which isn't exactly nifty.
I think they're all fine. Cursor is popular and charges a flat fee for model calls (interposed through their model call router, however that works). Aider is probably the most popular open source command line one. Claude Code is probably the most popular command line agent overall; Codex is the OpenAI equivalent (I like Codex fine).
later
Oh, I like Zed a lot too. People complain that Zed's agent (the back-and-forth with the model) is noticeably slower than the other agents, but to me, it doesn't matter: all the agents are slow enough that I can't sit there and wait for them to finish, and Zed has nice desktop notifications for when the agent finishes.
Plus you get a pretty nice editor --- I still write exclusively in Emacs, but I think of Zed as being a particularly nice code UI for an LLM agent.
I've settled on Cline for now, with openrouter as the backend for LLMs, Gemini 2.5 for planning and Claude 3.7 for act mode.
Cursor is fine, Claude Code and Aider are a bit too janky for me - and tend to go overboard (making full-ass git commits without prompting) and I can't be arsed to rein them in.
Speaking up for Devin.ai here. What I like about it is that after the initial prompt nearly all of my interaction with it is via pull request comments.
I have this workflow where I trigger a bunch of prompts in the morning, lunch and at the end of the day. At those same times I give it feedback. The async nature really means I can have it work on things I can’t be bothered with myself.
Oh they aren’t like time based instructions or anything. First thing I do when I sit down in the morning is go through the list of tasks I thought up overnight and fire devin at them. Then I go do whatever “real” work I needed to get done. Then at lunch I check in to see how things are going and give feedback or new tasks. Same as the last thing I do at night.
It keeps _me_ from context switching into agent manager mode. I do the same thing for doing code reviews for human teammates as well.
Right, no, I figured that! Like the idea of preloading a bunch of things into a model that I don't have the bandwidth to sort through, but having them on tap when I come up for air from whatever I'm currently working on, sounds like a super good trick.
That’s kind of where Devin excels. The agent itself is good enough, I don’t even know what model it uses. But it’s hosted and well integrated with GitHub, so you just give it a prompt and out shoots a pr sometime later. You comment on the pr and it refines it. It has a concept of “sessions” so you can start many of those tasks at once. You can login to each of its tasks and see what it is doing or interdict, but I rarely do.
Like most of the code agents it works best with tight testable loops. But it has a concept of short vs long tests and will give you plans and confidence values to help you refine your prompt if you want.
I tend to just let it go. If it gets to a 75% done spot that isn’t worth more back and forth I grab the pr and finish it off.
Yesterday I gave cursor a try and made my first (intentionally very lazy) vibe coding approach (a simple threejs project). It accepted the task and did things, failed, did things, failed, did things ... failed for good.
I guess I could work on the magic incantations to tweak here and there a bit until it works and I guess that's the way it is done. But I wasn't hooked.
I do get value out of LLMs for isolated, broken-down subtasks, where asking an LLM is quicker than googling.
For me, AI will probably become really useful once I can scan and integrate my own complex codebase, so that it gives me solutions that work there and doesn't hallucinate API endpoints or jump between incompatible library versions (my main issue).
I did almost the same thing and had pretty much the same experience. A lot of times it felt so close to being great, but it ultimately wasted more time than if I had just worked on the project and occasionally asked ChatGPT to generate some sample code to figure out an API.
I've had the same issue every time I've tried it. The code I generally work on is embedded C/C++ with in-house libraries, where the tools are less than useful as they try to generate non-existent interfaces and generally generate worse code than I'd write by hand. There's a need for correctness and for being able to explain the code, so using those tools is also detrimental to explainability unless I hand-hold them to the point where I'm writing all of the code myself.
Generating function documentation hasn't been that useful either as the doc comments generated offer no insight and often the amount I'd have to write to get it to produce anything of value is more effort than just writing the doc comments myself.
For my personal project in zig they either get lost completely or give me terrible code (my code isn't _that_ bad!). There seems to be no middle ground here. I've even tried the tools as pair programmers but they often get lost or stuck in loops of repeating the same thing that's already been mentioned (likely because it falls out of the context window).
When it comes to others using such tools, I've had to ask them to stop using them to think, as it becomes next to impossible to teach / mentor if they're passing what I say to the LLM or trying to have it perform the work. I'm confident in debugging people when it comes to math / programming, but with an LLM in between it's just not possible to guess where they went wrong or how to bring them back to the right path, as the thought process is lost (or there wasn't one to begin with).
This is not even "vibe coding", I've just never found it generally useful enough to use day-to-day for any task and my primary use of say phind has been to use it as an alternative to qwant when I cannot game the search query well enough to get the search results I'm looking for (i.e I ignore the LLM output and just look at the references).
> I've had the same issue every time I've tried it. The code I generally work on is embedded C/C++ with in-house libraries, where the tools are less than useful as they try to generate non-existent interfaces and generally generate worse code than I'd write by hand.
That's because whatever training the model had, it didn't cover anything remotely similar to the codebase you worked on.
We get this issue even with obscure FLOSS libraries.
When we fail to provide context to LLMs, they generate examples by following superficial cues like coding conventions. In extreme cases, such as code that employs source code generators or templates, LLMs even fill in function bodies that the code generators are designed to generate for you. That's because, if LLMs are oblivious to the context, they hallucinate their way into something seemingly coherent. Unless you provide them with context or instruct them not to make up stuff, they will bullshit their way into an example.
What's truly impressive about this is that often times the hallucinated code actually works.
> Generating function documentation hasn't been that useful either as the doc comments generated offer no insight and often the amount I'd have to write to get it to produce anything of value is more effort than just writing the doc comments myself.
Again, this suggests a failure on your side to provide any context.
If you give them enough context, LLMs synthesize and present it almost instantly. If you're prompting an LLM to generate documentation, which boils down to synthesizing what an implementation does and what its purpose is, and the LLM comes up empty, that means you failed to give it anything to work on.
The bulk of your comment screams failure to provide any context. If your code steers far away from what it expects, fails to follow any discernible structure, and doesn't even convey purpose and meaning in little things like naming conventions, you're not giving the LLM anything to work on.
I'm aware of the _why_ but this is why the tools aren't useful for my case. If they cannot consume the codebase in a reasonable amount of time and provide value from that then they generally aren't useful in areas where I would want to use them (navigating large codebases). If the codebase is relatively small or the problem is known then an LLM is not any better than tab-complete and arguably worse in many cases as the generated result has to be parsed and added to my mental model of the problem rather than the mental model being constructed while working on the code itself.
I guess my point is, I have no use for LLMs in their current state.
> That's because whatever training the model had, it didn't cover anything remotely similar to the codebase you worked on.
> We get this issue even with obscure FLOSS libraries.
This is the issue, however, as unfamiliar codebases are exactly where I'd want to use such tooling. Not working in those cases makes it less than useful.
> Unless you provide them with context or instruct them not to make up stuff, they will resort to bullshit their way into an example.
In all cases context was provided extensively, but at some point it's easier to just write the code directly. The context is in the surrounding code; if the tool cannot pick up on that, even when combined with direction, it is again less than useful.
> What's truly impressive about this is that often times the hallucinated code actually works.
I haven't experienced the same. It fails more often than not and the result is much worse than the hand-written solution regardless of the level of direction. This may be due to unfamiliar code but again, if code is common then I'm likely familiar with it already thus lowering the value of the tool.
> Again, this suggests a failure on your side to provide any context.
This feels like a case of blaming the user without full context of the situation. There are comments, the names are descriptive and within reason, and there's annotation of why certain things are done the way they are. The purpose of a doc comment is not "this does X" but rather _why_ you want to use this function and what its purpose is, which is something LLMs struggle to derive from my testing of them. Adding enough direction to describe that is effectively writing the documentation with a crude english->english compiler in between. This is the same problem with unit test generation, where unit tests exist not to game code coverage but to provide meaningful tests of the domain and known edge cases of a function, which is again something the LLM struggles with.
For any non-junior task LLM tools are practically useless (from what I've tested) and for junior level tasks it would be better to train someone to do better.
> I'm aware of the _why_ but this is why the tools aren't useful for my case. If they cannot consume the codebase in a reasonable amount of time and provide value from that then they generally aren't useful in areas where I would want to use them (navigating large codebases).
I challenge you to explore different perspectives.
You are faced with a service that handles any codebase that's thrown at it with incredible ease, without requiring any tweaking or special prompting.
For some reason, the same system fails to handle your personal codebase.
What's the root cause? Does it lie in the system that works everywhere with anything you throw at it? Or is it in your codebase?
Well, what would I have to do to please the LLM? Code isn't written for LLMs to consume; it's written to communicate intent to people and for machines to run, which is what provides value for the user at the end of the day. If an LLM fails at being useful within a codebase when it's supposed to be a "works anywhere" tool, then the tool is less than useful.
Note that language servers, static analysis tooling, and so on still work without issue.
The cause (my assumption) is that there aren't enough good examples in the training set for anything useful to be the most likely continuation, leading to a suboptimal result given the domain. So the tool doesn't work "everywhere" for cases where there's less use of a language or less code in general dealing with a particular problem.
> Is there something I'm missing? Am I just not using it right?
The talk about it makes more sense when you remember most developers are primarily writing CRUD webapps or adware, which is essentially a solved problem already.
Just to say: setting up oauth and telemetry are harder than building CRUD, because it comes later in the learning progression. But it’s still mostly just following a recipe; LLMs are good at finding recipes that are uncommon, but available in the dataset. They break more when you go off script and you don’t take tiny steps.
I'm laughing at the idea of server telemetry being "the data center equivalent of a CRUD app". I'll just pull out my Telemetry On Rails framework and run some generators...
I'd rather write a new Device Mapper target than do either of those things, and LLMs have been helpful with that too. Is a Device Mapper target the "CRUD app of the Linux kernel"?
> I'd rather write a new Device Mapper target than do either of those things
Perhaps it’s time for a career change then. Follow your joy and it will come more naturally for you to want to spread it.
Again,
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
From my reading, the “strongest plausible interpretation” of the original “CRUD app” line was “it’s a solved problem that largely requires rtfm and rote execution of well-worn patterns in code structure”, making it similarly situated to “server telemetry” in making llms appear superintelligent to people new to programming within those paradigms.
I’m unfamiliar with “device mapping”, so perhaps someone else can confirm if it is “the crud app of Linux kernel dev” in that vein.
Just listing topics in software development is hardly evidence of either your own ability to work on them, or of their inherent complexity.
Since this seems to have hurt your feelings, perhaps a more effective way to communicate your needs would be to explain why you find “server telemetry” to be more difficult/complex/w/e to warrant needing an llm for you to be able to do it.
I work on a VSCode debugger extension that uses gdb as a backend, and I've had some success tackling bugs in that using Cursor + Gemini 2.5 Pro. The codebase is in TypeScript.
Based on what I've seen, Python and TypeScript are where it fares best. Other languages are much more hit and miss.
They aren't increasing productivity. Not in the short term, anyway.
They are very handy tools that can help you learn a foreign code/base faster. They can help you when you run into those annoying blockers that usually take hours or days or a second set of eyes to figure out. They give you a sounding board and help you ask questions and think about the code more.
Big IF here. IF you bother to read. The danger is some people just keep clicking and re-prompting until something works, but they have zero clue what it is and how it works. This is going to be the biggest problem with AI code editors. People just letting Jesus take the wheel and during this process, inefficient usage of the tools will lead to slower throughput and a higher bill. AI costs a good chunk of change per token and that's only going up.
I do think it's addictive for sure. I also think the "productivity boost" is a feeling people get, but no one measures. I mean, it's hard to measure. Then again, if you do spend an hour on a problem you get stuck on vs 3 days then sure it helped productivity. In that particular scenario. Averaged out? Who knows.
They are useful tools, they are just also very misunderstood and many people are too lazy to take the time to understand them. They read headlines and unsubstantiated claims and get overwhelmed by hype and FOMO. So here we are. Another tech bubble. A super bubble really. It's not that the tools won't be with us for a long time or that they aren't useful. It's that they are way way overvalued right now.
I appreciate you voicing your feelings here. My previous employer requested we try AI tooling for productivity purposes, and I was finding myself in similar scenarios to what you mention. The parts that would have benefitted from a productivity gain weren’t seeing any improvement, while the areas that saw a speedup weren’t terribly mission-critical.
The one thing I really appreciated though was the AI’s ability to do a “fuzzy” search in occasional moments of need. For example, sometimes the colloquial term for a feature didn’t match naming conventions in source code. The AI could find associations in commit messages and review information to save me time rummaging through git-blame. Like I said though, that sort of problem wasn’t necessarily a bottleneck and could often be solved much more cheaply by asking a coworker on Slack.
Probably 80% of the time I spend coding, I'm inside a code file I haven't read in the last month. If I need to spend more than 30 seconds reading a section of code before I understand it, I'll ask AI to explain it to me. Usually, it does a good job of explaining code at a level of complexity that would take me 1-15 minutes to understand, but does a poor job of answering more complex questions or at understanding more complex code.
It's a moderately useful tool for me. I suspect the people that get the most use out of it are those that would take more than an hour to read code I would take 10 minutes to read. Which is to say, the least experienced people get the most value.
I find it's incredibly helpful for prototyping. These tools quickly reach a limit of complexity and put out sub par code, but for a green field prototype that's ok.
I've successfully been able to test out new libraries and do explorations quickly with AI coding tools and I can then take those working examples and fix them up manually to bring them up to my coding standards. I can also extend the lifespan of coding tools by doing cleanup cycles where I manually clean up the code since they work better with cleaner encapsulation, and you can use them to work on one scoped component at a time.
I've found that they're great to test out ideas and learn more quickly, but my goal is to better understand the technologies I'm prototyping myself, I'm not trying to get it to output production quality code.
I do think there's a future where LLMs can operate in a well architected production codebase with proper type safe compilation, linting, testing, encapsulation, code review, etc, but with a very tight leash because without oversight and quality control and correction it'll quickly degrade your codebase.
Incredibly useful for 'glue code' or internal apps that are for automating really annoying processes - but where normally the time it would take to develop those tools would add up and take away from the core work.
For instance, dealing with files that don't quite work correctly between two 3D applications because of slightly different implementations. Ask for a python script to patch the files so that they work correctly – done almost instantly just by describing the problem.
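To give a made-up example of the kind of patch script I mean (the real quirks vary per tool; this one just rewrites Wavefront OBJ vertices from a Y-up to a Z-up convention for an app that expects the other axis):

    # Illustrative only: flip Wavefront OBJ geometry from Y-up to Z-up for a
    # target application that assumes a different up axis.
    import sys

    def patch_obj(src_path: str, dst_path: str) -> None:
        with open(src_path) as src, open(dst_path, "w") as dst:
            for line in src:
                if line.startswith("v "):      # geometric vertex: v x y z
                    _, x, y, z = line.split()[:4]
                    # Swap the Y and Z axes (negating one to keep handedness).
                    dst.write(f"v {x} {z} {-float(y)}\n")
                else:
                    dst.write(line)            # pass everything else through

    if __name__ == "__main__":
        patch_obj(sys.argv[1], sys.argv[2])

It's trivial code, but that's the point: the LLM writes it from a one-paragraph description of the mismatch faster than I could look up the file spec.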
Also for prototyping. Before you spend a month crafting a beautiful codebase, just get something standing up so you can evaluate whether it's worth spending time on – like, does the idea have legs?
90% of programming problems get solved with a rubber ducky – and this is another valuable area. Even if the AI isn't correct, often times just talking it through with an LLM will get you to see what the solution is.
I’ve had good experiences using it, but with the caveat that only Gemini Pro 2.5 has been at all useful, and only for “spot” tasks.
I typically use it to whip up a CLI tool or script to do something that would have been too fiddly otherwise.
While sitting in a Teams meeting I got it to use the Roslyn compiler SDK in a CLI tool that stripped a very repetitive pattern from a code base. Some OCD person had repeated the same nonsense many thousands of times. The tool cleaned up the mess in seconds.
The repeated pattern was like a magic incantation to make the errors go away (it didn't, actually), probably written by someone used to Visual Basic's "ON ERROR RESUME NEXT" or some such.
For me it's an additional tool, not the only tool. LSPs are still better for half of the things I do daily (renaming things, extracting things, finding symbols, etc.). I can't imagine using AI for everything and even meeting my current velocity.
I swear the people posting this stuff are just building todo list tier apps from scratch. On even the most moderately large codebase it completely breaks down.
I find that token memory size limits are the main barrier here.
Once the LLM starts forgetting other parts of the application, all bets are off and it will hallucinate the dumbest shit, or even just remove features wholesale.
For my workflow a lot of the benefit is in the smaller tasks. For example, get a medium size diff from two change-sets, paste it in, and ask to summarize the “what” and “why”. You have to learn how to give AI the right amount of context to get the right result back.
For code changes I prefer to paste a single function in, or a small file, or error output from a compile failure. It’s pretty good at helping you narrow things down.
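To make the "right amount of context" part concrete, the diff-summary case is basically just a sentence of framing plus the diff, something like this throwaway script (purely illustrative; the revision names are placeholders, and in practice I just paste the output into the chat):

    # Illustrative sketch: build a "summarize the what and why" prompt from the
    # diff between two change-sets, ready to paste into a chat window.
    import subprocess

    def diff_prompt(base: str, head: str, max_chars: int = 20000) -> str:
        diff = subprocess.run(["git", "diff", f"{base}..{head}"],
                              capture_output=True, text=True, check=True).stdout
        if len(diff) > max_chars:
            # Trim rather than overflow the model's context window.
            diff = diff[:max_chars] + "\n[... diff truncated ...]"
        return ("Summarize this diff: first the WHAT (behavior changes), then the "
                "WHY (likely intent), then anything that looks risky or unintended.\n\n"
                + diff)

    if __name__ == "__main__":
        print(diff_prompt("main", "HEAD"))  # example revisions; substitute your own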
So, for me, it’s a pile of small gains where the value is—because ultimately I know what I generally want to get done and it helps me get there.
Just yesterday I wanted to adjust the Renovate setup in a repo.
I could spend 5-10 minutes digging on through the docs for the correct config option, or I can just tap a hotkey, open up GitHub Copilot in Rider and tell it what I want to achieve.
And within seconds it had a correct-looking setting ready to insert to my renovate.json file. I added it, tested it and it works.
I kinda think people who diss AIs are prompting something like "build me Facebook" and then being disappointed when it doesn't :D