I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
The simplest explanation would be “You’re using it wrong…”, but I have the impression that this is not the primary reason. (Although, as an AI systems developer myself, you would be surprised by the number of users who simply write “fix this” or “generate the report” and then expect an LLM to correctly produce the complex thing they have in mind.)
It is true that there is an “upper management” hype of trying to push AI into everything as a magic solution for all problems. There is certainly an economic incentive from a business valuation or stock price perspective to do so, and I would say that the general, non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor.
While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.
Another theory: you have a spec in your mind, write down most of it, and expect the LLM to implement the full spec. The result will objectively deviate from that spec.
Some developers will either retrospectively change the spec in their head or are basically fine with the slight deviation. Other developers will be disappointed, because the LLM didn't deliver on the spec they clearly hold in their head.
It's a bit like a psychological false-memory effect where you misremember, and/or some people are more flexible in their expectations and accept "close enough" while others won't.
This is true. But, it's also true of assigning tasks to junior developers. You'll get back something which is a bit like what you asked for, but not done exactly how you would have done it.
Both situations need an iterative process to fix and polish before the task is done.
The notable thing for me was, we crossed a line about six months ago where I'd need to spend less time polishing the LLM output than I used to have to spend working with junior developers. (Disclaimer: at my current place-of-work we don't have any junior developers, so I'm not comparing like-with-like on the same task, so may have some false memories there too.)
But I think this is why some developers have good experiences with LLM-based tools. They're not asking "can this replace me?" they're asking "can this replace those other people?"
> They're not asking "can this replace me?" they're asking "can this replace those other people?"
People in general underestimate other people, so this is the wrong way to think about this. If it can't replace you, it typically can't replace other people either.
Yeah, I see quite a lot of misanthropy in the rhetoric people sometimes use to advance AI. I'll say something like "most people are able to learn from their mistakes, whereas an LLM won't" and then some smartass will reply "you think too highly of most people" -- as if this simple capability is just beyond a mere mortal's abilities.
This implies that it executes the spec correctly, just not in a way that's expected. But if you actually look at how these things operate, that's flat out not true.
Mitchell Hashimoto just did a write-up about his process for shipping a new feature for Ghostty using AI. He clearly knows what he's doing and follows all the AI "best practices" as far as I could tell. And while he very clearly enjoyed the process and thinks it made him more productive, the post is also a laundry list of this thing just shitting the bed. It gets confused, can't complete tasks, and architects the code in ways that don't make sense. He clearly had to watch it closely, step in regularly, and in some cases throw the code out entirely and write it himself.
The amount of work I've seen people describe to get "decent" results is absurd, and a lot of people just aren't going to do that. For my money it's far better as a research assistant and something to bounce ideas off of. Or if it is going to write something it needs to be highly structured input with highly structured output and a very narrow scope.
What I want to see at this point are more screencasts, write-ups, anything really, that depict the entire process of how someone expertly wrangles these products to produce non-trivial features. There's AI influencers who make very impressive (and entertaining!) content about building uhhh more AI tooling, hello worlds and CRUD. There's experienced devs presenting code bases supposedly almost entirely generated by AI, who when pressed will admit they basically throw away all code the AI generates and are merely inspired by it.
Single-shot prompt to full app (what's being advertised) rapidly turns into "well, it's useful for getting motivated when starting from a blank slate" (OK, so is my Oblique Strategies deck, but that one doesn't cost 200 quid a month).
This is just what I observe on HN, I don't doubt there's actual devs (rather than the larping evangelist AI maxis) out there who actually get use out of these things but they are pretty much invisible. If you are enthusiastic about your AI use, please share how the sausage gets made!
Important: there is a lot of human coding, too. I almost always go in after an AI does work and iterate myself for a while.
Some people like to think for a while (and read docs) and just write it right on the first go. Some people like to build slowly and get a sense of where to go at each step. But in all of those approaches, there's a heavy factor of expertise needed from the person doing the work. And this expertise does not come for free.
I can use agentic workflows fine and generate code like anyone else. But the process is not enjoyable and there's no actual gain. Especially in an enterprise setting where you're going to use the same stack for years.
These things are amazing for maintenance programming on very large codebases (think 50-100 million lines of code or more, where the people who wrote the code no longer work there, and it's not open source, so "just google it or check Stack Overflow" isn't even an option).
A huge amount of effort goes into just searching for what relevant APIs are meant to be used without reinventing things that already exist in other parts of the codebase. I can send ten different instantiations of an agent off to go find me patterns already in use in code that should be applied to this spot but aren't yet. It can also search through a bug database quite well and look for the exact kinds of mistakes that the last ten years of people just like me made solving problems just like the one I'm currently working on. And it finds a lot.
Is this better than having the engineer who wrote the code and knows it very well? Hell no. But you don't always have that. And at the largest scale you really can't, because it's too large to fit in any one person's memory. So it certainly does devolve to searching and reading and summarizing for a lot of the time.
this is definitely closer to what I had in mind but it's still rather useless because it just shows what winning the lottery is like. what I am really looking for is neither the "Claude oneshot this" nor the "I gave up and wrote everything by hand" case but a realistic, "dirty" day-to-day work example. I wouldn't even mind if it was a long video (though some commentary would be nice in that case).
I don't think you should consider this as "winning the lottery", the author has been using these tools for a while.
The sibling comment with the writeup by the creator of Ghostty shows stuff in more detail and has a few cases of the agent breaking, though it also involves more "coding by hand".
I think the point is that you want to see typical results or process. How does it run when you use it 10 times, or 100 times, what results can you expect generally?
There's a lot of wishful thinking going around in this space and something more informative than cherrypicking is desperately needed.
Not least because lots of capable/smart people have no idea which way to jump when it comes to this stuff. They've trained themselves not to blindly hack solutions through trial and error but this essentially requires that approach to work.
I think your last point is pretty important: all that we see is done by experienced people, and today we don't have a good way of teaching "how to effectively use AI agents" other than telling people "use them a lot, apply software engineering best practices like testing". That is a big issue, compounded because this stuff is new, there are lots of different tools, and they evolve all the time. I don't have a better answer here than "many programmers that I respect have tried using those tools and are sticking with it rather than going back" (with exceptions, like Karpathy's nanochat), and "the best way to learn today is to use them, a lot".
As for "what are they really capable of", I can't give a clear answer. They do make easy stuff easier, especially outside of your comfort zone, and seem to make hard stuff come up more often and earlier (I think because you do stuff outside your comfort zone/core experience zone; or because you now have to think more carefully about design over a shorter period of time than before, with less direct experience with the code, kind of like in Steve Yegge's case; or because when hard stuff comes up it's stuff they are less good at handling, which means you can't use them).
The lower bound seems to be "small CLI tool". The upper bound seems to be "language learning app with paid users (sottaku I think? the dev talks on twitter. Lots of domain knowledge in Japanese here to check the app itself); implementing a model in PyTorch by someone who didn't know how to code before (00000005 seconds or something like this on twitter, has used all these models and tools a lot); reporting security issues that were missed in cURL". A middle bound is "very experienced dev shipping a feature faster, while doing other things, on a semi-mature codebase (Ghostty)"; another middle bound is "useful code reviews". That's about the best I can give you, I think.
I'm not sure if you just didn't understand what I'm looking for. If I'm searching for a good rails screencast to get a feeling for how it's used, a blogpost consisting of "rails new" is useless to me. I know that these tools can oneshot tasks, but this doesn't help me when they can't.
Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.
But the other thing is that, your expectations normalise, and you will hit its limits more often if you are relying on it more. You will inevitably be unimpressed by it, the longer you use it.
If I use it here and there, I am usually impressed. If I try to use it for my whole day, I am thoroughly unimpressed by the end, having had to re-do countless things it "should" have been capable of based on my own past experience with it.
> Well we are all doing different tasks on different codebases too. It's very often not discussed, even though it's an incredibly important detail.
Absolutely nuts I had to scroll down this far to find the answer. Totally agree.
Maybe it's the fact that every software development job has different priorities, stakeholders, features, time constraints, programming models, languages, etc. Just a guess lol
The simplest explanation is that most of us are code monkeys reinventing the same CRUD wheel over and over again, gluing things together until they kind of work and calling it a day.
"developers" is such a broad term that it basically is meaningless in this discussion
or, and get this, software development is an enormous field with 100s of different kinds of variations and priorities and use cases.
lol.
another option is trying to convince yourself that you have any idea what the other 2,000,000 software devs are doing and think you can make grand, sweeping statements about it.
there is no stronger mark of a junior than the sentiment you're expressing
For every coder doing some cutting-edge Computer SCIENCE there are 99 people creating one more CRUD API Glue application or microservice.
I've been doing this for 25 years and everything I do can be boiled down to API Glue.
Stuff comes in, code processes stuff, stuff goes out. Either to another system or to a database. I'm not breaking new ground or inventing new algorithms here.
The basic stuff has been the same for two decades now.
Maybe 5% of the code I write is actually hard, like when the stuff comes in REAL fast and you need to do the processing within a time limit. Or you need to get fancy with PostgreSQL queries to minimise traffic from the app layer to the database.
With LLM assistance I can have it do the boring 95%: scaffold one more FoobarController.cs, write the models and the Entity Framework definitions while I browse Hacker News or grab a coffee and chat a bit. Then I have more time to focus on the 5%, as well as more time to spend improving my skills and helping others.
Yes. I read the code the LLM produces. I've been here for a long time, I've read way more code than I've written, I'm pretty good at it.
> I've been doing this for 25 years and everything I do can be boiled down to API Glue.
Oooof, and you still haven't learned how big this field is? Give me the ego of a software developer who thinks they've seen it all in a field that changes almost daily. Lol.
> The basic stuff has been the same for two decades now.
hwut?
> Maybe 5% of the code I write is actually hard, like when the stuff comes in REAL fast and you need to do the processing within a time limit
God, the irony in saying something like this and not having the self-awareness to realize it's actually a dig at yourself. hahahahaha
Congratulations on being the most lame software developer on this planet who has only found himself in situations that can be solved by building strictly-CRUD software. Here's to hoping you keep pumping out those Wordpress plugins and ecommerce sites.
I have 2 questions for you to ruminate on:
1. How many programming jobs have you had?
2. How many programming jobs exist in the entire world at this moment?
It's gotta be what, a million job difference? lol. But you've seen it all, right? hahahaha
I didn't say that there aren't people doing cutting edge stuff.
But even John Romero did the boring stuff along with the cool stuff. Andrej Karpathy wrote a ton of boilerplate Python to get his stuff up and running[0].
Or are you claiming that every single line of the nanochat[0] project is peak computer science algorithms no LLM can replicate today?
Take the tasks/ directory in the initial commit, for example[1]. Dude is easily in the top 5 AI scientists in the world and he still spends a good amount of time writing pretty basic string wrangling in Python.
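For a sense of what "basic string wrangling" means here, an illustrative sketch (not actual nanochat code):

```python
# Illustrative boilerplate: normalize raw text lines into cleaned,
# deduplicated records. Nothing clever, just necessary plumbing.
def clean_lines(raw: str) -> list[str]:
    seen: set[str] = set()
    out: list[str] = []
    for line in raw.splitlines():
        line = line.strip().lower()
        if line and line not in seen:
            seen.add(line)
            out.append(line)
    return out
```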
My basic point here is that LLMs automate generating the boilerplate to a crazy degree, letting us spend more time in the bits that aren't boring and are actually challenging and interesting.
Well I know for a fact there are more code monkeys than rocket scientists working on advanced technologies. Just look at job offers really...
Anyone with any kind of experience in the industry should be able to tell that, so idk where you're going with your "junior" comment. Technically I'm a senior in my company and I'm including myself in the code monkey category. I'm not working on anything revolutionary, like most devs, just gluing things together, probably things that have been made dozens of times before and will be done dozens of times later... there is no shame in that, it's just the reality of software development. Just like most mechanics don't work on Ferraris, even if mechanics working on Ferraris do exist.
From my friends, working in small startups and large megacorps, no one is working on anything other than gluing existing packages together, a bit of es, a bit of postgres, a bit of crud. Most of them worked on more technical things while getting their degrees 15 years ago than they do right now... while being in the top 5% of earners in the country. 50% of their job consists of bullshitting the n+1 to get a raise and some other variant of office politics.
> From my friends, working in small startups and large megacorps, no one is working on anything other than gluing existing packages together,
And my friends aren't all doing that. So there's some anecdotal evidence to contradict yours.
And I think you're missing the point.
The point is the field is way bigger than either of us could imagine. You could have decades of experience and still only touch a small subset of the different technologies and problems.
> Well I know for a fact there are more code monkeys than rocket scientists working on advanced technologies
I don't know what this means, as it doesn't disprove the fact that the field is enormous. Of course not everyone is working on rockets. But that is irrelevant.
> 50% of their job consists of bullshitting the n+1 to get a raise and some other variant of office politics
Again, this doesn't mean we aren't working on different things.
I actually totally agree with this point made in your previous post:
> "developers" is such a broad term that it basically is meaningless in this discussion
But your follow-up feels antagonistic to that point.
I'm convinced it's not the results that are different, it's the expectations.
The SVP of IT for my company is 100% in on AI. He talks about how great it is all the time. I just recently worked on a legacy project in PHP he built years ago, and now I know his bar for what quality code looks like is extremely low...
I use LLMs daily to help with my work, but I tweak the output all the time because it doesn't quite get it right.
Bottom line, if your code is below average AI code will look great.
That seems to be the pitch, too. I now get Google ads advertising that you can ask your phone things about what it sees.
All the examples are so trivial, I can’t believe it. How to make a double espresso. What are these clouds called?
That’s being not a complete idiot as a service.
If it were at least "how do I start the decalcification process on this machine so it actually registers it and turns the service light off".
I would say they can't reliably deliver simple work. They often can, but reliability, to me, means I can expect it to work every time. Or at least as much as any other software tool, with failure rates somewhere in the vicinity of 1 in 10^5, 1 in 10^6. LLMs fail on the order of 1 in 10 times for simple work. And rarely succeed for complex work.
That is not reliable, that's the opposite of reliable.
One has to look at the alternatives. What would I do if not use the LLM to generate the code? The two answers are "coding it myself" and "asking another dev to code it". And neither of those comes anywhere near a 1-in-10^5 failure rate. Not even close.
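To put rough numbers on that gap, using the per-step rates from the comment above (illustrative estimates, not measurements):

```python
# Chance of at least one failure across n independent steps,
# comparing a 1-in-10 per-step rate (the rough LLM estimate above)
# with a 1-in-10^5 rate (conventional software tooling).
def p_any_failure(per_step_rate: float, n_steps: int) -> float:
    return 1 - (1 - per_step_rate) ** n_steps

llm_rate = p_any_failure(0.1, 20)    # roughly 0.88: intervention is near-certain
tool_rate = p_any_failure(1e-5, 20)  # roughly 0.0002: effectively never
```

Of course, neither I nor another dev comes anywhere close to the 1-in-10^5 end either, which is the point: the meaningful comparison is between these human-scale failure rates, not against perfect tooling.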
> I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
Some possible reasons:
* different models used by different folks, free vs paid ones, various reasoning effort, quantizations under the hood and other parameters (e.g. samplers and temperature)
* different tools used, like in my case I've found Continue.dev to be surprisingly bad, Cline to be pretty decent but also RooCode to be really good; also had good experiences with JetBrains Junie, GitHub Copilot is *okay*, but yeah, lots of different options and settings out there
* different system prompts, various tool use cases (e.g. let the model run the code tests and fix them itself), as well as everything ranging from simple and straightforward codebases that are a dime a dozen out there (and in the training data), vs something genuinely new that would trip up both your average junior dev and the LLMs
* open ended vs well specified tasks, feeding in the proper context, starting new conversations/tasks when things go badly, offering examples so the model has more to go off of (it can predict something closer to what you actually want), most of my prompts at this point are usually multiple sentences, up to a dozen, alongside code/data examples, alongside prompting the model to ask me questions about what I want before doing the actual implementation
* also sometimes individual models produce output for specific use cases badly, I generally rotate between Sonnet 4.5, Gemini Pro 2.5, GPT-5 and also use Qwen 3 Coder 480B running on Cerebras for the tasks I need done quickly and that are more simple
With all of that, my success rate is pretty great, and the claim that the tech can "barely follow a simple instruction" doesn't hold. Then again, most of my projects are webdev adjacent in mostly mainstream stacks, YMMV.
> Then again, most of my projects are webdev adjacent in mostly mainstream stacks
This is probably the most significant part of your answer. You are asking it to do things for which there are a ton of examples of in the training data. You described narrowing the scope of your requests too, which tends to be better.
It's true though, they can't. It really depends on what they have to work with.
In the fixed world of mathematics, everything could in principle be great. In software, it can in principle be okay even though contexts might be longer. When dealing with new contexts that are like real life but different, such as a story where nobody can communicate with the main characters because they speak a different language, the models simply can't deal with it, always returning to the context they're familiar with.
When you give them contexts that are different enough from the kind of texts they've seen, they do indeed fail to follow basic instructions, even though they can follow seemingly much more difficult instructions in other contexts.
> I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
My hypothesis is that developers work on different things, and while these models might work very well for some domains (react components?) they will fail quickly in others (embedded?). So on one side we have developers working on X (LLM good at it) claiming that it will revolutionize development forever, and on the other side we have developers working on Y (LLM bad at it) claiming that it's just a fad.
I think this is right on, and the things that LLMs excel at (react components was your example) are really the things that there's just such a ridiculous amount of training data for. This is why LLMs are not likely to get much better at code. They're still useful, don't get me wrong, but the 5x expectations need to get reined in.
A breadth and depth of training data is important, but modern models are excellent at in-context learning. Throw them documentation and outline the context for what they're supposed to do and they will be able to handle some out-of-distribution things just fine.
I would love to see some detailed failure cases of people who used agentic LLMs and didn't make it work. Everyone is asking for positive examples, but I want to see the other side.
- on some topics, I get the 100x productivity that is pushed by some devs; for instance, this Saturday I was able to build two features that I had been rescheduling for years because, for lack of knowledge, they would have taken me many days to make, but a few back-and-forths with an LLM and everything was working as expected; amazing!
- on other topics, no matter how I explain the issue to an LLM, at best it tells me that it's not solvable, at worst it pushes an answer that doesn't make any sense, and pushes an even worse one when I point it out...
And when people ask me what I think about LLMs, I say: "that's nice and quite impressive, but it still can't be blindly trusted and needs a lot of overhead, so I suggest caution".
I guess it's the classic glass half empty or half full.
>I think what we should really ask ourselves is: “Why do LLM experiences vary so much among developers?”
Two of the key skills needed for effective use of LLMs are writing clear specifications (written communication), and management, skills that vary widely among developers.
There’s no clearer specifications than code, and I can manage my toolset just fines (lines of config, alias, and what not to make my job easier). That allowed me to deliver good results fine and fast without worrying if it’s right this time
> B) Aren't experienced enough in the target domain to see the problems with the AI's output (let's call this "slop blindness")
Oh, boy, this. For example, I often use whatever AI I have to adjust my Nix files because the documentation for Nix is so horrible. Sure, it's slop, but it gets me working again and back to what I'm supposed to be doing instead of farting with Nix.
I would also argue:
D) The fact that an AI can do the task indicates that something with the task is broken.
If an AI can do the task well, there is something fundamentally wrong. Either the abstractions are broken, the documentation is horrible, the task is pure boilerplate, etc.
Yeah except I do that with Claude Code, and my output is still shit most of the time. It might save me a little time or typing, but it definitely needs hand-editing. The people who say Claude is a junior dev (at best) are right.
That's why I think a lot of people who think it's a miracle probably aren't experienced enough to see the bugs.
"expect an LLM to correctly produce the complex thing they have in mind"
My guess is that for some types of work people don't know what the complex thing they have in mind is ex ante. The idea forms and is clarified through the process of doing the work. For those types of task there is no efficiency gain in using AI to do the work.
"Just start iterating in chunks alongside the LLM".
For those types of tasks it probably takes the same amount of time to form the idea without AI as with AI, which is what METR found in its study of developer productivity.
That study design has some issues. But let's say it takes me the same amount of time, the agentic flow is still beneficial to me. It provides useful structure, helps with breaking down the problem. I can rubber duck, send off web research tasks, come back to answer questions, etc., all within a single interface. That's useful to me, and especially so if you have to jump around different projects a lot (consultancy). YMMV.
> Why do LLM experiences vary so much among developers?
The question assumes that all developers do the same work. The kind of work done by an embedded dev is very different from the work of a front-end dev which is very different from the kind of work a dev at Jane Street does. And even then, devs work on different types of projects: greenfield, brownfield and legacy. Different kind of setups: monorepo, multiple repos. Language diversity: single language, multiple languages, etc.
Devs are not some kind of monolith army working like robots in a factory.
We need to look at these factors before we even consider any sort of ML.
I've known lots of people that don't know how to properly use Google, and Google has been around for decades. "You're using it wrong" is partially true, I'd say more something like "it is a new tool that changes very quickly, you have to invest a lot of time to learn how to properly use it, most people using it well have been using it a lot over the last two years, you won't catch up in an afternoon. Even after all that time, it may not be the best tool for every job" (proof on the last point being Karpathy saying he wrote nanochat mostly by hand).
It is getting easier and easier to get good results out of them, partially by the models themselves improving, partially by the scaffolding.
> non-developer public is mostly convinced that AI is actually artificial intelligence, rather than a very sophisticated next-word predictor
This is a false dichotomy that assumes we know way more about intelligence than we actually do, and also assumes than what you need to ship lots of high quality software is "intelligence".
>While claiming that an LLM cannot follow a simple instruction sounds, at best, very unlikely, it remains true that these models cannot reliably deliver complex work.
"reliably" is doing a lot of work here. If it means "without human guidance" it is true (for now), if it means "without scaffolding" it is true (also for now), if it means "at all" it is not true, if it means it can't increase dev productivity so that they ship more at the same level of quality, assuming a learning period, it is not true.
I think those conversations would benefit a lot from being more precise and more focused, but I also realize that it's hard to do so because people have vastly different needs, levels of experience, expectations; there are lots of tools, some similar, some completely different, etc.
To answer your question directly, ie “Why do LLM experiences vary so much among developers?”: because "developer" is a very very very wide category already (MISRA C on a car, web frontend, infra automation, medical software, industry automation are all "developers"), with lots of different domains (both "business domains" as in finance, marketing, education and technical domains like networking, web, mobile, databases, etc), filled with people with very different life paths, very different ways of working, very different knowledge of AIs, very different requirements (some employers forbid everything except a few tools), very different tools that have to be used differently.
This is very similar to Tesla's FSD adoption in my mind.
For some (me), it's amazing because I use the technology often despite its inaccuracies. Put another way, it's valuable enough to mitigate its flaws.
For many others, it's on a spectrum between "use it sometimes but disengage any time it does something I wouldn't do" and "never use it" depending on how much control they want over their car.
In my case, I'm totally fine handing driving off to AI (more like ML + computer vision) most times but am not okay handing off my brain to AI (LLMs) because it makes too many mistakes and the work I'd need to do to spot-check them is about the same as I'd need to put in to do the thing myself.
I know Car People who refuse to use even lane keeping assist, because it doesn't fit their driving style EXACTLY and it grates them immensely.
I on the other hand DGAF, I love how I don't need to mess with micro adjustments of the steering wheel on long stretches of the road, the car does that for me. I can spend my brainpower checking if that Gray VW is going to merge without an indicator or not.
Same with LLM, some people have a very specific style of code they want to produce and anything that's not exactly their style is "wrong" and "useless". Even if it does exactly what it should.
That's where I stand now. I use LLMs in some agentic coding way 10h/day to great effect. If someone doesn't see or realize the value, then that's their loss.
It’s because people are using different tiers of AI and different models. And many people don’t stick with it long enough to get a more nuanced outlook of AI.
Take Joe. Joe sticks with AI and uses it to build an entire project. Hundreds of prompts. Versus your average HNer who thinks he’s the greatest programmer in the company and thinks he doesn’t need AI but tries it anyway. Then AI fails and fulfills his confirmation bias and he never tries it again.