If any non-trivial ask of an LLM also requires the prompts/scaffolding to be listed and independently verified along with its output, then its utility is severely diminished. These tools should be saving us time, not giving us extra homework.
That isn't what I'm saying. I'm saying you can't make a blanket statement that LLMs in general aren't fit for some particular task. There are certainly tasks where no LLM is competent, but for others, some LLMs might be suitable while others are not. At least some level of detail beyond "they used an LLM" is required to know whether a) there was user error involved, or b) an inappropriate tool was chosen.
Are they? Every foundation model release includes benchmarks showing different levels of performance across different task domains. I don't think I've seen any model advertised by the org that created it as either perfect or even equally competent across all domains.
The secondary-market snake oil salesmen <cough>Manus</cough>? That's another matter entirely, and a very high degree of skepticism toward their claims is certainly warranted. But that's no different from many other huckster-saturated domains.
People like Zuckerberg go around claiming most of their code will be written by AI starting sometime this year. Other companies are hearing that and using it as a reason (or false cover) for layoffs. The reality is that LLMs still have a way to go before replacing experienced devs, and even when they start getting there, there will be a period where we're learning what we can and can't trust them with and how to use them effectively and responsibly. Feels like at least a few years from now, but the marketing says it's now.
In many, many cases those problems are resolved by improvements to the model. The point is that making a big deal about LLM fuck-ups in 3-year-old models that don't reproduce in new ones is a complete waste of time and just spreads FUD.
Far better to just get these problems resolved.