
> People have made the point many times before, that “hallucination” is mainly what generative AI does and not the exception, but most of the time it’s a useful hallucination.

Oh yeah, that's exactly what I want from a machine intelligence, a "best friend who knows everything about me": one that just makes shit up that it thinks I'd like to hear. I'd really love a personal assistant that gets me and my date a reservation at a restaurant that doesn't exist. That'll really spice up the evening.

The mental gymnastics in the AI community are truly pushing the boundaries of parody at this point. If your machines mainly generate bullshit, they cannot be serious products. If, on the other hand, they're intelligent, why do they make up so much shit? You can't have it both ways and expect to be taken seriously.



One of the main reasons LLMs are unintuitive and difficult to use is that you have to learn how to get useful results out of fundamentally unreliable technology.

Once you figure out how to do that they're absurdly useful.

Maybe a good analogy here is working with animals? Guide dogs, sniffer dogs, falconry... all cases where you can get great results but you have to learn how to work with a very unpredictable partner.


> One of the main reason LLMs are unintuitive and difficult to use is that you have to learn how to get useful results out of fundamentally unreliable technology.

Name literally any other technology that works this way.

> Guide dogs, sniffer dogs, falconry...

Guide dogs are an imperfect solution to an actual problem: some people's inability to see. And dogs respond to training far more reliably than LLMs respond to prompts.

Sniffer dogs are at least in part bullshit and have been shown in many studies to respond to the subtle cues of their handlers far more reliably than to anything they actually smell. And the best part is that they also (completely outside their own control, mind you) ruin lives by falsely detecting drugs in cars that look the way the officer handling them thinks a car with drugs inside looks.

And falconry is a hobby.


"Name literally any other technology that works this way"

Since you don't like my animal examples, how about power tools? Chainsaws, table saws, lathes... all examples of tools where you have to learn how to use them before they'll be useful to you.

(My inability to come up with an analogy you find convincing shouldn't invalidate my claim that "LLMs are unreliable technology that is still useful if you learn how to work with it" - maybe this is the first time that's ever been true for an unreliable technology, though I find that doubtful.)


The correct name for unreliable power tools is "trash".


which happens to be the correct name for A"I" too


> Name literally any other technology that works this way.

The internet for one.

Not the internet itself (although it certainly can be unreliable), but rather the information on it.

Which I think is more relevant to the argument anyway, as LLMs do in fact reliably function exactly the way they were built to.

Information on the internet is inherently unreliable. It’s only when you consider externalities (like the reputation of the source) that its information can be made “reliable”.

Information that comes out of LLMs is inherently unreliable. It’s only through externalities (such as online research) that its information can be made reliable.

Unless you can invent a truth machine that somehow can tell truth from fiction, I don’t see either of these things becoming reliable, stand-alone sources of information.


> Name literally any other technology that works this way.

Probabilistic prime number tests.

I'm being slightly facetious. Such tests differ from LLMs in the crucial respect that we can quantify their probability of failure. And personally I'm quite skeptical of LLMs myself. Nevertheless, there are techniques that can help us use unreliable tools in reliable ways.
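
For the curious, a minimal sketch of what "quantifiable unreliability" looks like, using the standard Miller-Rabin test (plain Python, nothing LLM-specific): each extra round provably shrinks the false-positive probability, to at most 4^-k after k rounds.

    import random

    def is_probably_prime(n: int, rounds: int = 40) -> bool:
        """Miller-Rabin: False means definitely composite; True means the
        chance of being wrong is at most 4**-rounds."""
        if n < 2:
            return False
        for p in (2, 3, 5, 7, 11, 13):
            if n % p == 0:
                return n == p
        # write n - 1 as d * 2**r with d odd
        d, r = n - 1, 0
        while d % 2 == 0:
            d //= 2
            r += 1
        for _ in range(rounds):
            a = random.randrange(2, n - 1)
            x = pow(a, d, n)
            if x in (1, n - 1):
                continue
            for _ in range(r - 1):
                x = pow(x, 2, n)
                if x == n - 1:
                    break
            else:
                return False  # witness found: n is composite
        return True  # "probably prime", with a known error bound

    # e.g. is_probably_prime(2**127 - 1) -> True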


> Name literally any other technology that works this way.

How about people? They make mistakes all the time, disobey instructions, don’t show up to work, occasionally attempt to embezzle or sabotage their employers. Yet we manage to build huge successful companies out of them.


> Once you figure out how to do that they're absurdly useful

I have read some of your posts advancing that claim, but never one with the details: do you mean mostly "prompt engineering", or "application selection", or "system integration"...?


Typing code faster. Building quick illustrative prototypes. Researching options for libraries (that are old and stable enough to be in the training data). Porting code from one language to another (surprisingly) [1]. Using as a thesaurus. Answering questions about code (like piping in a whole codebase and asking about it) [2]. Writing an initial set of unit tests. Finding the most interesting new ideas in a paper or online discussion thread without reading the whole thing. Building one-off tools for converting data. Writing complex SQL queries. Finding potential causes of difficult bugs. [3]

[1] I built https://tools.simonwillison.net/hacker-news-thread-export this morning from my phone using that trick: https://claude.ai/share/7d0de887-5ff8-4b8c-90b1-b5d4d4ca9b84

[2] Examples of that here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/#b...

[3] https://simonwillison.net/2024/Sep/25/o1-preview-llm/ is an early example of using a "reasoning" model for that

Or if you meant "what do you have to figure out to use them effectively despite their flaws?", that's a huge topic. It's mostly about building a deep intuition for what they can and cannot help with, then figuring out how to prompt them (including managing their context of inputs) to get good results. The most I've written about that is probably this piece: https://simonwillison.net/2025/Mar/11/using-llms-for-code/


All of that is very interesting. Side note: don't you agree that "answering questions about documentation with 100% reliability" would be a more than desirable further feature? (Think of those shell command options that are so confusing they made it into xkcd material.) But that would mean achieving production-grade RAG, and that in turn would be a revolution in LLMs, which would revise your list above...


LLMs can never provide 100% reliability - there's a random number generator in the mix after all (reflected in the "temperature" setting).
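
To make the randomness concrete, here's a toy sketch (made-up logits, no particular model or API) of temperature-scaled sampling, which is roughly what that setting controls:

    import math, random

    def sample_next_token(logits: dict[str, float], temperature: float = 0.8) -> str:
        """Toy illustration: scale logits by 1/temperature, softmax, then draw
        at random. Lower temperatures concentrate probability on the top token;
        higher ones flatten the distribution, so repeated runs can differ."""
        scaled = {tok: l / temperature for tok, l in logits.items()}
        m = max(scaled.values())
        weights = {tok: math.exp(l - m) for tok, l in scaled.items()}
        r = random.random() * sum(weights.values())
        for tok, w in weights.items():
            r -= w
            if r <= 0:
                return tok
        return tok  # fallback for floating-point rounding

    # e.g. sample_next_token({"Paris": 5.1, "London": 3.9, "Rome": 2.2})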

For documentation answering, the newer long-context models are wildly effective in my experience. You can dump a million tokens (easily a full codebase or two for most projects) into Gemini 2.5 Pro and get great answers to almost anything.
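
As a rough sketch of what "dumping a codebase" can look like before the API call (the file extensions and the ~4-characters-per-token estimate are assumptions, not anything Gemini-specific):

    from pathlib import Path

    def build_codebase_prompt(root: str, question: str, exts=(".py", ".md")) -> str:
        """Concatenate source files into one big prompt, with a crude size check."""
        parts = []
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix in exts:
                parts.append(f"--- {path} ---\n{path.read_text(errors='ignore')}")
        prompt = "\n\n".join(parts) + f"\n\nQuestion: {question}"
        approx_tokens = len(prompt) // 4  # very rough: ~4 characters per token
        print(f"~{approx_tokens} tokens")  # keep this under the model's context limit
        return prompt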

There are some new anonymous preview models with 1m token limits floating around right now which I suspect may be upcoming OpenAI models. https://openrouter.ai/openrouter/optimus-alpha

I actually use LLMs for command line arguments for tools like ffmpeg all the time; I built a plugin for that: https://simonwillison.net/2024/Mar/26/llm-cmd/


> random number generator

But the use of randomness inside the system should not, in theory, prevent as-good-as-full reliability - which suggests the architecture may be unfinished, as I noted with the RAG example. (E.g.: well-trained natural minds run checks over their provisional output, however it was obtained.)
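
A minimal sketch of that "checks over provisional output" idea, assuming a hypothetical ask_llm() callable and a task whose output can be mechanically validated (here: JSON with required keys):

    import json

    def ask_with_checks(ask_llm, prompt: str, required_keys: set[str], retries: int = 3) -> dict:
        """Treat the model's answer as provisional: validate it, and re-ask
        (feeding back the error) until it passes or retries run out."""
        last_error = ""
        for _ in range(retries):
            suffix = f"\n\nPrevious attempt was rejected: {last_error}" if last_error else ""
            raw = ask_llm(prompt + suffix)
            try:
                data = json.loads(raw)
                missing = required_keys - data.keys()
                if missing:
                    raise ValueError(f"missing keys: {missing}")
                return data  # check passed: promote provisional output to an answer
            except (json.JSONDecodeError, ValueError) as e:
                last_error = str(e)
        raise RuntimeError(f"no valid answer after {retries} attempts: {last_error}")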

> newer long context models

Practical question: if the query-relevant documentation needs to be part of the input (I am not aware of a more efficient way), doesn't that massively impact the processing time? Suppose you have to interactively examine the content of a Standard Hefty Document of 1MB of text... If so, that would make local LLM use prohibitive.


Longer context is definitely slower, especially for local models. Hosted models running on who knows what kind of overpowered hardware can crunch through them pretty fast though. There's also token caching available for OpenAI, Anthropic, Gemini and DeepSeek which can dramatically speed up processing of long context prompts if they've been previously cached.
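
As an illustration, a hedged sketch of Anthropic-style prompt caching (the model alias and the question are placeholders; check the current docs before relying on the details): marking the large document block as cacheable lets follow-up questions reuse the already-processed prefix.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def ask_about_document(document_text: str, question: str) -> str:
        """Mark the big, unchanging document block as cacheable so later
        questions against the same document skip reprocessing that prefix."""
        response = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model alias
            max_tokens=1024,
            system=[
                {"type": "text", "text": "Answer questions about the attached document."},
                {"type": "text", "text": document_text,
                 "cache_control": {"type": "ephemeral"}},  # cacheable prefix
            ],
            messages=[{"role": "user", "content": question}],
        )
        return response.content[0].text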



