This has been my thinking as well: the natural language interface is amazing and something we've been wanting for some time.
The generation is a showy gimmick.
So why aren't we separating the useful bit out? My sneaking suspicion is that we can't. It's a package deal in that there are no two parts; it's just one big soup of free-associating some text with other text: the stochastic parrot.
LLMs do not understand. They generate associated text that looks like an answer to the question. In order to separate the two parts, we'd need LLMs that understand. That, apparently, is a lot harder.
I've been working on this as a way of better exposing organisational knowledge recently. A few things I've observed:
* The prompt engineering side is a black art.
* Generation is pretty amazing but less useful than it first seems.
* The open source and proprietary tooling around managing and interrogating datasets for NLP work is way ahead of where it was the last time I looked, maybe a decade ago.
* It looks to me that, with appropriate thinking about indexing, there's a lot of potential here. For example, if I have a question-answer set, I can index each question and which answers point to it, get the LLM to identify the type of question I've just asked, and then use my corresponding corpus of answers to provide something useful based on the appropriate parts of the answer corpus (sketched below).
That last bit is not the automatic panacea that the flashy and somewhat gimmicky emergent properties of the generative side seem to supply, but it can get good traction on some quite difficult problems quickly, and actually quite well, using not too many local computing resources.
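Roughly what I have in mind, sketched with sentence-transformers; the model name and the toy Q&A corpus are just placeholders, and any embedding model and real corpus would slot in:

```python
# Route an incoming question to the most similar indexed question,
# then surface the answers stored against it. Model name and data
# are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Question -> answers index, built from an existing Q&A corpus.
qa_index = {
    "How do I request access to the data warehouse?": [
        "File a ticket with the data platform team.",
        "Access is granted per schema; list the schemas you need.",
    ],
    "What is our retention policy for logs?": [
        "Application logs are kept for 30 days, audit logs for 1 year.",
    ],
}

questions = list(qa_index.keys())
question_vecs = model.encode(questions, normalize_embeddings=True)

def answers_for(user_question: str, top_k: int = 1) -> list[str]:
    """Find the indexed question(s) closest to the user's question
    and return the answers stored against them."""
    q_vec = model.encode([user_question], normalize_embeddings=True)[0]
    sims = question_vecs @ q_vec      # cosine similarity (vectors are normalized)
    best = np.argsort(-sims)[:top_k]
    hits = []
    for i in best:
        hits.extend(qa_index[questions[i]])
    return hits

print(answers_for("Who do I ask for warehouse access?"))
```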
Generation isn't less useful; it is less autonomous than it seems. You need to impose a process on generations, and generate inside a framework. Think about humans: the best authors and artists have a methodical process for producing and refining their work, and if you forced them to just generate stuff with no process in one shot, they would probably produce sub-standard output. No surprise that machines which aren't at our level fare no better under the same conditions.
I wouldn't say we can't, but it would be much harder. The models were trained and optimized for next-word prediction. It is possible to chop the output layers off and replace them with something else; this is often how more open models like BERT are adapted to tasks such as classification and sentiment analysis. But pulling semantically meaningful information out of the internal states of the model is tricky, because there's not necessarily anything about the model architecture and training methods that forces it to develop internal representations that are particularly interpretable or have a straightforward application to some other task.
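For illustration, this is roughly what swapping the head looks like with Hugging Face transformers; the model name, label count, and example text are arbitrary, and the fresh head is useless until fine-tuned:

```python
# Reuse a pretrained encoder and bolt a new classification head on top.
# Model name, labels and example text are arbitrary placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,          # e.g. positive / negative sentiment
)

inputs = tokenizer("The service was slow but the staff were friendly.",
                   return_tensors="pt")
with torch.no_grad():
    # The head is randomly initialized: outputs are meaningless until fine-tuned.
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```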
They may not be all that stable, either. I would not just assume that knowing how to interpret the attention heads in the base model of GPT-4 tells you anything about what the corresponding attention heads are doing in GPT-4t or GPT-4o.
To understand, we need to take a step back, and completely deconstruct the familiar narrative.
The moment we say "an AI", we anthropomorphize the model. The narrative gets even more derailed when we say "Large Language Model". No LLM actually contains a grammar or defined words. An LLM can't think logically or objectively about subjects the way a person can, either.
I propose that we instead call them "Large Text Models". An LTM is a model made by training a neural net on text. Sure, the text itself was written with language, but no part of the training process, or of its result, treats it as language.
The really cool trick an LTM pulls off is constructing continuations whose content just happens to be indistinguishable from language. This works because the intentions of the original human writers (encoded into the original text dataset) did consistently follow language rules. The problem is that those writers did not encode the truth or falsity of future LTM continuations. Truth and lie are written the same way, and that ambiguity lives in the LTM forever.
Most presentations of generative AI come across like amnesiac, superficial polymaths. Perhaps beyond negative and positive prompts, they need some memory across queries and across users, a feedback path (hard problem), an ability to discern reputability (very hard problem), and an ability to search and train on new domain-specific information on-the-fly.
Hard, and also might not be particularly commercially viable. Once you give a model an individual memory, you lose the standardization and consistency of behavior that is important for scaling up commercial applications.
I would not want, for example, a version of Copilot that slowly gets better at helping me with the task I'm working on and then suddenly and unpredictably reverts back to zero just because the k8s pod running the instance I had been interacting with got recycled. Consistently mediocre behavior would be preferable to the AI equivalent of pair-programming speed dating.
My point is that to piece together an almost AGI agent or at least more useful generative AI requires these changes. A killer app isn't a singular magic thing but many small advancements. OpenAI still isn't even close, but it's demoing bits and pieces that are closer.
Natural language search and answer generation _are_ completely separate.
Search is often (but not exclusively) performed using cosine similarity over semantic vectors. Such vectors are produced by embedding models, which represent the meaning of a document as a fixed-length vector called an embedding; 768 dimensions is a common size.
You calculate the embedding for every document in your database ahead of time (at insertion), calculate the embedding for the user's query, and then search for the documents closest to the query using a similarity metric of your choice, such as cosine.
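A bare-bones sketch of that pipeline, assuming sentence-transformers for the embedding model (all-mpnet-base-v2 happens to produce 768-dimensional vectors) and a toy in-memory document list:

```python
# Precompute document embeddings once (at insertion time), then at query
# time embed the query and rank documents by cosine similarity.
# Model name and documents are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")   # 768-dimensional embeddings

documents = [
    "Reset your password from the account settings page.",
    "Invoices are emailed on the first business day of the month.",
    "API keys can be rotated without downtime.",
]
doc_vecs = model.encode(documents)                 # done ahead of time

def search(query: str, top_k: int = 2) -> list[tuple[float, str]]:
    q = model.encode([query])[0]
    # cosine similarity = dot product of L2-normalized vectors
    sims = (doc_vecs @ q) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)[:top_k]
    return [(float(sims[i]), documents[i]) for i in order]

print(search("how do I change my password?"))
```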
Nothing prevents you from serving documents found this way directly, instead of using them to generate answers. Part of the Google search pipeline involves something like this, and many full-text search products do it too (Algolia is one example: https://www.algolia.com/blog/ai/what-is-vector-search/).
LLMs and generation are used to synthesize the answer and tie it back to the question, applying a layer of soft judgment based on the LLM's prior knowledge. This works out great in some contexts, less so in others, as you pointed out. But these components aren't coupled in any way.
It really isn't? You can tell it to output in a JSON structure (or some other format) of your choice and it will, with high reliability. You control the output.
Honestly, I wonder if the people who criticize LLMs have made a serious attempt to use them for anything.
I made a serious attempt to do precisely that, and yes, it output a valid JSON structure highly reliably. The problem was stopping it from just inventing values for parameters that weren't actually specified by the user.
Consider the possibility that at least some of the criticisms of LLMs are a result of serious attempts to use them.
llama.cpp has a way to constrain responses to a grammar, which is 100% reliable because it is enforced in the inference loop itself. You still need to tell the model to produce a certain format to get good results, though.
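From memory, the llama-cpp-python bindings expose this roughly as follows; the model path is a placeholder and the grammar is a deliberately tiny example:

```python
# Constrain llama.cpp output to a small GBNF grammar via the llama-cpp-python
# bindings. Model path is a placeholder; the grammar only admits one JSON shape.
from llama_cpp import Llama, LlamaGrammar

GBNF = r'''
root  ::= "{" ws "\"sentiment\"" ws ":" ws label ws "}"
label ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
ws    ::= [ \t\n]*
'''

llm = Llama(model_path="models/some-local-model.gguf")   # placeholder path
grammar = LlamaGrammar.from_string(GBNF)

out = llm(
    'Classify the sentiment of: "The update broke my workflow again." '
    "Respond as JSON.",          # still worth asking for the format in the prompt
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])
```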
> You can tell it to output in a JSON structure (or some other format) of your choice and it will, with high reliability.
I mean, this is provably false. Have you tried to use LLMs to generate structured JSON output? Not only do all LLMs suck at reliably following a schema, you need to use all kinds of "forcing" to make sure the output is actually JSON anyway. By "forcing" I mean either (1) multi-shot prompting: "no, not like that," if the output isn't valid-ish JSON; or (2) literally stripping out—or rejecting—illegal tokens (which is what llama.cpp does[1][2]). And even with all of that, you still won't really have a production-ready pipeline in the general case.
Beyond this, an LLM can easily become confused even if outputting JSON with a valid schema. For instance, we've had mixed results trying to get an LLM to report structured discrepancies between two multi-paragraph pieces of text, each of which might be using flowery language that "reminds" the LLM of marketing language in its training set. The LLM often gets as confused as a human would, if the human were quickly skimming the text and forgetting which text they're thinking about - or whether they're inventing details from memory that are in line with the tone of the language they're reading. These are very reasonable mistakes to make, and there are ways to mitigate the difficulties with multiple passes, but I wouldn't describe the outputs as highly reliable!
I would have agreed with you six months ago, but the latest models - Claude 3, GPT-4o, maybe Llama 3 as well - are much more proficient at outputting JSON correctly.
Seems logical that they will always implement specialized pathways for the most critical and demanding user base. At some point they might even do it all by hand and we wouldn’t know /s
Yes, I'm using them quite extensively in my day-to-day work for extracting numerical data from unstructured documents. I've been manually verifying the JSON structure and numerical outputs, and it's highly accurate for the corpus I'm processing.
FWIW I'm using GPT-4o, not Llama; I've tried Llama for local tasks and found it pretty lacking in comparison to GPT.
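For the curious, the general shape of this kind of extraction with the OpenAI client and JSON mode looks something like the sketch below; the schema keys and the sample document are invented for illustration, not my actual pipeline:

```python
# Extract figures from unstructured text as JSON using the OpenAI client.
# The document and key names are made up for illustration.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

document = "Q3 revenue came in at $4.2M, up 12% year over year, with churn at 3.1%."

resp = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},   # guarantees syntactically valid JSON
    messages=[
        {"role": "system",
         "content": "Extract figures from the text. Reply with JSON using keys "
                    "revenue_usd, yoy_growth_pct, churn_pct. Use null if a value "
                    "is not stated."},
        {"role": "user", "content": document},
    ],
)

data = json.loads(resp.choices[0].message.content)
print(data)   # valid JSON is guaranteed; the *values* still need checking
```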
Your comment has an unnecessary and overly negative tone to it that doesn't do this tech justice. These approaches are totally valid and can get you great results. An LLM is just a component in a pipeline. I deployed many of these in production without a hiccup.
Guidance (the industry term for "constraining" the model output) is only there to ensure the output follows a particular grammar. If you need JSON to fit a particular schema or format, you can always validate it. In case of validation failure, you can pass the JSON and the validation result back to the LLM so it can correct it.
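A minimal sketch of that validate-and-retry loop, with a placeholder call_llm() standing in for whatever client you use and an invented schema:

```python
# Validate model output against a JSON Schema and feed failures back for repair.
# call_llm() is a placeholder; the schema is invented for illustration.
import json
import jsonschema

SCHEMA = {
    "type": "object",
    "properties": {
        "name":  {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
    "additionalProperties": False,
}

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def extract_with_retry(prompt: str, max_attempts: int = 3) -> dict:
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = call_llm(attempt_prompt)
        try:
            data = json.loads(raw)
            jsonschema.validate(data, SCHEMA)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            # Send the bad output and the validator's complaint back to the model.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply was:\n{raw}\n\n"
                f"It failed validation with: {err}\nReturn corrected JSON only."
            )
    raise ValueError("model never produced schema-valid JSON")
```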
> Have you tried to use LLMs to generate structured JSON output? Not only do all LLMs suck at reliably following a schema, you need to use all kinds of "forcing" to make sure the output is actually JSON anyway.
Yeah it's worked about fifty thousand times for me without issues in the past few months for several NLP production pipelines.
I don't think it's entirely an architecture problem; there's a huge training-set problem. Almost all the content you would train on assumes outside knowledge and facts, so the LLM learns to store these in the model in order to maximize its completion ability. However, a lot of this assumed knowledge and these generalizations are actually unhelpful or harmful for these sorts of cases.
If you wanted to create an LLM that significantly improved on this, you would probably need to massively clean up and reorganize your training data so that you always provide sufficient context and the LLM is disincentivized from baking in "facts". But I'm not sure this is tractable to do currently at the scale of data needed.
The LLM miracle comes from the massive amount of text we can use to train it on; removing that advantage makes LLMs untenable. An idea I've had for a while is to do the opposite: generate nonsense text according to some complex formula, and have the AI learn to predict that. It can't possibly encode any facts, because there are no facts. Now show it English, and it will treat it just like any other sort of nonsense text that it's gotten good at learning to interpret.
But that idea you describe is exactly what would make the LLM stop working. The "LLM miracle" comes from the fact that all that text is not random[0]. There is a lot of information encoded in what phrases, sentences, paragraphs have been written (and how often), vs. a vastly larger amount of nearly identical texts that were not written. The "complex formula" used is... reality, as perceived and understood by people. LLMs pick up on that.
--
[0] - Well, most of it anyway; I bet the training set contains some amount of purely random text, for example technical articles discussing RNGs and showcasing their output. Some amount of noise is unavoidable.
The idea would be to generate a false "reality" for the LLM to learn about. You would randomly generate a system of rules, use those rules to generate text, and then train the LLM to predict that text. The goal would be to get it to stop encoding reality proper in its weights, and focus on learning to pick up what reality looks like very quickly from text.
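To make that concrete, a toy version might look like the following; the random rule system is a deliberately crude stand-in for whatever "false reality" generator you'd actually want:

```python
# Sample a random rule system (a toy grammar over nonsense tokens), then emit
# text that follows it. A model trained on many such systems would have no
# real-world facts to memorize, only the skill of inferring the rules.
import random
import string

def random_token() -> str:
    return "".join(random.choices(string.ascii_lowercase, k=random.randint(2, 5)))

def random_grammar(n_nonterminals: int = 5, n_terminals: int = 20) -> dict:
    terminals = [random_token() for _ in range(n_terminals)]
    nts = [f"N{i}" for i in range(n_nonterminals)]
    rules = {}
    for nt in nts:
        # Each nonterminal gets a few expansions mixing terminals and nonterminals.
        rules[nt] = [
            [random.choice(terminals + nts[1:]) for _ in range(random.randint(1, 4))]
            for _ in range(random.randint(2, 4))
        ]
    return rules

def generate(rules: dict, symbol: str = "N0", depth: int = 0) -> str:
    if symbol not in rules or depth > 6:       # depth cap keeps expansion finite
        return symbol if symbol not in rules else ""
    expansion = random.choice(rules[symbol])
    return " ".join(filter(None, (generate(rules, s, depth + 1) for s in expansion)))

grammar = random_grammar()
for _ in range(3):
    print(generate(grammar))
```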
Bonus points for one of the most delightfully creative ideas I’ve heard in some time. I don’t think it will work (the space of "not reality" is superexponentially larger than the space of "this describes reality") but I’m just happy to be thinking about nonstandard ML again.
(I’ve dubbed this sort of thing “nonstandard ML" since, like you, I have a fondness for thinking of unorthodox solutions that seem plausible.)
It will just learn your formula and won't generalize to anything else. It would essentially have to unlearn that formula when you started training on English, so it would make training slower.
It was trained as a chat bot, so that's all the current training can give you. If you want to use it, you have to hack something useful out of the chat-bot interface.
It was trained as a chat bot because that was the impressive thing that got them investors. Useful applications need a lot more context to awe people, and context takes time and work to create.
So, now that they've got money, are the companies that created those LLM chat bots building useful next-generation engines behind the scenes? Well, absolutely not! The same situation applies at every funding round, and they need to show impressive results right now to keep having money.
(And now I wonder... Why do VC investors exist again?)
> they need to show impressive results right now to keep having money
Sure. You do as little as possible to make as much money as possible. This is a fundamental of commerce/human existence. But, at some point, it will end with everyone's models performing similarly, with free models catching up. The concept of sustained "impressive results" will eventually require actual "reasoning systems". They'll use all that accumulated wealth, from what you maybe perceive as low hanging fruit, to tackle it. I think it must be assumed that these AI companies are intentionally working toward that, especially since it's the stated goal of many of them. I think you must assume that these people are smart, and they can see the reality of their own systems.
You can generate JSON or other machine-readable formats to use as inputs to APIs, allowing the LLM to directly operate whatever software or hardware you want. You can't remove next-token prediction without fundamentally changing the architecture and losing all of the benefits (unless you invent the next big thing, of course). Each generated token has to go back in order to get the next one. Perhaps if you could simplify your API to a single float value you could do it in a single step, but I doubt this would work as well. Progress will continue to be made through further training and fine-tuning until another significant discovery is made.
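A sketch of that JSON-as-control-surface idea; the action names and the canned model output are made up for illustration:

```python
# The model emits a JSON action; the host program parses it and calls the
# matching function. Action names and llm_output are invented placeholders.
import json

def set_thermostat(celsius: float) -> str:
    return f"thermostat set to {celsius}°C"

def toggle_lights(room: str, on: bool) -> str:
    return f"lights in {room} turned {'on' if on else 'off'}"

ACTIONS = {"set_thermostat": set_thermostat, "toggle_lights": toggle_lights}

# In reality this string would come back from the model after a prompt
# describing the available actions and their arguments.
llm_output = '{"action": "toggle_lights", "args": {"room": "kitchen", "on": true}}'

call = json.loads(llm_output)
handler = ACTIONS.get(call["action"])
if handler is None:
    raise ValueError(f"model requested unknown action {call['action']!r}")
print(handler(**call["args"]))
```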