Exactly my point of view. For the most part I do not root for my preferred technology, but rather try to inform the powers that be about the caveats I see. This way at least the right aspects to check have a chance to enter the debates above my pay grade.
Yes, they can. Meta described a clever way in their paper on training Llama 3 [1] (in the section on factuality).
The idea is to sample several answers to a question whose answer you already know. Let an LLM decide whether the given answers differ from the known truth. If so, you have found a question that you can train your LLM, in the next post-training round, to answer with "I don't know".
Do that a couple hundred times and your LLM will develop neurons that indicate doubt and, from then on, have the ability to answer with "I don't know".
[Edit] The article here also mentions a paper [2] that comes up with the idea of an uncertainty token, so there the incorporation of uncertainty is already baked in at pre-training. [/Edit]
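As a rough sketch (all helper names, the judge prompt and the abstention threshold are made up here, not Meta's actual pipeline), the data-generation loop looks something like this:

```
# Sketch of the "I don't know" data-generation idea described above.
# The model/judge objects, their generate() method and the QA data are
# hypothetical placeholders.

def generate_answers(model, question, n=8, temperature=1.0):
    """Sample n answers from the model for the same question."""
    return [model.generate(question, temperature=temperature) for _ in range(n)]

def judge_matches_truth(judge, question, answer, ground_truth):
    """Ask a judge LLM whether the answer agrees with the known truth."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {answer}\n"
        "Does the candidate agree with the reference? Answer yes or no."
    )
    return judge.generate(prompt).strip().lower().startswith("yes")

def build_idk_training_data(model, judge, qa_pairs):
    """Collect questions the model gets wrong and pair them with 'I don't know'."""
    idk_examples = []
    for question, ground_truth in qa_pairs:
        answers = generate_answers(model, question)
        wrong = [a for a in answers
                 if not judge_matches_truth(judge, question, a, ground_truth)]
        # If the model is mostly wrong about this question, teach it to abstain.
        if len(wrong) / len(answers) > 0.5:
            idk_examples.append({"prompt": question, "response": "I don't know."})
    return idk_examples
```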
We have a couple of systems at work that incorporate LLMs. There are a bunch of RAG chatbots for large documentation collections and a bunch of extract-info-from-email bots. I would not call any of these an agent.
The one thing that comes close to an agent is a bot that can query a few different SQL and API data sources. Given a user's text query, it decides on its own which tool(s) to use. It can also retry or reformulate its task. The agentic parts are mainly done in LangGraph.
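Stripped of the LangGraph plumbing, the core routing loop is roughly this (the tool names, the `llm.generate` interface and the retry limit are illustrative placeholders, not our actual setup):

```
# Illustrative tool-routing loop, not the real LangGraph implementation.
# `llm`, the tool functions and the prompt wording are assumed placeholders.

TOOLS = {
    "sales_sql": lambda q: "rows from the sales database",    # hypothetical SQL tool
    "ticket_api": lambda q: "tickets from the support API",   # hypothetical API tool
}

def route(llm, user_query):
    prompt = (
        f"User query: {user_query}\n"
        f"Available tools: {', '.join(TOOLS)}\n"
        "Reply with exactly one tool name."
    )
    return llm.generate(prompt).strip()

def answer(llm, user_query, max_retries=2):
    for attempt in range(max_retries + 1):
        tool_name = route(llm, user_query)
        if tool_name in TOOLS:
            evidence = TOOLS[tool_name](user_query)
            return llm.generate(f"Answer '{user_query}' using: {evidence}")
        # Tool choice was unusable: let the model reformulate and try again.
        user_query = llm.generate(f"Rephrase this request more explicitly: {user_query}")
    return "Sorry, I could not route your request."
```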
Boltzmann machines were there in the very early days of deep learning. They were a clever hack to train deep nets layer-wise and work with limited resources.
Each layer was trained similarly to the encoder part of an autoencoder. This way the layer-wise transformations were not random, but roughly kept some of the original data's properties. Up to this point, training was done without the use of labelled data. After this training stage was done, you had a very nice initialization for your network and could train it end to end according to your task and target labels.
If I recall correctly, the layers' outputs were probabilistic. Because of that you couldn't simply use backpropagation to learn the weights. Maybe this is the connection to John Hopfield's work. But here my memory is a bit fuzzy.
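For anyone curious, here is a toy sketch of that layer-wise scheme with plain autoencoder layers (the sizes, data and hyperparameters are made up; the historical work used RBMs and contrastive divergence rather than MSE autoencoders):

```
# Toy greedy layer-wise pretraining with autoencoder-style layers.
# Sizes, data and hyperparameters are arbitrary; real DBNs used RBMs instead.
import torch
import torch.nn as nn

X = torch.randn(1024, 64)          # unlabelled data; no labels needed yet
sizes = [64, 32, 16]

encoders, inputs = [], X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    enc, dec = nn.Linear(d_in, d_out), nn.Linear(d_out, d_in)
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(200):           # train this layer to reconstruct its own input
        opt.zero_grad()
        recon = dec(torch.sigmoid(enc(inputs)))
        nn.functional.mse_loss(recon, inputs).backward()
        opt.step()
    encoders.append(enc)
    inputs = torch.sigmoid(enc(inputs)).detach()   # becomes the next layer's input

# Stack the pretrained encoders, add a task head, then fine-tune end to end with labels.
model = nn.Sequential(
    encoders[0], nn.Sigmoid(),
    encoders[1], nn.Sigmoid(),
    nn.Linear(sizes[-1], 10),
)
```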
Boltzmann machines were there in the 1980s, and they were created on the basis of Hopfield nets (augmenting them with statistical-physics techniques, among other reasons to better navigate the energy landscape without getting stuck in local optima so often).
From the people dissing the award here, it seems like even a particularly benign internet community like HN has little notion of ML with ANNs before Silicon Valley bought in for big money circa 2012. And the media reporting from then on hasn't exactly helped.
ANNs go back a good deal further still (as the updated post does point out), but the works cited for this award really are foundational for the modern form in a lot of ways.
As for DL and backpropagation: maybe things could have been otherwise, but in the reality we actually got, optimizing deep networks with backpropagation alone never got off the ground on its own. Around 2006, Hinton started getting it to work by building the network up layer-wise from Restricted Boltzmann Machines (the lateral connections within a layer are eliminated from the full Boltzmann Machine), resulting in what was termed a Deep Belief Net. It basically did its job already, but once it had been initialized with the stack of RBMs it could then be fine-tuned with backprop for performance.
An alternative approach with layer-wise autoencoders (also a technique essentially created by Hinton) soon followed.
Once these approaches had shown that deep ANNs could work, though, analysis showed fairly soon that the random weight initializations used back then (especially when combined with the historically popular sigmoid activation function) resulted in very poor scaling of the gradients for deep nets, which all but eliminated the flow of feedback. It might have optimized eventually, but only after a far longer wait than was feasible on the computers of the time. Once the problem was understood, people made tweaks to the weight initialization, the activation function and the optimization more generally, and then in many cases going directly to supervised backprop did work. I'm sure those tweaks are taken for granted to the point of being forgotten today, when one's favourite highly optimized Deep Learning library silently applies the basic ones without so much as being asked. But take away the normalizations and the Glorot (or whatever) initialization, and it could easily mean a trip back to rough times getting your train-from-scratch deep ANN to start showing results.
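For reference, the Glorot/Xavier uniform initialization mentioned above is only a few lines (a small numpy sketch; the layer sizes are arbitrary examples):

```
# Glorot/Xavier uniform initialization: scale weights by fan-in and fan-out
# so activations and gradients keep roughly comparable variance across layers.
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng(0)):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1 = glorot_uniform(784, 256)   # arbitrary layer sizes, just for illustration
W2 = glorot_uniform(256, 64)
```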
I didn't expect this award, but I think it's great to see Hinton recognized again. And precisely because almost all modern coverage is too lazy to track down history earlier than the 2010s, not least Hopfield's foundational contribution, I think it is all the more important that the Nobel foundation did.
So going back to the original question above: there are so many bad, confused versions of neural network history going around that whether or not this one is widely accepted isn't a good measure of quality. For what it's worth, to me it seems a good deal more complete and veridical than most encountered today.
I second that thought. There is a pretty well-cited paper from the late eighties called "Multilayer Feedforward Networks are Universal Approximators". It shows that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function. For non-continuous functions, additional layers are needed.
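You can see the single-hidden-layer claim in action with a few lines of scikit-learn (an illustrative toy; the target function and layer width are arbitrary choices):

```
# A single hidden layer approximating a continuous function (illustration only).
import numpy as np
from sklearn.neural_network import MLPRegressor

X = np.linspace(-3, 3, 2000).reshape(-1, 1)
y = np.sin(2 * X).ravel() + 0.3 * X.ravel() ** 2    # some continuous target

net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=5000, random_state=0)
net.fit(X, y)
print("max abs error:", np.max(np.abs(net.predict(X) - y)))
```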
"One Bit Computing at 60 Hz" describes a one-bit design of my own that folks have repeatedly posted to HN. It's notable for NOT using the MC14500... (and for puzzling some of the readers!)
The original 2019 post by Garbage [1] attracted the most comments. But in a reply to one of the subsequent posts [2] I talk a bit about actually coding for the thing. :)
The company I work for has tons of documentation and regulations for several areas. In some areas there are well over a thousand documents, and to make them easier to use we build RAG-based chatbots. This is why I have been playing with RAG systems on the whole spectrum from "build completely from scratch" to "connect the services in Azure". The retrieval part of a RAG system is vital for good/reliable answers, and if you build it naively, the results are underwhelming.
You can improve on the retrieved documents in many ways, like
- by better chunking,
- better embedding,
- embedding several rephrased versions of the query,
- embedding a hypothetical answer to the prompt,
- hybrid retrieval (vector similarity + keyword/TF-IDF/BM25 search; see the sketch after this list),
- making heavy use of metadata,
- introducing additional (or hierarchical) summaries of the documents,
- returning not only the chunks but also adjacent text,
- re-ranking the candidate documents,
- fine-tuning the LLM, and much, much more.
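To make the hybrid retrieval bullet concrete, here is a minimal sketch (the embedding function, the equal weighting and rank_bm25 as the BM25 implementation are assumptions, not our production setup):

```
# Minimal hybrid retrieval sketch: blend vector similarity with BM25 scores.
# `embed_fn`, `chunk_vectors` and the alpha weighting are placeholder assumptions.
import numpy as np
from rank_bm25 import BM25Okapi

def hybrid_search(query, chunks, chunk_vectors, embed_fn, top_k=5, alpha=0.5):
    # Lexical scores from BM25 over whitespace-tokenized chunks.
    bm25 = BM25Okapi([c.lower().split() for c in chunks])
    lex = np.asarray(bm25.get_scores(query.lower().split()), dtype=float)

    # Dense scores: cosine similarity between query and precomputed chunk embeddings.
    q = np.asarray(embed_fn(query), dtype=float)
    sims = chunk_vectors @ q / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q) + 1e-9
    )

    # Normalize both score sets to [0, 1] and blend them.
    lex = (lex - lex.min()) / (np.ptp(lex) + 1e-9)
    sims = (sims - sims.min()) / (np.ptp(sims) + 1e-9)
    combined = alpha * sims + (1 - alpha) * lex
    return [chunks[i] for i in np.argsort(combined)[::-1][:top_k]]
```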
However, at the end of the day a RAG system usually still has a hard time answering questions that require an overview of your data. Example questions are:
- "What are the key differences between the new and the old version of document X?"
- "Which documents can I ask you questions about?"
- "How do the regulations differ between case A and case B?"
In these cases it is really helpful to let LLMs decide how to process the prompt. This can be something simple like query routing, or rephrasing/enhancing the original prompt until something useful comes up. But it can also be agents that come up with sub-queries and a plan for how to combine the partial answers. You can also build a network of agents with different roles (like coordinator/planner, reviewer, retriever, ...) to come up with an answer.
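A bare-bones sketch of the query-routing idea (the route labels, the prompts and the `llm.generate` interface are placeholders, not our actual system):

```
# Rough sketch of LLM-based query routing for a RAG system.
# `llm` and the route handlers are hypothetical placeholders.

ROUTES = {
    "lookup":   lambda q: "answer via plain chunk retrieval",
    "compare":  lambda q: "split into sub-queries per document/case, then merge",
    "overview": lambda q: "answer from document summaries / metadata index",
}

def route_query(llm, query):
    prompt = (
        f"Classify this question as one of {list(ROUTES)}:\n{query}\n"
        "Reply with the label only."
    )
    label = llm.generate(prompt).strip().lower()
    return label if label in ROUTES else "lookup"    # fall back to plain retrieval

def answer(llm, query):
    handler = ROUTES[route_query(llm, query)]
    return handler(query)
```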
My experience has been that they are far too unpredictable to be of use.
In my testing with agent networks, it was a challenge to force them to provide a response, even an imperfect one. So if there was a "reviewer" in the pool, it seemed to cause the cycle to keep going, with no clear way of forcing it to break out.
3.5 actually worked better than 4 because it ran out of context sooner.
I am certain that I could have tuned it to get it to work, but at the end of the day it felt easier and more deterministic to do a few steps of old-fashioned data processing and then hand the data to the LLM.
That is an interesting observation. I have not run into overly long cycles, and I can think of two reasons for that.
First, maybe my use case is narrow enough that, in combination with a rather constraining and strict system message, an answer is easy to find.
Second, I have lately played a lot with locally running LLMs. Their answers often break the formatting required for the agent to proceed automatically. So maybe I just don't see the spiraling into oblivion because I run into errors early ;)
The use case we have is that we are asking the LLM to write articles.
As part of this, we tried having a reviewer agent "correct" the writer agent.
For example, in an article about a pasta-based recipe, the writer wrote a line like "grab your spoon and dig in" and then later wrote another line about "twirl your fork".
The reviewer agent is able to pick up this logical inconsistency and ask the writer to correct it. But even given an instruction like "it doesn't have to be perfect", the reviewer will continue to find fault with the writer's output on each revision, as long as the content is long enough.
One workaround is to have the reviewer look only at small paragraphs or sections instead of fixing one long article. The problem with this is that the final output can feel disjointed, since the writer is no longer working with the full context of the article. This can lead to repeated sentence structure or even full-on repeated phrases, since you're no longer applying the sampling settings across the full text.
In the end, it was more efficient and deterministic to simply write two discrete passes: 1) the writer writes the article, and 2) a separate call reviews and corrects it.
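In pseudocode, the two-pass version is roughly this (one bounded review pass instead of an open-ended agent loop; `llm` and the prompt wording are placeholders):

```
# Two discrete passes instead of an open-ended writer/reviewer agent loop.
# `llm` and the prompts are placeholder assumptions.

def write_article(llm, brief):
    return llm.generate(f"Write an article based on this brief:\n{brief}")

def review_and_correct(llm, draft):
    # One bounded pass: the reviewer rewrites the draft once and we stop.
    return llm.generate(
        "Fix logical inconsistencies (e.g. 'spoon' vs. 'fork') in the article "
        f"below and return the corrected text only:\n\n{draft}"
    )

def produce_article(llm, brief):
    draft = write_article(llm, brief)
    return review_and_correct(llm, draft)
```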
How do you get the output to be formatted correctly, without any branches?
Say for example I want a step-by-step instruction for an action.
But the response will have 1., 2., 3., and sometimes, if there are multiple pathways, there will be a long answer with 2.a, b, c, d. This is not ideal; I would rather have the simplest case (2.a.) and a short summary of the other options. I have described this in the prompt but still cannot get a nice, clean response without too many variations of the same step.
I have not encountered this problem yet. When I was talking about the format of the answer, I meant the following: no matter whether you're using LangChain, LlamaIndex, something self-made, or Instructor (just to get JSON back), somewhere under the hood there is a request to the LLM to reply in a structured way, like "answer in the following JSON format" or "just say 'a', 'b' or 'c'". ChatGPT tends to obey this rather well; most locally running LLMs don't. They answer like:
> Sure my friend, here is your requested json:
> ```
> {
> "name": "Daniel",
> "age": 47
> }
> ```
Unfortunately, the introductory sentence breaks parsing the answer directly, which means extra coding steps or tweaking your prompt.
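One of those extra coding steps can be as small as slicing the first balanced {...} block out of the reply before parsing (a naive sketch that ignores braces inside string values):

```
# Naive cleanup: pull the first {...} block out of a chatty LLM reply.
# Ignores edge cases such as braces inside string values.
import json

def extract_json(reply: str):
    start = reply.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    depth = 0
    for i, ch in enumerate(reply[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(reply[start:i + 1])
    raise ValueError("unbalanced braces in reply")

reply = 'Sure my friend, here is your requested json:\n{"name": "Daniel", "age": 47}'
print(extract_json(reply))   # {'name': 'Daniel', 'age': 47}
```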
It's pretty easy to force a locally running model to always output valid JSON: when it gives you probabilities for the next tokens, discard all tokens that would result in invalid JSON at that point (basically reverse parsing), and then apply the usual techniques to pick the completion only from the remaining tokens. You can even validate against a JSON schema that way, so long as it is simple enough.
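A toy version of that filtering step, restricted to a fixed set of allowed outputs so the "can this prefix still be completed" check stays trivial (real implementations track a full JSON grammar against the model's actual tokenizer):

```
# Toy constrained decoding: at each step keep only tokens that can still
# extend to one of the allowed outputs. Real systems check a JSON grammar
# against the model's tokenizer instead of a fixed answer list.

ALLOWED = ['{"answer": "a"}', '{"answer": "b"}', '{"answer": "c"}']

def allowed_next_tokens(prefix, candidate_tokens):
    """Discard candidates that make the prefix impossible to complete."""
    return [
        t for t in candidate_tokens
        if any(s.startswith(prefix + t) for s in ALLOWED)
    ]

def pick_token(scored_candidates, prefix):
    # scored_candidates: list of (token, probability) pairs from the model.
    valid = allowed_next_tokens(prefix, [t for t, _ in scored_candidates])
    best = max((c for c in scored_candidates if c[0] in valid), key=lambda c: c[1])
    return best[0]

# Hypothetical model step: it wants to start with "Sure", but that gets filtered out.
prefix = ""
prefix += pick_token([("Sure", 0.6), ('{"', 0.4)], prefix)
print(prefix)   # {"
```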
If that's what you need, it would make perfect sense to redo the instruction fine-tuning of the model, instead of fiddling with the prompt or post-processing to work around model behaviour that goes counter to what you want.
At the very beginning of my journey I did some fine-tuning with LoRA on (I believe) a Falcon model, but I haven't looked at it since. My impression was that injecting knowledge via fine-tuning doesn't work, but tweaking behavior does. So your answer makes a lot of sense to me. Thanks for bringing that up! I will definitely try it out.
Interesting. It seems that using an LLM as an agent to help with knowledge retrieval is one concrete use case that people keep coming back to.
It also feels like we are at a bottleneck when it comes to the knowledge retrieval problem. I wonder if the "solution" to all of these is just a smarter foundation model, which will come out of 100x more compute, which will cost approximately 7 trillion dollars.
I also think of the retrieval part as a bottleneck, and I am super excited about what the future holds.
In particular, I wonder whether RAG systems will soon be a thing of the past, because end-to-end-trained gigantic networks with longer attention spans, compression of knowledge, or hierarchical attention will at some point outperform retrieval. On the other hand, I can also see a completely different direction, where we develop architectures that, like operating systems, deal with memory management, scheduling and so on.