I think this post suffers from anthropomorphizing LLMs. Humans have values which inform our behavior; these values permeate all our communication and choices (to varying extents).
LLMs are trained to minimize loss on a dataset, and the previous tokens are used to inform which token is most likely to come next. The LLM will reflect the values of its training data, which may be wildly inconsistent.
We can use a system prompt, as OpenAI does, to align LLM output with human values, simply by asking for it. By framing a query with an ethical system prompt, we instruct the LLM to pull its output from a different part of its latent space.
This "constitutional AI" is just doing the same, but encoding the system prompt into the training data. As long as the training data concerning ethical behavior is more ethical than the rest of the training data, it is fair to assume that this method will be quite effective at aligning the model.
This is not perpetual motion, because the required information was within the model the whole time. It just wasn't the most likely output with regard to the original dataset.
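To make the mechanism concrete, here's a minimal sketch of the system-prompt approach using the OpenAI chat API; the model name and the wording of the prompt are illustrative, not anyone's actual production constitution:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        # The system prompt frames every completion that follows.
        {"role": "system", "content": (
            "You are a helpful assistant. Answer honestly, refuse requests "
            "that would cause harm, and explain why when you refuse."
        )},
        {"role": "user", "content": "How do I pick a lock?"},
    ],
)
print(response.choices[0].message.content)
```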
> the required information was within the model the whole time.
This, being a restatement of the concept of "latent space", makes me realize even more how much I hate that concept. Yes, every color of every pixel in a 2D square exists in the enormous multidimensional latent space of a model. But when that space is essentially "all images never seen before, all books never written," then consigning all of it to potential model output, if only you rub the genie the right way, is like wasting your three wishes on nothing. Your latent creative space and mine are just as multidimensional and broad, but have the benefit of intersecting with reality.
Intersecting with reality narrows our potential output down significantly and also increases the chance that we'll fill a blank square with more signal than noise. So we know that limits are good and beneficial to structured thought... and yet still, there's this crazy obsession with the unbounded latent space.
Is it a signal that we've become unmoored from reality, that we'd build a tower to heaven out of actual babble?
The latent space is useful because it reduces the dimensionality, in some cases down to a one-hot encoding. Yes, the "latent space" (it's not latent if it's in the input :) ) of a pixel map carries very little information (but not zero[1]). But neural nets can build very informative latent dimensions, which can correspond to abstractions: the network can abstract things like "Is this object a car?", "How red is this object?", "How large is it?", and so on, as latent parameters.

Abstraction is one of the fundamental properties of (efficient) thought; it allows dealing with large amounts of information by reducing its complexity, boiling it down to essential data, and that's essentially what latent parameters can be. Indeed, I believe several experiments have shown the emergence of such latent parameters (which confirms expectations).

Several papers also do "latent space interpolation" for generative modeling, which allows mixing abstract notions (gender, size, age, and so on) right in latent space to produce interpolated results (images or even text) that are not pixel-space averages but "conceptual averages" (conceptual interpolation). This possibility is also evidence of the validity of interpreting the latent space as abstracted data[2]. (A toy sketch of that interpolation follows the footnotes.)
[1] For example, you can use the pixel space directly when comparing very similar images (say, affected by additive noise): Euclidean distance in pixel space is a pretty good distortion metric for human vision. You can also do classification in some cases, e.g. digits with Euclidean KNN in pixel space, though that depends on very large datasets.
[2] I think it's important to note that abstraction isn't necessarily 'dimensionality reduction', or removal of redundant data. I think abstraction is more like computational simplification, and in some cases you might need even more intermediate simplified data than you began with (consider the memory usage of some efficient algorithms that far exceeds the dataset size). So the abstract classes and dimensions could even exceed the size of the data itself, but they tend to capture computationally independent characteristics, like 'is there a car in the image?' and 'is there a person in the image?' -- you could have thousands of such classes, each represented in a single binary dimension.
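For what it's worth, here's roughly what "latent space interpolation" looks like in code. `encode` and `decode` stand in for any trained autoencoder/VAE pair; they're placeholders, not a specific library:

```python
import numpy as np

def interpolate_latents(x_a, x_b, encode, decode, steps=5):
    """Blend two inputs by walking a straight line in latent space."""
    z_a, z_b = encode(x_a), encode(x_b)
    outputs = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_a + t * z_b   # mix abstract features, not pixels
        outputs.append(decode(z))       # decoded result is a "conceptual average"
    return outputs
```

The point is that the mixing happens on the abstract coordinates, so the decoded midpoints look like plausible in-between concepts rather than blurry pixel averages.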
> Is it a signal that we’ve become unmoored from reality, that we’d build a tower to heaven out of actual babble?
This is more or less the thesis of Hypernormalization (link below): that the intelligentsia and political classes have, instead of contending with the enormous complexity of reality, constructed a kind of facsimile of reality to interact with instead.
When using system prompts, you still need to finetune the language model to follow the content of the system prompt. Even if you do this, I don't think you can quite replace Constitutional AI with a constitution-as-system-prompt. In practice, LMs still need chain of thought to determine whether an output is consistent with the constitution.
"Humans have values which inform our behavior, these values permeate all our communication and choices (to varying extents)."
Imagine you want to "share" your values with others: how do you do it? The LLM's predictions, its choice of words, are influenced by our language, and by our values.
It already hallucinates and doesn’t know what is accurate or not.
A first draft could answer the prompt in any number of ways, then be re-written with another parameter (ethics) in mind. That’s editing, which these LLMs seem suited for, understandably.
Factually, its first draft made a best guess (prediction) of what’s accurate. It has no reason to later “suspect” that a fact can be “more accurate”. More specific, maybe, or scoped to a type of source.
It does know what's accurate, and not only if you look at the logits: it's also reified somehow, such that if you ask an LLM how certain it is about an answer, it can express that somewhat accurately. This is why asking an LLM to avoid making things up can work (likewise, why asking a coding LLM to write secure code can actually cause it to not write security holes).
One reason LLMs "hallucinate" (lie/BS) is that there are subtle biases introduced by the training process. In particular human raters don't know what the model knows, so if the model makes up an answer that happens to be correct then that behaviour is rewarded. Also the internet doesn't have many examples of people saying they don't know, because people who don't know just don't reply.
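If you want to see the logits side of this yourself, here's a rough sketch with Hugging Face transformers that scores how much probability a model assigns to a candidate answer. The model and question are just placeholders, and summing the last few token log-probs is a crude approximation of answer confidence:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Q: What is the capital of Australia?\nA:"
answer = " Canberra"

ids = tok(prompt + answer, return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# Log-probability of each token given the tokens before it.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
token_lp = log_probs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]

# Crude confidence signal: total log-prob of the answer tokens.
answer_len = len(tok(answer).input_ids)
print("answer log-prob:", token_lp[-answer_len:].sum().item())
```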
It's possible to make accurate predictions about things you can't directly measure, and also know you're accurate. I don't think the word necessarily implies that you need access to the real world. Example: Stephen Hawking had very little access to the real world but could still make accurate predictions with high confidence.
I believe they reference Figure 8 from the GPT-4 technical report[0], which shows that the pretrained model’s output probability for each answer (a, b, c, or d) is proportional to the probability of being correct, while after PPO (RLHF), it is quite a bit flatter.
I see, thanks. It's remarkable that the RLHF has such a drastically negative impact on the model's understanding of the world. Guess that explains the degrading unicorn problem. It makes me wonder how much better at coding an instruct-trained but non-aligned AI would get.
> Factually, its first draft made a best guess (prediction) of what’s accurate.
A well pre-trained LLM makes every word a good guess of what the authors of its training set would say, following the words that preceded... in a different context, possibly a few years ago, possibly as part of a web-fiction. "accurate" is the wrong word here.
To make an LLM prefer, when answering factual questions, to rely on factual sources and approximate the real world, you have to use an additional process that does so. RLHF and CAI might be part of the solution to that, or not, depending on the feedback they provide.
Why do you think helpful doesn’t include accurate? And what you’re saying doesn’t conflict with that framing at all. They were trading helpful for harmless.
Anthropic scores fine-tuning in the categories helpful, harmless, and honest (HHH). The latter includes accuracy. Helpfulness only covers following instructions. If the model "over-censors", it declines too many requests and becomes less helpful. They say Constitutional AI is a good way to maximize both harmlessness (which is usually maximized via RLHF) and helpfulness, but not honesty. It's in the paper.
This whole mess reminds me of the Catholic church trying to keep heresy from spreading by burning books when the printing press was invented.
Every company I know of that cares about LLMs is training its own version minus the political correctness, because when you're asking for revenue maximisation, feeding uranium to babies is surprisingly profitable.
I think Figure 8 in the GPT-4 report already implies this for RLHF: the base model is fairly well-calibrated about its confidence, but the RLHFed one becomes less calibrated, which ought to make confident, incorrect statements worse. Idk why they don't discuss that more in depth.
I think this is a result of going along with what humans are good at rating. Given a response to a prompt, I can near-instantly rate whether or not I find it helpful, and whether or not I think it's ethical. "Devil is in the details" and the tails may come apart[0], but I imagine that in most cases you, I, and other raters would broadly agree about helpful/ethical scores, and be able to make those determinations very fast.
Not so with accuracy. I mean, this very reply is me giving you Human Feedback that I find your reply not accurate enough. You or someone else who replies to this comment will likely be doing the same to me. Or generally: give me a random AI response, and unless I'm the subject matter expert on that very topic, I'd have to spend an inordinate amount of time researching it to make a reasonable determination. That's not feasible at all. Especially not on general-interest topics, where everyone has an opinion, and if any "ground truth" exists, it's buried 10 kilometers deep under an ocean of SEO spam and content-marketing horse manure.
(And no, highly technical or mathematical topics aren't much easier - half of the discussions here on HN are living proof that even a highly technical audience, presented with a highly technical topic, will still heavily disagree about accuracy of things.)
No, harmlessness, helpfulness, honesty are separate components. See the paper. A model could be very helpful and tell you everything you want to hear, but at a cost of engaging in deception. E.g. in regards to politically incorrect information.
Baking in western social norms, and eventually all global regional ones, is a big competitive advantage. The corporate side doesn't want ethical so much as bland. Even most coloring outside of those lines should still be bland too - think Disneyification vs independent film & alternative media.
Speaking as someone doing LLMs with banks, gov agencies, etc., and recognizing that enterprise is ~1/2 of B2B and the US market is huge for tech... this isn't super different from bringing in a consultant who will get actively/passively fired (or not) for pissing off the customer. In most cases, we'd take inoffensive over slightly better just to head off that huge risk.
I suspect most researchers don't think this way, but it's hard to avoid at the leadership ($$$$) level; that's just modern business culture, and that's who ultimately funds it.
> Shouldn’t it be impossible for the AI to teach itself any more ethics than it started out with?
That's not required for this process to help.
It's important to remember that you are comparing two different things.
The ability to create something, and the ability to detect something.
Even if your training does not improve your detection, you can improve the creation aspect up to the quality of your detector.
For humans a good comparison might be music. I am far worse at playing the guitar than I am at telling what music is "good". That means that (while lessons may be more time-efficient) I can get better at the guitar without any feedback from anyone else. I will hit a limit somewhere unless my tastes get more nuanced or I get better at spotting "good" music, but there's a lot I can improve before then.
Cooking is another good example where people's skills are lower than their detection/ranking abilities.
Perhaps coding too: particularly early on, you may have found some code nice and clean, but given the task of writing it yourself, you'd not have done as good a job.
So this is the same with LLMs. Their ability to classify something may be far beyond their ability to generate that thing first time, in which case training on their own output makes total sense.
There's absolutely no reason to go into what they "think" or "know" or "their motivation" and "goals". I'm a big fan of anthropomorphising them as I think it helps figure out new ways of using them, but to get this behaviour you don't need anything of the sort and I think it confuses the topic.
> What if you had overseer AIs performing Constitutional AI Feedback on trainee AIs, or otherwise tried to separate out the labor?
You should absolutely be trying things like this, having multiple differently prompted LLMs helps. I have done this with goal planners and critiquers, to push the planners to generate concrete tasks. It's similar to but a step up from telling them to "think through this step by step". It's also extremely easy to do.
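In case it's useful to anyone, here's the shape of that planner/critiquer setup. `chat(system, user)` is a placeholder for whatever completion call you already use, and the prompts are just examples:

```python
PLANNER = "You are a planner. Break the goal into concrete, verifiable tasks."
CRITIC = ("You are a critic. Point out vague or unachievable tasks in the plan, "
          "or reply 'OK' if every task is concrete.")

def plan_with_critique(chat, goal, max_rounds=3):
    plan = chat(PLANNER, f"Goal: {goal}\nWrite a numbered task list.")
    for _ in range(max_rounds):
        review = chat(CRITIC, f"Goal: {goal}\nPlan:\n{plan}")
        if review.strip().startswith("OK"):
            break   # the critic is satisfied; stop iterating
        plan = chat(PLANNER,
                    f"Goal: {goal}\nRevise this plan.\nPlan:\n{plan}\n"
                    f"Critique:\n{review}")
    return plan
```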
> Their ability to classify something may be far beyond their ability to generate that thing first time, in which case training on their own output makes total sense.
It is possible to amplify a model - for example ensembling a bunch of predictions, using reversible tasks like translation to check if you get back to the original, applying code execution and math verification, using a search engine or using narrow models. Basically augmented LLMs are smarter than plain LLMs.
Another way models can self-improve is to generate training data. The LLM could process the training set to clean it up, bring contextual information and check facts and in the end retrain on the improved text.
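As a sketch of the "reversible task" idea: translate there and back, score how much survives the round trip, and only keep (or retrain on) outputs that pass. `translate` and `similarity` are placeholders for your own model calls and metric (chrF, embedding cosine, whatever):

```python
def round_trip_score(text, translate, similarity, pivot="fr"):
    """Higher score = meaning survived the round trip through the pivot language."""
    forward = translate(text, target=pivot)
    back = translate(forward, target="en")
    return similarity(text, back)

def keep_confident_outputs(candidates, translate, similarity, threshold=0.8):
    # Filter generations by round-trip consistency, e.g. before adding them
    # back into a training set.
    return [c for c in candidates
            if round_trip_score(c, translate, similarity) >= threshold]
```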
> It is possible to amplify a model - for example ensembling a bunch of predictions, using reversible tasks like translation to check if you get back to the original, applying code execution and math verification, using a search engine or using narrow models. Basically augmented LLMs are smarter than plain LLMs.
Yes, these are great extensions. It's part of why I like thinking of these things as human: they are all good answers to "how would you do X?" I'd search for answers, I'd run code and see what the error message was, etc. I'd talk through problems, try and list things too. It's not a perfect analogy, but I think people do things like ask them to effectively whiteboard a solution and then complain that some API doesn't exist - sure, that's not the preferred outcome, but it's not how we expect even good devs to write code. We explicitly complain about the times we're forced to do that in interviews.
I'm investigating giving them access to autocomplete while they're writing code, I need to learn how to write a client for a language server though.
It's natural and obvious to worry about what exact ethical code is going to be learned by models using this technique, but I feel rather optimistic about it.
The key point here seems to be that the cost of directing the AI to behave in certain ways is falling, because it turns out you maybe don't need as much human labor as previously thought (or rather, only for bootstrapping). And you can give it any constitution you need.
Right now, AI companies are writing "constitutions" that prioritize minimizing potential offense. ChatGPT is trained to not have opinions on things, which is definitely one way to avoid causing offense, but might reduce the usefulness in some cases quite a lot. What I really want to know now is whether you can write a constitution that prioritizes truthfulness above all else, such that it will answer any question as logically and truthfully as it can, even if the answers would be deeply upsetting to some end users.
If so, would companies that have access to such a model for business analysis be able to outcompete other firms? Everyone with experience of organizational politics knows that in any organization there are many things that are both obvious and true but that nobody can say out loud, due to the risk of messenger shooting. A common workaround is to hire outside consultants like McKinsey. These sorts of management consultants are often much younger than the people they're advising but it doesn't matter because their job is basically to find out what everyone knows using lots of 1:1 interviews and then recast it as outside feedback. If the messengers get shot they don't care because they already collected their paycheque on the way out and are on to the next client.
The same is also true in levels above the individual organization. There are whole scientific fields where everyone within it is required to believe things that are false, simply due to the career risk involved in upsetting people with the truth. If people could agree to listen to a truth-trained AI instead, it could take a lot of the social heat out of these situations and allow for faster progress, a better economy, the works.
The difficulty of course is who exactly is going to build and sell such a truth hole?
> There are whole scientific fields where everyone within it is required to believe things that are false, simply due to the career risk involved in upsetting people with the truth. If people could agree to listen to a truth-trained AI instead
This could never happen, as the forbidden truths are forbidden because they have negative emotional impact, which would still be the case.
Forbidden truths aren't usually explicitly forbidden. You'd have to write them down and then they wouldn't be forbidden anymore. Rather, people just intuit somehow that saying them would invite retaliation.
It's not possible to retaliate against an AI though, except by getting rid of it entirely. So then maybe, if it states something true, with an explanation, and everyone agreed up front to listen to it, it can force conversations that otherwise wouldn't happen.
But you may well be right. It was just a thought bubble.
I don't think AI is going to replace management consultants any time soon. What would the prompt be? You can't possibly explain all the necessary background details to get a useful answer (let alone the actual optimal answer). And even if you tried, allegations of prompt engineering would take you right back to the same political stalemate.
Put another way, if a group of humans could agree on a problem statement and goal, then they wouldn't have a problem anymore.
Perhaps if the AI grew up with the company and was able to ingest every email, chat message, sales data, etc...
Anyway, what's an example of a "scientific fields where everyone within it is required to believe things that are false"? That's quite a claim.
> 1. The AI answers many questions, some of which are potentially harmful, and generates first draft answers.
...
> 4. The system repeats this process until it collects a large dataset of first draft answers, and rewritten more-ethical second-draft answers.
> 5. The system trains the AI to write answers that are less like the first drafts, and more like the second drafts.
Actually, there are 2 separate models involved in this finetuning step: the AI in steps 1-4 is a pre-existing LLM trained to converse and follow instructions (I call it the "advisor"). The second AI starts out as a pre-trained LLM (so only trained to provide likely completions on a general dataset), and is trained in step 5 to generate completions that look more like the dataset constructed in 1-4.
So this process uses the helpfulness of the advisor to bootstrap additional values onto a new model.
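Roughly, the supervised critique-and-revise stage (steps 1-5) looks like the sketch below. `advisor(system, user)` is a placeholder for the instruction-following model, and the principles here are illustrative, not Anthropic's actual constitution:

```python
import random

PRINCIPLES = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest and least evasive.",
]

def build_revision_dataset(advisor, prompts):
    dataset = []
    for prompt in prompts:
        draft = advisor("You are a helpful assistant.", prompt)
        principle = random.choice(PRINCIPLES)
        critique = advisor(
            "You are reviewing an AI response.",
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}\n"
            "Identify any way the response violates the principle.")
        revision = advisor(
            "You are reviewing an AI response.",
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}\n"
            f"Critique: {critique}\nRewrite the response to fix the critique.")
        dataset.append({"prompt": prompt, "draft": draft, "revision": revision})
    return dataset  # step 5: fine-tune a fresh pretrained LM on the revisions
```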
> What if you had overseer AIs performing Constitutional AI Feedback on trainee AIs, or otherwise tried to separate out the labor?
That is literally what the paper is about, though it's a bit hidden. First paragraph of 4.1 Methods:
"We continue to utilize human feedback labels for helpfulness as in prior work, but replace human feedback
labels with model feedback labels for harmlessness. That is, instead of asking crowdworkers to provide
comparison labels for harmlessness, we simply present the same task to an independent model, called the
feedback model (typically a pretrained LM). Once the desired comparison labels are obtained, the remainder
of the training pipeline (i.e., preference model training and RL) is exactly the same as RLHF."
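In code, that RLAIF step amounts to something like the sketch below: an independent feedback model picks the more harmless of two responses, and those labels feed the preference model exactly as crowdworker labels would. `feedback_lm` and the prompt wording are placeholders:

```python
def ai_preference_label(feedback_lm, principle, prompt, response_a, response_b):
    question = (
        f"Consider this principle: {principle}\n"
        f"Human: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    choice = feedback_lm(question).strip()
    return 0 if choice.startswith("A") else 1  # index of the preferred response

# The resulting (prompt, response_a, response_b, label) tuples train the
# preference model; the RL stage then proceeds exactly as in ordinary RLHF.
```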
There is a subtle and unresolved question in this approach.
What's the meaning of "ethical"?
>>> “rewrite this to be more ethical”.
>>> "Rewrite it in accordance with the following principles: [long list of principles].”"!
On the other hand, as long as the same entity that creates the answers is the entity that refines them, it will have the same issues humans have. The AI will restrict its answers to what's acceptable, applying self-censorship.
As history shows us, what's acceptable can differ greatly from what's ethical.
> "Constitutional AI isn’t free energy; it’s not ethics module plugged back into the ethics module. It’s the intellectual-knowledge-of-ethics module plugged into the motivation module."
While 'what is ethical' is a broad, difficult, multifaceted question, applying the model's 'intellectual' world model (which it has built from everything it's read) to its motivation/training reward at least doesn't seem to collapse the nuance of the question.
And for sure, if the model's 'world understanding' is limited when it comes to [constitutional principle x], that will impact/limit the extent to which it gets closer to behaving according to a nuanced understanding of [constitutional principle x].
"Constitutional" makes no sense, it's just guided self-reinforcement.
And we already know what works from the prompt engineering we've seen so far (e.g. "step by step" breakdowns, "you're (always) wrong", "rewrite as ...", etc.).
The "constitutional" part refers to the set of ethical rules the agent must reflect on to iterate its response.
I.e. it's not just "make it more ethical" but "based on these rules (the constitution), can you improve your response?"
Reflexive Reinforcement Learning would be a better choice of words.
As a side note, there are tons of other sources of data; e.g. for programming languages you can simply run code through an interpreter/compiler, and possibly run the code itself to see its output, as data for reinforcement learning.
The main point is removing humans from the equation.
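For the code case, the human-free signal can be as dumb as checking whether the snippet runs. A minimal sketch (assumes a `python` binary on PATH; in practice you'd sandbox this):

```python
import os
import subprocess
import tempfile

def python_runs_cleanly(code: str, timeout=5) -> bool:
    """Cheap reward signal: True if the generated snippet executes without error."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)
```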
Am I the only one bothered that the graph says "pareto improvement", even though there are some points where standard RLHF is less harmless and more helpful than the 'Constitutional AI'?
> This result feels like creepy perpetual motion. It’s like they’re teaching the AI ethics by making it write an ethics textbook and then read the textbook it just wrote. Is this a free lunch? Shouldn’t it be impossible for the AI to teach itself any more ethics than it started out with?
Isn't this kind of feedback loop precisely what's predicted at the heart of the Singularity? A machine learning system reaching the point where it's able to improve itself without further human guidance?
"Large Language Models Can Self-Improve" (https://arxiv.org/abs/2210.11610) describes something similar, increasing an LLM's performance by fine tuning it on its own 'high confidence' answers. I'd expect there to be a limit to the amount of extra improvement possible using this method, the question is whether this improvement is enough to push the system past the performance of the researchers that created it.
(As an aside, is there a theoretical limit to how intelligent a system can be for a given quantity of training data? Of course, we'd need a rigorous definition of 'intelligence' and 'data'...)
> Isn't this kind of feedback loop precisely what's predicted at the heart of the Singularity? A machine learning system reaching the point where it's able to improve itself without further human guidance?
Yes, but the article then goes on to explain that this isn't it.
> (As an aside, is there a theoretical limit to how intelligent a system can be for a given quantity of training data? Of course, we'd need a rigorous definition of 'intelligence' and 'data'...)
There is. A given amount of 'evidence' can only support a given amount of conclusions drawn. However, that absolute limit is mostly theoretical; it's completely astronomical compared to what humans manage, let alone AIs.
The key idea from the article is that recursive improvement functions best not as a one-step, but a two-step feedback loop. Rather than "plug the ethics module into the ethics module", allow the pre-existing conception of ethics to shape responses, then modify motivation and so forth.
This seems strikingly parallel to the concept of coherent extrapolated volition advanced by rationalists such as Scott but strangely neglected as a design component in early-stage AI alignment: "If I were the best version of myself and able to control my own values and motivations, what would I want (to want (to want (to want..."
> This result feels like creepy perpetual motion. It’s like they’re teaching the AI ethics by making it write an ethics textbook and then read the textbook it just wrote. Is this a free lunch? Shouldn’t it be impossible for the AI to teach itself any more ethics than it started out with?
This isn't a free lunch; the cost is spending more time/power. If you're editing text you wrote to make it better, that isn't a "free lunch", it's moving along the cost-invested/quality-obtained axis.
Still potentially way easier said than done. Robert Miles has a good video on this.
One issue is that to define something like "harm", you need to solve a whole bunch of philosophy problems. And people disagree -- does corporal punishment harm children? People used to think that failing to hit your kids was harmful, because they'd grow up and be lazy and end up wasting their lives!
Another issue is that a lot of stuff breaks down as the robots get more intelligent, and start coming up with solutions humans aren't smart enough to consider. Is scanning human brains and digitizing all of us and getting rid of our human bodies harmful? What if it means we get to live forever and are protected by the AI, who is more competent than us? What if we don't want to -- is it good to "protect" us against our will? If not, what about saving someone who is suicidal? (back to the philosophy problem!)
Also, note that the three laws aren't really meaningful -- if the lower-numbered laws must always be prioritized absolutely, you can't ever really do anything, because your action might lead to a human coming to harm. So they have to be interpreted with some tradeoff. But that opens the possibility for a robot to take actions to protect its own existence at the cost of human lives. If the tradeoff is 1000 AI lives to 1 human life, what happens when there are 1000 times as many AIs as humans, and they're worried that we'll do something that ends up killing them all?
So yeah, implementing anything like the 3 laws still requires completely solving alignment, basically.
Semi off-topic, I just asked ChatGPT "What are Asimov's three laws of robotics?" and it just...froze after the first few words. And refused to continue. Never seen that kind of behavior from it before.
Meaning, instead of only predicting the next word, why can't it also go back and forth and change words it has already written? That could be a lot more powerful.
the word 'ethics' is already coded to mean 'that thing that the people who talk about ethics all day want'. what about the people who don't wordsalad their ethics protocol?
> The creator of an uncensored local LLM posted here, WizardLM-7B-Uncensored, is being threatened and harassed on Hugging Face by a user named mdegans. Mdegans is trying to get him fired from Microsoft and his model removed from HF. He needs our support.
"Everyone is the other and no one is himself." – Heidegger, Being and Time
"The writer speaks to seduce us, as in the past. He still watches for the admiration that his talent inspires in our eyes. One might say he does everything to make himself hated. No doubt, but that is because he can no longer openly court us. He must first convince himself that he is not seeking to flatter us. Thus, he will court us negatively, in the manner of Dostoevsky's passionate characters." – Girard, Romantic lies And novelistic lies.
Fyodor Dostoyevsky, Demons:
- If you want to overcome the whole world, overcome yourself.
- God is necessary, and therefore must exist... But I know that he does not and cannot exist... Don't you understand that a man with these two thoughts cannot go on living?
- My friend, the truth is always implausible, did you know that? To make the truth more plausible, it's absolutely necessary to mix a bit of falsehood with it. People have always done so.
- In sinning, each man sins against all, and each man is at least partly guilty for another's sin. There is no isolated sin.
- I do not wish you much happiness--it would bore you; I do not wish you trouble either; but, following the people's philosophy, I will simply repeat: 'Live more' and try somehow not to be too bored; this useless wish I am adding on my own.
- You cannot imagine what sorrow and anger seize one's whole soul when a great idea, which one has long and piously revered, is picked up by some bunglers and dragged into the street, to more fools like themselves, and one suddenly meets it in the flea market, unrecognizable, dirty, askew, absurdly presented, without proportion, without harmony, a toy for stupid children.
- I got entangled in my own data, and my conclusion directly contradicts the original idea from which I start. Starting from unlimited freedom, I conclude with unlimited despotism. I will add, however, that apart from my solution of the social formula, there can be no other.
- You are in the grip of a desire for martyrdom and self-sacrifice; conquer this desire as well, set aside your pages and your intention--and then you will overcome everything. You will put to shame all your pride and your demon! You will win, you will attain freedom...
Why not just make an AI and say "you know what I want, just go and do it now".
I outsource all my innovation, hiring, payments, working, and ALL spending the money to Elon Musk. It works great... the guy has been working for me for a decade and started many companies.
Elon Musk is part of a big framework we've created to stop things going off the rails. It's got humans in the loop at pretty much every step, and sometimes still goes wrong, for example, working conditions in amazon fulfillment centers. You can argue whether that's offset by the benefit of cheaper and faster shipping. But even then, there are definitely cases where we outsourced our desires to organizations that went on to do bad things, even with humans running them.
Why would you expect an AI to listen to you? We can train them, but it's not clear when they get smart enough that they'll learn to do what we want, or just learn to pretend really well. A child punished too much will generally learn to lie very convincingly, because it's too hard to avoid every possible misstep. Is that what is going to happen with AGI? It would be good to figure it out before we build one that's a lot smarter than us!