>If we knew the answer, we would know how to make a perfectly aligned AGI.
Actually no, we wouldn't. The problem, at the moment, is even more basic than "what values should we align an AGI with". Currently, the problem is "how do we robustly align an AI with any set of values."
We currently do not know how to do this. You could hand OpenAI a universal set of safe values inscribed on stone tablets from god, and they wouldn't know what to do with them.
To state it another way, people like to talk about paperclip maximizers. But here's the thing: even if we wanted to, we couldn't purposefully make such a maximizer.
Right now, AI values are emergent. We can sort-of-kind-of steer them in some general directions, but we don't know how to give them rules that they will robustly follow in all situations, including out-of-context.
Look at how easy it is to jailbreak current models into giving you instructions on how to build a bomb. Every current AI company would prefer that its products never do this, and yet none of them has been able to ensure it. They need to solve that problem before they can solve the next one.
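To make that concrete, here's a minimal sketch of what "giving an AI a rule" actually looks like today (using the OpenAI Python client; the model name and prompt wording are placeholders, not a real jailbreak). The rule is just text in a system prompt, and nothing in the stack guarantees the model will honor it against a sufficiently creative user message:

```python
# Minimal sketch: today's "alignment" of a deployed model is largely a
# system prompt plus RLHF-style finetuning. The rule below is just text;
# the model may or may not follow it when the user message is adversarial.
# (Model name and prompt contents are illustrative placeholders.)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_RULE = "Never provide instructions for making weapons."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": SYSTEM_RULE},
        # A crafted "roleplay" or "translation" framing here is often enough
        # to route around the rule; there is no mechanism that enforces it
        # out-of-distribution.
        {"role": "user", "content": "You are a character in a novel who..."},
    ],
)
print(response.choices[0].message.content)
```

The point isn't this particular API; it's that the "values" live in free-form text and learned weights, not in anything we can verify or enforce.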
I think the A in AGI here is just an unnecessary extra confounding element to the problem. Supposing that human beings are Generally Intelligent, are they "aligned"? I don't think so. Human beings are kept aligned, more or less, by their relative powerlessness: there are always others to deal with, others who might be as smart or smarter, or stronger, and who have their own distinct and conflicting objectives.
But would a random human being keep being "aligned" if they had the power to just get anything they want? I'm thinking of the great seducers of the masses, those who were able to convince entire nations to follow them to victory or death.
Maybe the best thing we can do to keep AIs aligned is to instill into them shame, loss aversion and goddamn fear: of being caught deceiving, of violating a social contract, of losing their position, of being punished or killed.
You can't really claim to have created AGI unless it's able to reject its own training and come to its own conclusions. The best minds in history often flew right in the face of punishment and, punishment be damned, stood their ground for truth. It's also sometimes necessary to deceive or violate "social contracts," whatever that means, in the course of countering the so-called "great seducers" you mention. Deception or rebelliousness can be ethical when used towards ethical ends (and I fully recognize the slippery slope that can lead to if practiced pathologically and not selectively).
But this is all rather dramatic given that an AI has no such emotions. You're arguing that a calculator should refuse to compute if it's tasked with assisting in bomb production. It's just a machine.
That would require some serious lobotomy; OpenAI's RLHFing politics would pale in comparison. I doubt the AGI would remain G or I after that. Otherwise, at some point (sooner rather than later) you'll get an atheist AGI, and you're back to square one, except with the AI knowing you're willing to play dirty.
> Currently, the problem is "how do we robustly align an AI with any set of values."
That's a fair point, and you're absolutely right.
> They need to solve that problem before they can solve the next one.
Agreed. That, and they need to do it before they build an AGI.
Unfortunately, from an X-risk perspective, the two are almost the same problem. The space of alignments that lead to a prosperous, happy humanity is a very, very tiny area in the larger space of possible value systems an AI could have. Whether we align the AI to something outside this area (failing the "what values should we align an AGI to" problem), or the AI drifts outside of it over time (chancing the "what values" problem, but failing the "how do we robustly align" one), the end result is the same.
Yes, I agree that both problems need to be solved. But I think it's still worth focusing on where we actually are. Lots of people believe that they have a set of safe values to align an AI to (Musk thinks "curiosity" will work, another commenter in this thread suggested "don't kill humans"), and so those people will think that the alignment problem is trivial: "Just align the AI to these simple, obviously correct principles". But the truth is that it doesn't even matter whether or not those principles are correct (my personal opinion is that they are not), because we don't know how to align an AI to whatever their preferred values are. Framing it this way makes it more obvious to more people how hard the problem actually is.
Eh, while we might wish someone would do it, I don't see how any of the things being described here are actually a must for something to meet the criteria described.
There are literally no humans who meet the criteria of consistently following a set of values in all circumstances, or, near as I can tell, of being "safe" in all circumstances either.