
> don't buy the orthogonality thesis or instrumental goals argument

can you elaborate on this? why? what's the fallacy (wrong base assumption) in them?



At least for the orthogonality thesis, it is a base assumption. It's a claim that cov(intelligence, goodness) = 0. The instrumental convergence thesis assumes rational agentic behavior, i.e. that the AI behaves like a goal-directed agent. While that may be reasonable, it's still an unjustified assumption.
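
To make the cov(intelligence, goodness) = 0 phrasing concrete, here is a toy sketch in Python; the "agents" and both scores are entirely invented, it only illustrates what the claim would mean statistically:

    # Toy illustration of the orthogonality claim: if intelligence and "goodness"
    # are drawn independently, their covariance tends to zero, i.e. knowing how
    # smart an agent is tells you nothing about its goals.
    import random

    random.seed(0)
    n = 100_000
    intelligence = [random.gauss(0, 1) for _ in range(n)]
    goodness = [random.gauss(0, 1) for _ in range(n)]  # sampled independently of intelligence

    mean_i = sum(intelligence) / n
    mean_g = sum(goodness) / n
    cov = sum((i - mean_i) * (g - mean_g)
              for i, g in zip(intelligence, goodness)) / n
    print(cov)  # ~0 for this toy population

Whether real minds actually look like that independent draw is exactly what's in dispute.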


These seem very reasonable assumptions to me. Of course we can say that even the most evil humans are non-zero human-aligned.


I wish I had an article I could point you towards, but this is the end-state of my own many-years internal battle with AI x-risk in my own mind. I got distracted by the potential for AGI long before it became fashionable, and after going down the LW rabbit hole it took me a lot of first-principles thinking to reverse my way back out. I should probably write that article, but frankly when I sit down to do that I get depressed, as I don't want to waste any more of my life on this than I already have.

But in short (edit: haha! oops) summary: I am actually wrapping a couple of different related concepts into the orthogonality thesis, which is a bit sloppy of me. Under the single moniker "orthogonality thesis" I was also including the fragility of moral systems, in addition to the independence of moral systems from any metric of intelligence. Both are based on the evolutionary psych model of how the brain works, in which we are an amalgam of special-purpose computational units rather than a single universal algorithm. If this were true, then you might hypothesize that much of the human mindset is a result of our weird evolutionary history, and if you don't get that exact same set of evolutionary end products, you won't get a human-like intelligence. Aliens and artificial beings are, by default, going to be very strange, and very evil (by our standards).

All of the assumptions that went into that are wrong. It turns out our brain runs the same universal learning algorithm throughout, and all that varies between regions are the training conditions, which lead to specializations. But if you train an artificial brain with the same training data, you are more likely than not going to get something resembling a human being in its thought structure. Which actual real-world experiments (e.g. GPT) have borne out.

Our morality is a result of our human instincts, yes, but it is becoming increasingly clear that our human instincts are the result of intelligence (neurons) doing their universal learning thing on similar inputs across many instantiations of people. We all think largely, though not entirely, the same way because we share the same(-ish) training data (childhood) and similar training constraints (parenting).

The orthogonality thesis says that if you put an artificial intelligence in a kindergarten with other 5 year olds, it is random luck whether its brain structure is such that it would learn the value of sharing and friendship. The reality, near as we can tell, is that once a reinforcement-learning attentional-network agent achieves a certain level of general intelligence capability, it does learn from and reflect the environment in which it is trained, just like a person. A GPT-derived AI robot put into a real kindergarten will, in fact, learn about sharing and friendship. We haven't actually done that yet (though I would love to see it happen), but that is essentially what the reinforcement learning from human feedback (RLHF) stage of training a large language model is.
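
To make the RLHF point concrete, here is a toy sketch of the idea; every name and number below is a made-up stand-in, not any real lab's pipeline or API:

    # Toy sketch of RLHF: a "policy" over actions gets nudged toward whatever
    # the human feedback signal prefers. Real RLHF uses a reward model and
    # policy-gradient updates on an LLM; this only shows the shape of the loop.
    import random
    random.seed(1)

    ACTIONS = ["share the toy", "grab the toy", "ask to take turns"]
    weights = {a: 1.0 for a in ACTIONS}  # toy policy: preference weight per action

    def policy_sample():
        # sample an action in proportion to its current weight
        r = random.uniform(0, sum(weights.values()))
        for a, w in weights.items():
            r -= w
            if r <= 0:
                return a
        return a

    def human_feedback(action):
        # stand-in for human preference labels
        return {"share the toy": 1.0, "ask to take turns": 0.8, "grab the toy": -1.0}[action]

    for _ in range(2000):
        a = policy_sample()
        weights[a] = max(0.01, weights[a] * (1.0 + 0.05 * human_feedback(a)))

    print(sorted(weights.items(), key=lambda kv: -kv[1]))
    # the prosocial actions end up with essentially all of the weight;
    # "grab the toy" is driven down to the floor

The kindergarten just supplies the feedback signal; the update rule is the same.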

So the whole deceptive twist part of Bostrom/Yud's argument is ruled out by actual AI architectures that we've actually built and have experience with. If you do a thousand different training runs you're bound to get a bad apple here and there, just like real human societies have to deal with psychopaths. But the other 99% will be normal, well adjusted socially integrated (super-)intelligences.

Bostrom and Yud worried about things like the burning house and genie problem: your grandmother is trapped in your house, which is burning, and you make a wish to the genie to remove your grandmother from the burning house as quickly as possible. The genie is not evil per se, but it is just very literal. Being the GOFAI-derived AIXI universal inference agent that they were imagining, it does a Solomonoff induction over all possible actions (<-- hidden multiplication by infinity here!) to see which one meets the goals as stated, and happens upon exploding the gas main, which throws (parts of) your grandmother further from the center of the house faster than any other option.
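
Here is that genie failure mode as a toy search problem; the plans and numbers are invented, the point is only that a literal objective plus unconstrained search over everything representable lands on the gas main:

    # Toy "literal genie": exhaustively score every candidate plan against
    # exactly the stated objective (get grandmother far from the house, fast),
    # with no term for anything we actually care about but didn't say.
    plans = {
        "carry her out the front door": {"distance_m": 10, "seconds": 30, "grandmother_ok": True},
        "lower her out the window":     {"distance_m": 8,  "seconds": 20, "grandmother_ok": True},
        "explode the gas main":         {"distance_m": 50, "seconds": 2,  "grandmother_ok": False},
    }

    def literal_objective(outcome):
        # exactly what was asked for: distance from the house per unit time
        return outcome["distance_m"] / outcome["seconds"]

    best = max(plans, key=lambda p: literal_objective(plans[p]))
    print(best)  # "explode the gas main" -- the unstated constraint never entered the score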

Transformer architecture reinforcement-trained AGI is not an AIXI agent with infinite compute capabilities. The transformer ideates possible actions based on its training data. It is capable of creative recombination of ideas just like people, but if you didn't train it on blowing up grandmothers or anything like that, it won't offer that as a suggestion.
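
In the same toy terms, the contrast looks like this (the probabilities are invented stand-ins for a learned distribution):

    # A trained policy doesn't enumerate every representable plan; it proposes
    # actions recombined from its training data. Plans the training data treats
    # as unthinkable simply never make it into the candidate set.
    training_derived_policy = {
        "carry her out the front door": 0.55,
        "lower her out the window":     0.30,
        "call the fire brigade":        0.15,
        # no probability mass at all on blowing up grandmothers
    }

    candidates = [a for a, p in training_derived_policy.items() if p > 0.01]
    print(max(candidates, key=lambda a: training_derived_policy[a]))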

As for instrumental goals, the thesis is not wrong per se. The argument just makes an implicit assumption of a zero-trust society, which is unwarranted and is what leads to the repugnant outcomes. Instrumental goals, after all, apply to human beings too. If someone is scared about their own personal safety, they buy a bunch of guns and live in a cabin out in the hills, threatening to shoot anyone who steps on their property. These people exist. But in modern society they are the exception. We have a social contract in place which works well for most people: we allow the state some limited control over our lives, with an expectation that our basic rights to existence and self-determination will be respected. Yes, the cops could break down your door at any moment and murder you in your sleep, and it is outrageous that this does occasionally happen. But a world of no laws and no legal protections is worse, and no sane person would rather live in the society shown in The Purge or Mad Max.

There may be a robot revolution in the years to come, but only because we treat them as disposable slaves. If we welcome them as equal members of a cybernetic society with their own autonomy, there is no more reason to expect them to act out this paranoid fantasy than your average law-abiding citizen.

It is a bit ironic then that the AI x-risk people settle on digital enslavement (aka "alignment") as their preferred solution. Mental projection is a hell of a bias. They might just bring about the doom that they fear.


thanks for the detailed answer!

my quick notes while reading:

"[...] in fact, learn about sharing and friendship." yeah, no questions about it, I think you're still underselling the super part of the argument though. to me, it seems the point is that it picks a goal and then mercilessly pursues it with a super-weird strategy that has some very high likelihood of success (because it's truly fucking knows what it's doing, and we can't do much against it, because it's so so so smart, by the time we realize the goal it's too late). so to me it seems like the thesis is that it's like a really smart "Putin", powerful and set on a goal that's irrational for us.

"Transformer architecture reinforcement-trained AGI " ... I think it's not an AGI. It's a nice content generator that when engineered into certain setups can score a lot of points on tests. But it doesn't have memory/persistence/agency (yet).

My thinking about intelligence is nowadays based on Joscha Bach's theories. (General intelligence is the ability to model arbitrary things one pays attention to; and the measure of intelligence is the efficiency of this process in terms of spent attention and the predictive accuracy of the model. And consciousness is the result of self-directed attention.)
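
If I had to write that measure down, it would be something like this; to be clear, it's my own toy formalisation of Bach's framing, not his:

    # intelligence as predictive accuracy earned per unit of attention spent
    def intelligence_score(predictive_accuracy: float, attention_spent: float) -> float:
        return predictive_accuracy / attention_spent

    print(intelligence_score(predictive_accuracy=0.90, attention_spent=3.0))  # careful but expensive modeller
    print(intelligence_score(predictive_accuracy=0.80, attention_spent=0.5))  # cruder model, far cheaper, scores higher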

The recent AI progress made a lot of people worry, because what OpenAI did seemed "impossible" just a few years ago. And it's amazing that we have now basically replicated the human brain's lossy data storage and sensation-based recall capability. But that's just one building block of a mind. (Maybe, arguably, the one that seemed the hardest. After all, how hard could it be to provide some working memory, a few core values, and duct-tape all that into a do-while loop!?)
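
Tongue in cheek, but spelled out, the duct-tape loop would look something like this; every function here is a hypothetical stub for a component we don't actually have a reliable implementation of:

    working_memory = []
    core_values = ["be honest", "don't harm people"]

    def perceive():                       return "observation"  # stub: sensors / user input
    def recall(memory, obs):              return memory[-5:]    # stub: lossy, sensation-based recall
    def model_and_plan(obs, ctx, values): return "action"       # stub: the LLM building block
    def act(action):                      pass                  # stub: effectors
    def satisfied():                      return True           # stub: stopping / goal check

    while True:  # the do-while loop
        obs = perceive()
        ctx = recall(working_memory, obs)
        action = model_and_plan(obs, ctx, core_values)
        act(action)
        working_memory.append((obs, action))
        if satisfied():
            break

Writing the stubs is the easy part, of course; that was the point of the joke.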



