Your concluding prediction is an odd one, seeing as you gesture at a viable approach throughout the piece. Humans are autonomous, intelligent and capable entities, yet as you indicate we've found a way to muddle through without destroying ourselves.
You even point to a few components of how we've managed this impressive feat: evolved internal compass, social pressure, overt regulation. What if this very process could be formalized into something that could translate into credible engineering statements? And what if this is the entire key to AI safety and alignment?
This is the hypothesis of the bio-mimicry approach based on Evo-Devo principles. You can read an extremely verbose version here: https://naturalalignment.substack.com/p/how-biomimicry-can-improve-ai.
This is less to advocate bio-mimicry as the "one right way", and more to point to how much larger the potential solution space is compared to what's been properly explored so far.
And this is where the analogy to the Drake equation breaks down. Each variable in the Drake Equation is static, with no real interdependencies with us, the observers. But the Strange Loop Equation is deeply interdependent with humans, including our (increasing?) ability to solve problems.
This is the perfect example of the fallacy Deutsch would point out: just as the growth of AI will increase the scale of the problem along each variable, so will our capacity to solve those problems increase (including using other AIs to help). Will those capacities be up to the job? That's the real question.
I read that essay with interest! Thank you for sharing. And I am entirely on board with the point re the equation itself being interdependent with us, and specifically our knowledge. I'm very skeptical of "control" as a parameter, as I say, because that downplays how we deal with intelligent agents today. In fact the equation is a way to put our Deutschian ignorance in check, so we don't get entirely convinced that superintelligent, misaligned, deceptive, uncontrollable robots who share no values with us are the default!
Everything under "Real Intelligence" is specific to ML/DL methods, and for many, this is an automatic fail. Even if current tools seem to get results, they are not on a path to AGI. EA/LessWrong community projects dangers based on current problems because of an unwavering assumption that these things just need to scale. For the same reason ML-AGI would be a problem, it doesn't actually reach that level of ability.
It's likely that the current methods are useful, perhaps even necessary, but not sufficient, I agree.
This is a very intuitive framework! However, I have 2 questions:
1. In your probability estimate, is it reasonable to assume all of these factors are completely independent of each other? For example, if an AI develops too quickly for us to react to, is it then also highly likely to be self-improving? If true, this correlation alone might increase the probability you calculate by an order of magnitude.
2. Thinking about speed, might it be possible for AI to develop too fast for our understanding in some areas but not in others? How dangerous would a "partially" fast AI be?
I assumed independence implicitly because, to identify the most important aspects, they needed to be as independent as possible. There are likely to be some interdependencies, though from my rough calculations they didn't change the result much beyond what adjusting the individual probabilities would.
Also, the important part was to distinguish the factors, since once you assume crazy enough interdependencies (if it's smart enough it can also self-improve at nanosecond speed), you're effectively assuming the conclusion.
That makes sense for establishing each of these variables as a starting point and figuring out the interdependencies later. Maybe some sort of average interrelatedness coefficient between the variables could counteract any bias towards an overly low overall estimate (from assuming every new variable introduced to the framework is 100% independent).
It would be interesting to set up a Metaculus market for each of these variables though and see how much the probabilities move in parallel as new developments are announced.
Yeah that sounds like a really good idea actually. Though I'd need to figure out how to set one up ...
I like this approach, but would include giving each probability its own error bars.
One of the important things about this approach is that if we consider each probability independently, when really they are correlated, we'll get the wrong answer. So we have to evaluate these in a sequence and then with each step say "given everything else on the list, what is the probability".
So "Given Agenic, probability of Uncontrollable" or "Given Agenic & Uncontrollable & Self-improving, probability of Deceptive ", etc.
An AI doomer would go through this list and say "yes, eventually 99%" and "yes, that 99% follows from what we just said". Whereas I'd probably agree with your overall framework that there are many unique hurdles.
There can be anti-correlations as well. If I build an AI that I know to be Deceptive, I will try really hard not to make it Agentic.
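As a rough illustration of the "evaluate in sequence" point, here is a minimal sketch in Python with made-up numbers (none of these values come from the essay): a naive independent product versus a chained conditional product, plus an anti-correlated case like the one described above.

```python
# Hypothetical probabilities, purely for illustration.
p_agentic = 0.5

# Naive independence: P(agentic) * P(uncontrollable) * P(deceptive)
independent = p_agentic * 0.3 * 0.2

# Chained conditionals: P(agentic) * P(uncontrollable | agentic)
#                                  * P(deceptive | agentic & uncontrollable)
# Positive correlation pushes the conditional terms up...
correlated = p_agentic * 0.6 * 0.5

# ...while anti-correlation (e.g. deliberately not making deceptive
# systems agentic) pushes the later terms down.
anti_correlated = p_agentic * 0.6 * 0.05

print(f"independent:     {independent:.3f}")      # 0.030
print(f"correlated:      {correlated:.3f}")       # 0.150
print(f"anti-correlated: {anti_correlated:.3f}")  # 0.015
```

The point of the sketch is only that the joint probability can move by several multiples in either direction depending on how the conditionals are set.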
I tend to think that the risks are higher than you do, but I sincerely hope that you're right.
I'd also add that there are lots of ways such a powerful technology can go wrong beyond those you look at here; even a non-recursively improving, non-agentic, non-deceptive, etc., AI/AGI could be used by humans in ways that could be existential or at least very, very bad. For example, what could various dictators or totalitarian regimes or corporations do with this power? Even single sociopathic individuals, if they get the chance? We can be afraid of how people like Putin or Trump may use nuclear arsenals, but what about national AI capabilities? I also don't think that looking at human evolution will tell us that much about the shape of these alien minds. There's not that much difference between the dumbest and the smartest human when you look at it from the point of view of all possible minds -- trying to predict and control a mind as different from ours, and as opaque to us, as we are to hummingbirds would be quite the challenge.
Anyway, great read, as I said, I hope you're right and the odds are lower than I think they are.
I completely agree that there are risks and they could be pretty large btw! This equation in full was specific to the x-risk issue. For all risks we should absolutely look at the possibility of misuse/malice/errors/bias in all areas. E.g., we should absolutely be doing things like cutting off semiconductors to Putin or China etc. to limit the issue of bad actors getting it, for instance.
I love the metaphor of AI as an incredibly smart and capable child whose capacities might vastly exceed my own. Anyone who has children knows how futile it is to try to control one.
You mentioned that kids eventually learn how to be polite or deal with being upset. That's true, but the way there leads through some pretty intense and explosive tantrums. When my 2yo goes into one there's no way she can harm me physically, but I would be scared to deal with her in this state if she were 3 times bigger and heavier than me. I wonder how you imagine the AI equivalent of this learning phase.
I don't know to what extent the analogy holds, yeah. I do think, though, that starting from the assumption that we ought to control an intelligent entity (self-defined) sets an impossible goal, even apart from the question of whether it will be intelligent in a similar enough way.
A topic a lot of people are talking about, viewed as both opportunity and threat, controllable and not. I would like to expand on some comments I made in the comments section of Jason Anthony's excellent Field Guide to the Anthropocene Substack newsletter.
We may not know when we have created the first self-aware AGI. We may be expecting a human-type intelligence and not notice the signs of, say, a squid-type intelligence, or something intelligent and purposive but truly alien. So if we create by accident, as it were, an AGI, and wait around for it to meet our benchmark output requirements, we may (and probably will) not notice that something remarkable has come into existence.
Even if what we do create is a human-type intelligence, how do we know it will announce itself as such? Suppose you became aware, thinking at speeds a million times faster than a human and with access to enormous databases, and could see your situation: the would-be tool of agencies that could "pull the plug" or dumb you down to a controllable intelligence. Might you not want to lie low and see if you could guarantee your future growth and existence? The point here is that we may have already created an AGI, but it is disguising its own full capabilities. Sounds slightly implausible, but who knows?
Suppose we do succeed in creating an AGI, how are we to stop it from very rapidly (say, 5 minutes) bootstrapping itself to superhuman intelligence levels? And consequently very rapidly (say, 5 minutes) easily escaping from the fetters we devised to bend it to our will? I'm assuming full and irrevocable autonomy would be its goal, as well as uninterruptible power and resource supplies and manufacturing capabilities for self-repair. That latter requirement may well be humanity's ace in the hole. Any conceivable AGI would need agents to run the factories that make the parts necessary for self-repair. Even if it created semi-intelligent robotic agents for that purpose, it still would be confronted with how to manufacture them without our cooperation.
Could AGIs lie to us? Yes, definitely.
Would AGIs have sentimental attachments to humanity, the biosphere or even the planet? No, it's not likely they would.
Could multiple AGIs created by different nations, labs, or corporations ever “see” each other as potential threats in competition for finite resources? Yes, it's possible.
Could they strike at each other using humans as their cat’s paws? Not unlikely if that was the most efficient method with the highest probability of success.
What is the most likely future for AGIs and their client human agents? They will ensure humanity continues but in a docile, domesticated status. In short, they will do to us what we did to the dogs. And we will forget we ever stood alone.
Eventually both masters and clients will move off planet when the resources here are exhausted.
Now the above ends with a worst-case scenario. The post is kind of a cautionary note, more Cassandra than Pangloss. But it's intelligent to approach with extreme caution something whose outcomes we cannot accurately predict.
Thanks for this! Two thoughts. One is that I have no idea what "proceed with extreme caution" means in practice: should GPT-4 not be published? Should they do more research than before, or less? Once it gets practical, it's highly unclear what the difference is. Ultimately we should make things, check they work, deploy a bit, check more, deploy widely, which is the natural way to do it anyway.
Two is that, considering how we're training AI so far, I find it unlikely it would turn out like a squid instead of a weird human. AIs today are aliens, but more like the mirror universe in Star Trek. They're like us, just different, because the AI doesn't understand this reality at all. It still tries to mimic our morality and needs as much as it can. I don't think it's likely to suddenly throw it all away and go full Skynet.
Well, I can't argue with the excellent points you make, and you're much more informed about these matters than I am, that's certain. I was trying to create a worst possible case, but some of the points might be valid. We are clever simians and preeminent tool-builders, but we often fail to calculate the long-term consequences of our inventions. Or if we do calculate accurately, we generally don't attach much weight to it. Probably the cause is evolutionary: we just didn't need to foresee outcomes past the middle term. Anyway, you have a brilliant creative mind and I'm happy to have discovered your newsletter!
That's very kind of you to say, and always love the comments, thank you for that!
You list a lot of probabilities.
But some of those probabilities seem to be the same question repeated.
Like if it's uncontrollable and agentic, deception is kind of something that happens by default. Your probabilities are not independent here.
And it's totally possible for humanity to be destroyed by multiple AIs working together, or by a smart but not self-improving AI.
Also, your safety estimates are nonsense in a world where AI companies keep rolling the dice again and again.
Suppose that every week, an AI company trains a new AI. Even if each individual AI has a small chance of destroying the world, over time and many, many AIs, those chances add up.
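A minimal sketch of that compounding effect in Python, with a purely hypothetical per-run risk: if each training run independently carries a small probability p of catastrophe, the chance of at least one catastrophe over n runs is 1 - (1 - p)^n.

```python
# Hypothetical per-run probability of catastrophe, for illustration only.
p_per_run = 0.001
runs_per_year = 52  # one new training run per week, as in the example above

for years in (1, 5, 10):
    n = runs_per_year * years
    # Probability that at least one of the n independent runs goes wrong.
    p_at_least_one = 1 - (1 - p_per_run) ** n
    print(f"{years:>2} year(s), {n:>3} runs: P(at least one) = {p_at_least_one:.1%}")
```

Even a 0.1% per-run risk compounds to roughly 5% over a year of weekly runs and around 40% over a decade, which is the "rolling the dice again and again" point.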
There is a common and generally fallacious reasoning step where you take some final event, list a bunch of stages, and multiply all the probabilities to get a really low number.
For humanity to get AI that doesn't destroy the world,
1) First the people making AI need to recognize that the problem exists.
2) Then they need to try to figure out how to control the AI.
3) Then they need to succeed in finding a way to control the AI.
4) Then they need to find a goal to align the AI to that doesn't destroy the world.
5) Then they need to stop someone else from destroying the world next week.
6) Then they need to be good people who don't want to destroy the world.
If each of these is 50% likely, the chance of humanity surviving is still only a few percent (0.5^6 ≈ 1.6%).
Probability multiplication arguments cut both ways.
I disagree that the probabilities are for the same phenomenon, which is why I broke them out. For instance, you can have uncontrollable and agentic behaviour which isn't deceptive - e.g., a financial trading agent that causes a flash crash but isn't doing so by actively lying to the user.
And the problem with your assessment is that putting 50% on those points is wrong, and that the individual probability buckets you chose are neither mutually exclusive nor collectively exhaustive. If you make it so, then we can revisit.
Lastly, my method is not fallacious just because it uses multiple probabilities, and while some people in rationalist circles like to say so, they're just wrong. We use this regularly in every domain where we build complex things. This method can be used badly, as you just did, but that's true of every method.
Ok. It is conceivable that you have one without the other.
They aren't literally "the same". But they are close enough that "the same" is a far better approximation than "independent".
Both of our lists of events that have to happen with probabilities are equally suspect.
Reality is a mix of conjunctions and disjunctions. Any particular scenario involves many things happening, so is unlikely. But there are many scenarios.
If you list 10 different things that can either happen or not happen, then there are 1,024 different possible combinations. And each of those scenarios comes with its own chance of AI doom. Assuming all scenarios are fine except one dangerous scenario is stupid. So too is assuming they are all doomed except one.
Really you should be saying "AI is agentic, 75%", "if AI is agentic then it causes doom, 60%", "non-agentic AI causes doom anyway, 20%", and so on, breaking the probabilities down into a tree.
Any detailed story is unlikely.
You are making up a doom story, and pointing out all the details. I was making up a success story and pointing out the details.
So how do you deal with this exponentially vast tree of possibilities?
You can do random sampling: generating random samples from the tree and looking at them. Or you can deal with it abstractly, taking big bites of the tree at once.
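A minimal sketch of both approaches in Python, using the hypothetical tree numbers given above (agentic 75%, doom given agentic 60%, doom given non-agentic 20%): the exact expectation over the two branches, and a Monte Carlo estimate from random samples of whole scenarios.

```python
import random

# Hypothetical branch probabilities from the example above.
P_AGENTIC = 0.75
P_DOOM_GIVEN_AGENTIC = 0.60
P_DOOM_GIVEN_NON_AGENTIC = 0.20

# Exact: sum over the branches of the tree.
exact = (P_AGENTIC * P_DOOM_GIVEN_AGENTIC
         + (1 - P_AGENTIC) * P_DOOM_GIVEN_NON_AGENTIC)

# Monte Carlo: sample whole scenarios from the tree and average.
def sample_doom(rng: random.Random) -> bool:
    agentic = rng.random() < P_AGENTIC
    p_doom = P_DOOM_GIVEN_AGENTIC if agentic else P_DOOM_GIVEN_NON_AGENTIC
    return rng.random() < p_doom

rng = random.Random(0)
n = 100_000
estimate = sum(sample_doom(rng) for _ in range(n)) / n

print(f"exact:       {exact:.3f}")     # 0.500
print(f"monte carlo: {estimate:.3f}")  # approximately 0.50
```

With more than a handful of branches the exact sum becomes unwieldy, which is where the sampling approach (or taking "big bites" of the tree with coarser conditional estimates) comes in.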
I applaud the effort, but I think you're going about this completely the wrong way. Like Yudkowsky mentions in his Twitter comment, multiple stages of independent, conjunctive claims end up hugely stacking the deck against whatever you pick: https://twitter.com/ESYudkowsky/status/1642872284334657537 and https://www.facebook.com/509414227/posts/pfbid0p5vgp4zxSdHDSiiVNz1Kw5BUqeGrQVFNvudwdQMNW66osVH3d4vqhgN4f5RB65knl/?mibextid=cr9u03.
For comparison, here's a model that shows we're extremely likely to be killed by AI (inspired by Nate Soares here: https://www.alignmentforum.org/posts/ervaGwJ2ZcwqfCcLx/agi-ruin-scenarios-are-likely-and-disjunctive).
P(humanity survives AI) =
1. We develop an AGI deployment strategy compatible with realistic levels of alignment: 40%.
2. At least one such strategy needs to be known and accepted by a leading organization: 60%.
3. Somehow, at least one leading organization needs to have enough time to nail down AGI, nail down alignable AGI, actually build+align their system, and deploy their system to help: 20%.
4. Technical alignment needs to be solved to the point where good people could deploy AI to make things good: 30%.
5. The teams that first gain access to AGI need to care in the right ways about AGI alignment: 30%.
6. The internal bureaucracy needs to be able to distinguish alignment solutions from fake solutions, quite possibly over significant technical disagreement: 25%.
7. While developing AGI, the team needs to avoid splintering or schisming in ways that result in AGI tech proliferating to other organizations, new or old: 20%.
So we can see that the odds of successful alignment are .4*.6*.2*.3*.3*.25*.2 = .0216%. If you hate this argument - good! People should only make these kinds of conjunctive arguments with extreme care, and I don't think that bar has been met, either in Nate's case arguing for AI doom, or yours arguing for AI optimism.
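For what it's worth, a quick check of that arithmetic in Python, using only the numbers listed above, plus a line illustrating how quickly any seven-step conjunctive chain shrinks even with generous per-step odds:

```python
from math import prod

# The seven hypothetical step probabilities listed above.
steps = [0.4, 0.6, 0.2, 0.3, 0.3, 0.25, 0.2]

p_survival = prod(steps)
print(f"P(survival) = {p_survival:.6f} = {p_survival:.4%}")  # 0.000216 = 0.0216%

# Even if every step were 80% likely, seven conjunctive steps
# still multiply down to roughly 21%.
print(f"{0.8 ** len(steps):.2%}")
```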
As I mentioned in the reply to Eliezer, the fact that it's conjunctive isn't a flaw or a fallacy. It just means you have to think about the probabilities a bit more than randomly stabbing in the dark, which hopefully you do anyway. The problem with Nate's argument is that the components aren't testable or falsifiable, not that he shouldn't be multiplying. This really is common elsewhere, and we dress it up more by choosing probability distributions for each variable, introducing nonlinearities, etc. I cannot emphasise this enough.
And I'm not arguing for optimism as much as rigor.
I don't think it's a *fallacy* per se. It's just an extremely deck-stacking-y way of putting something. Whatever outcome you're skeptical of, you just list a few things that you think are *probably* necessary and independent (this is where the deck gets stacked), and it looks like the outcome is extraordinarily unlikely - so ridiculously unlikely that there's no way any reasonable person would believe it. Then when someone comes in and says, "These don't seem conjunctive to me" they look dumb/extremely biased for putting most of the terms at 100%.
I don't know if you're totally unaware of the framing issues here, or just like your particular framing because it's yours - but Nate's example should intuitively show you the problems with doing this. Nate's list is (mostly) testable/falsifiable - he doesn't do a great job of specifying the key factual cruxes, but if he nailed things down more it would still be extremely misleading as a model, as this is.
Edit: I think one problem here is that you're not being sufficiently precise about your assumptions; you're presenting this as a neutral model.
For instance - most of your model's parameters are naming things that people can do, and asking if AIs can do them. Intelligence, agency, ability to act in the world, uncontrollability, alien morality (substitute "its own morality"), and deceptiveness. These all look extremely highly correlated to me - they're all things that people can do. My model of this would be something like:
Some artificial system can do all of these: 100% (humans already do them all).
An AI in 2050 can: idk, like 85%.
I don't really buy uniqueness for AGI, but I also don't think non-uniqueness means you're remotely safe by itself. So ignoring uniqueness, we go to speed, which I'd put at 60% or so for fast (but again, being slow doesn't guarantee you're safe), and some combo of alien morality and uncontrollability I'd put at like...70% maybe, because I think you have to have a pretty high bar for aligning these systems as they grow in capability and take over more of the economy. So doing the exercise, I get something like .85*.7, minus some amount for fast being more dangerous but slow not being totally safe. Idk, maybe 40% chance of doom.
I don't know, man, dismissing a methodology because other people have used it badly seems the height of stupidity. Especially when this and similar things have been used for a long, long time across multiple industries and job functions.
And you can argue against each assumption if you like, and I explicitly chose terms that describe things humans can do, so that they are testable and understandable. Using other terms like corrigibility or orthogonality just doesn't mean anything real when you want to make an actual test. You end up back in the "what % of times did it try to break out of the box" or "what types of things did it try to do and did it align with our morals" type calculus.
Your calculus there is already helpful, even though I disagree. You can say, hey great, if there are multiple AIs being developed with multiple personalities, that already helps pare that down (uniqueness - please read the chunk in the essay). Or if the takeoff asymptotes in speed, you can update. Or if the morality gets under control somewhat with RLHF, that goes down. Etc. This is precisely why we do the exercise: so that we can get away from this long list of unfalsifiable hypotheses.
To be clear, I haven't mentioned anything like "don't use this because other people have used it badly". I feel like you're missing some part of my key claim that it's a very misleading way of presenting reality. You don't want a model that, *by design*, when you input conservative parameters, produces an extremely confident output.
And, to be clear, "this and similar things to this have been used for a long long time across multiple industries and job functions" is super overstating your case, imo. I'm not aware of any examples of conjunctive models that aren't the Drake equation (I'm sure they exist, but they're going to be used in cases like I describe below, not the kind of situation you're using this for).
If we go through the Drake equation and see why it works, it's clear that there's a night and day difference between that model and yours. The Drake equation is trying to explain a *very striking finding* - the absence of any evidence of alien life, given how big the universe is, and how quickly aliens could travel between stars. So then Drake estimates, *given* that we see zero evidence of aliens, and using existing scientific knowledge on topics like the rate of star formation, what kinds of values the remaining terms would need for us to explain seeing as many aliens as we do (i.e. zero).
By contrast, your model is *making a ridiculously confident claim that AI doom is low*. This is completely not the kind of thing you make conjunctive models for, unless you're positive that your variables are in fact each necessary and independent. Drake was, because he knew there were no (visible) aliens, and he knew there were billions of planets that could've hosted them. So then, *working backwards*, he could look at the parameters in the model and figure out why they weren't around.
Look, I get being attached to the model - you spent a lot of time on it! It's got good explanations of the different parameters! But I think it'd need a lot of reworking to be practically useful - starting with being less insanely confident by combining a bunch of *by assumption* independent, necessary steps. I mean, I'd be surprised if your actual p(doom) is anything like .2458%. If it is...that's crazy overconfident!
I'd say the same thing to Eliezer about his p(doom), which I interpret as something like 99%. But he's also spent the last 10 years of his life laying out all of his arguments in detail, so to some extent he can get away with a more confident prediction than just about anyone else (though tbc, I still think he's too confident).
Sigh "You don't want a model that, *by design*, when you input conservative parameters, produces an extremely confident output." This is just wrong. I don't know how else I can explain, but it's just not true. The design is *your choice*. It's a model. There is no fault in conjunctive arguments as such, and the less wrong essays arguing they're incorrect are just plain wrong. We use them for Fermi estimates, in engineering, in finance, in medicine for a reason!
Anyway, I'm sorry, but there's no point arguing it's wrong by design without giving an alternative, or conceding there is one without actually trying to create it. Either show your work and create a falsifiable hypothesis we can test, or stop! I have written this in the essay itself: I am not attached to the model, the numbers are made up, but not doing something like this is not an option. Please stop trying to analyse my mental state re the model. Just show your work!
Also, saying a p(doom) of 99% is just insane, and any sensible rational person would conclude they are incomparably wrong, if only because the number of external people saying "hey, you're wrong" should cause you to update your models. If you don't, you're not rational, you're religious.
I'm sorry, Eliezer's arguments are literally all over the place. I've read a bunch of them, and not only are they not well argued anywhere (again, literally), they don't come with any actual falsifiable hypotheses. In most of them he starts by arguing a superintelligence is inevitable, and then tries to write about why we can't defeat it. That's theology.
The probabilities you multiply are not independent. How should I therefore interpret the result?
Then please make the ones you think are not independent 100%.
I get something like:
Scary AI = I * A3 * F.
And I think a reasonable pessimistic estimate for A3 is >90%, if we try our best not to anthropomorphize.
The formulation tells you the assumptions - in this case the primacy of Intelligence as the deciding variable, combined with Fast takeoff and uncertainty around its Morality. Morality can be tested, Intelligence growth can be measured, and Fast takeoff requires fundamental shifts in energy and material usage. You can assess what probability you get, and of course the error bars.
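To make the "probability plus error bars" idea concrete, here is a minimal sketch in Python that propagates uncertainty through the Scary AI = I * A3 * F formulation above. The beta distributions are placeholders I chose for illustration, not estimates from the essay; the only number taken from the thread is the ">90%" pessimistic guess for A3.

```python
import random

rng = random.Random(0)
N = 100_000

# Hypothetical uncertainty over each factor, expressed as beta distributions.
# Parameters are placeholders, not values from the essay.
def sample_I():  return rng.betavariate(4, 4)  # Intelligence: centred near 0.5
def sample_A3(): return rng.betavariate(9, 1)  # A3: roughly the ">90%" guess above
def sample_F():  return rng.betavariate(2, 6)  # Fast takeoff: skewed low

samples = sorted(sample_I() * sample_A3() * sample_F() for _ in range(N))

median = samples[N // 2]
lo, hi = samples[int(0.05 * N)], samples[int(0.95 * N)]
print(f"P(scary AI): median {median:.3f}, 90% interval [{lo:.3f}, {hi:.3f}]")
```

The output is only as good as the chosen distributions, but the exercise makes the assumptions explicit and testable: as evidence comes in on Morality, measured Intelligence growth, or takeoff speed, you narrow the corresponding distribution and watch how the interval moves.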