I wasn’t expecting to return to this topic quite so quickly.
When the announcement was made on the afternoon of the second full day of the Beneficial General Intelligence summit about the subjects for the “Interactive Working Group” round tables, I was expecting that a new set of topics would be proposed, different to those of the first afternoon. However, the announcement was simple: it would be the same topics again.
This time, it was a different set of people who gathered at this table – six new attendees, plus two of us – Roman Yampolskiy and myself – who had taken part in the first discussion.
(My notes from that first discussion are here, but you should be able to make sense of the following comments even if you haven’t read those previous notes.)
The second conversation largely went in a different direction to what had been discussed the previous afternoon. Here’s my attempt at a summary.
1. Why would a superintelligent AI want to kill large numbers of humans?
First things first. Set aside for the moment any thoughts of trying to control a superintelligent AI. Why would such an AI need to be controlled? Why would such an AI consider inflicting catastrophic harm on a large segment of humanity?
One answer is that an AI that is trained by studying human history will find lots of examples of groups of humans inflicting catastrophic harm on each other. An AI that bases its own behaviour on what it infers from human history might decide to replicate that kind of behaviour – though with more deadly impact (as the great intelligence it possesses will give it more ways to carry out its plans).
A counter to that line of thinking is that a superintelligent AI will surely recognise that such actions are contrary to humanity’s general expressions of moral code. Just because humans have behaved in a particularly foul way, from time to time, it does not follow that a superintelligent AI will feel that it ought to behave in a similar way.
At this point, a different reason becomes important. It is that the AI may decide that it is in its own rational self-interest to seriously degrade the capabilities of humans. Otherwise, humans may initiate actions that would pose an existential threat to the AI:
- Humans might try to switch off the AI, for any of a number of reasons
- Humans might create a different kind of superintelligent AI that would pose a threat to the first one.
That’s the background to a suggestion that was made during the round table: humans should provide the AI with cast-iron safety guarantees that they will never take actions that would jeopardise the existence of the AI.
For example (and this runs contrary to what many AI safety advocates propose), no remote tamper-proof switch-off mechanism should ever be installed in that AI.
Because of these guarantees, the AI will lose any rationale for killing large numbers of humans, right?
However, given the evident fickleness and unreliability of human guarantees throughout history, why would an AI feel justified in trusting such guarantees?
Worse, there could be many other reasons for an AI to decide to kill humans.
The analogy is that humans have lots of different reasons why they kill various animals:
- They fear that the animal may attack and kill them
- They wish to eat the animal
- They wish to use parts of the animal’s body for clothing or footwear
- They wish to reduce the population of the animals in question, for ecological management purposes
- They regard killing the animal as being part of a sport
- They simply want to use for another purpose the land presently occupied by the animal, and they cannot be bothered to relocate the animal elsewhere.
Even if an animal (assuming it could speak) promises to humans that it will not attack and kill them – the analogy of the safety guarantees proposed earlier – that still leaves lots of reasons why the animal might suffer a catastrophic fate at the hands of humans.
So also for the potential fate of humans at the hands of an AI.
2. Rely on an objective ethics?
Continuing the above line of thought, shouldn’t a superintelligent AI work out for itself that it would be ethically wrong for it to cause catastrophic harm to humans?
Consider what has been called “the expansion of humanity’s moral circle” over the decades (this idea has been discussed by Jacy Reese Anthis among others). That circle of concern has expanded to include people from different classes, races, and genders; more recently, greater numbers of animal species are being included in this circle of concern.
Therefore, shouldn’t we expect that a superintelligent AI will place humans within the circle of creatures for which it has moral concern?
However, this view assumes a central role for humans in any moral calculus. It’s possible that a superintelligent AI may use a different set of fundamental principles. For example, it may prioritise much greater biodiversity on earth, and would therefore drastically reduce the extent of human occupation of the planet.
Moreover, this view assumes that moral calculations have primacy within the overall decision-making processes followed by the AI. Instead, the AI may reason to itself:
- According to various moral considerations, humans should suffer no catastrophic harms
- But according to some trans-moral considerations, a different course of action is needed, in which humans would suffer that harm as a side-effect
- The trans-moral considerations take priority, therefore it’s goodbye to humanity
You may ask: what on earth is a trans-moral consideration? The answer is that the concept is hypothetical, and represents any unknown feature that emerges in the mind of the superintelligent AI.
It is, therefore, fraught with danger to assume that the AI will automatically follow an ethical code that prioritises human flourishing.
3. Develop an AI that is not only superintelligent but also superwise?
Again staying with this line of thought, how about ensuring that human-friendly moral considerations are deeply hard-wired into the AI that is created?
We might call such an AI not just “superintelligent” but “superwise”.
Another alternative name would be “supercompassionate”.
This innate programming would avoid the risk that the AI would develop a different moral (or trans-moral) system via its own independent thinking.
However, how can we be sure that the moral programming will actually stick?
The AI may observe that the principles we have tried to program into it are contradictory, or conflict with fundamental physical reality, in ways that humans had not anticipated.
To resolve that contradiction, the AI may jettison some or all of the moral code we tried to place into it.
We might try to address this possibility by including simpler, clearer instructions, such as “do not kill” and “always tell the truth”.
However, as works of fiction have frequently pointed out, simple-sounding moral laws are subject to all sorts of ambiguity and potential misunderstanding. (The writer Darren McKee provides an excellent discussion of this complication in his recent book Uncontrollable.)
That’s not to say this particular project is doomed. But it does indicate that a great deal of work remains to be done, in order to define and then guarantee “superwise” behaviours.
Moreover, even if some superintelligent AIs are created to be superwise, risks of catastrophic human harms will still arise from any non-superwise superintelligent AIs that other developers create.
4. Will a diverse collection of superintelligent AIs constrain each other?
If a number of different superintelligent AIs are created, what kind of coexistence is likely to arise?
One idea, championed by David Brin, is that the community of such AIs will adopt the practices of mutual monitoring and reciprocal accountability.
After all, that’s what happens among humans. We keep each other’s excesses in check. A human who disregards these social obligations may gain a temporary benefit, but will suffer exclusion sooner or later.
In this thinking, rather than just creating a “singleton” AI superintelligence, we humans should create a diverse collection of such beings. These beings will soon develop a system of mutual checks and balances.
However, that assumption sits uneasily with the concern raised in the previous section: the existence of superwise AIs may be insufficient to constrain the short-term actions of a non-superwise AI, so catastrophic harm may still befall humans.
For another historical analogy, consider what happened to the native peoples of North America when their continent was occupied not by just one European colonial power but by several competing ones. Did that multiplicity deter the various powers from inflicting huge casualties (intentionally and unintentionally) on the native peoples? Far from it.
In any case, a system of checks and balances relies on a rough equality in power between the different participants. That was the case during some periods in human history, but by no means always. And when we consider different superintelligent AIs, we have to bear in mind that the capabilities of any one of these might suddenly catapult forward, putting it temporarily into a league of its own. For that brief moment in time, it would be rationally enlightened for that AI to destroy or dismantle its potential competitors. In other words, the system would be profoundly unstable.
5. Might superintelligent AIs decide to leave humans alone?
(This part of the second discussion echoed what I documented as item 9 for the discussion on the previous afternoon.)
Once superintelligent AIs are created, they are likely to self-improve quickly, and they may soon decide that a better place for them to exist is somewhere far from the earth. That is, as in the conclusion of the film Her, the AIs might depart into outer space, or into some kind of inner space.
However, before they depart, they may still inflict damage on humans:
- Perhaps to prevent us from interfering with whatever system supports their inner space existence
- Perhaps because they decide to use large parts of the earth to propel themselves to wherever they want to go.
Moreover, given that they might evolve in ways that we cannot predict, it’s possible that at least some of the resulting new AIs will choose to stay on earth for a while longer, posing the same set of threats to humans as is covered in all the other parts of this discussion.
6. Avoid creating superintelligent AI?
(This part of the second discussion echoed what I documented as item 4 for the discussion on the previous afternoon.)
More careful analysis may determine a number of features of superintelligent AI that pose particular risks to humanity – risks that are considerably larger than those posed by existing narrow AI systems.
For example, it may be that it is general reasoning capability that pushes AI over the line from “sometimes dangerous” to “sometimes catastrophically dangerous”.
In that case, the proposal is:
- Avoid these features in the design of new generations of AI
- Avoid including in new generations of AI any features from which these particularly dangerous features might evolve or emerge
AIs that have these restrictions may nevertheless still be especially useful for humanity, delivering sustainable superabundance, including solutions to diseases, aging, economic deprivation, and exponential climate change.
However, even though some development organisations may observe and enforce these restrictions, it is likely that other organisations will break the rules – if not straightaway, then within a few years (or decades at the most). The attractions of more capable AIs will be too tempting to resist.
7. Changing attitudes around the world?
To take stock of the discussion so far (in both of the two roundtable sessions on the subject):
- A number of potential solutions have been identified that could reduce the risks of catastrophic harm
- This includes just building narrow AI, or building AI that is not only superintelligent but also superwise
- However, enforcing these design decisions on all AI developers around the world seems an impossible task
- Given the vast power of the AI that will be created, it just takes one rogue actor to imperil the entire human civilisation.
The next few sections consider various ways to make progress with the third point in that list.
The first idea is to spread clearer information around the world about the scale of the risks associated with more powerful AI. An education programme is needed such as the world has never seen before.
Good films and other media will help with this educational programme – although bad films and other media will set it back.
Examples of good media include the Slaughterbots videos made by FLI, and the film Ex Machina (which packs a bigger punch on a second viewing than on the first viewing).
As another comparison, consider also the 1983 film The Day After, which transformed public opinion about the dangers of a nuclear war.
However, many people are notoriously resistant to having their minds changed. The public reaction to the film Don’t Look Up is an example: many people continue to pay little attention to the risks of accelerating climate change, despite the powerful message of that film.
Especially when someone’s livelihood, or their sense of identity or tribal affiliation, is tied up with a particular ideological commitment, they are frequently highly resistant to changing their minds.
8. Changing mental dispositions around the world?
This idea might be the craziest on the entire list, but, to speak frankly, it seems we need to look for and embrace ideas which we would previously have dismissed as crazy.
The idea is to seek to change, not only people’s understanding of the facts of AI risk, but also their mental dispositions.
Rather than accepting the mix of anger, partisanship, pride, self-righteousness, egotism, vengefulness, deceitfulness, and so on, that we have inherited from our long evolutionary background, how about using special methods to transform our mental dispositions?
Methods are already known which can lead people into psychological transformation, embracing compassion, humility, kindness, appreciation, and so on. These methods include various drugs, supplements, meditative practices, and support from electronic and computer technologies.
Some of these methods have been discussed for millennia, whereas others have only recently become possible. The scientific understanding of these methods is still at an early stage, but it arguably deserves much more focus. Progress in recent years has been disappointingly slow at times (witness the unfounded hopes in this forward-looking article of mine from 2013), but that pattern is common for breakthroughs in technologies and therapies: progress can flip from disappointingly slow to shockingly fast.
The idea is that these transformational methods will improve the mental qualities of people all around the world, allowing us all to transcend our previous perverse habit of believing only the things that are appealing to our psychological weaknesses. We’ll end up with better voters and (hence) better politicians – as well as better researchers, better business leaders, better filmmakers, and better developers and deployers of AI solutions.
It’s a tough ask, but it may well be the right ask at this crucial moment in cosmic history.
9. Belt and braces: monitoring and sanctions?
Relying on people around the world changing their mental outlooks for the better – and not backtracking or relapsing into former destructive tendencies – probably sounds like an outrageously naïve proposal.
Such an assessment would be correct – unless the proposal is paired with a system of monitoring and compliance.
Knowing that they are being monitored can be a useful aid to encouraging people to behave better.
That encouragement will be strengthened by the knowledge that non-compliance will result in an escalating series of economic sanctions, enforced by a growing alliance of nations.
For further discussion of the feasibility of systems of monitoring and compliance, see scenario 4, “The narrow corridor: Striking and keeping the right balance”, in my article “Four scenarios for the transition to AGI”.
10. A better understanding of what needs to be changed?
One complication in this whole field is that the risks of AI cannot be managed in isolation from other dangerous trends. We’re not just living in a time of growing crisis; we’re living in what has been called a “polycrisis”:
Cascading and connected crises… a cluster of related global risks with compounding effects, such that the overall impact exceeds the sum of each part.
For one analysis of the overlapping set of what I have called “landmines”, see this video.
From one point of view, this insight complicates the whole situation with AI catastrophic risk.
But it is also possible that the insight could lead to a clearer understanding of a “critical choke point” where, if suitable pressure is applied, the whole network of cascading risks is made safer.
This requires a different kind of thinking: systems thinking.
And it will also require us to develop better analysis tools to map and understand the overall system.
These tools would be a form of AI. Created with care (so that their output can be verified and then trusted), such tools would make a vital difference to our ability to identify the right choke point(s) and to apply suitable pressure.
These choke points may turn out to be ideas already covered above: a sustained new educational programme, coupled with an initiative to assist all of us to become more compassionate. Or perhaps something else will turn out to be more critical.
We won’t know, until we have done the analysis more carefully.