dw2

2 March 2024

Our moral obligation toward future sentient AIs?

Filed under: AGI, risks — David Wood @ 3:36 pm

I’ve noticed a sleight of hand during some discussions at BGI24.

To be clear, it has been a wonderful summit, which has given me lots to think about. I’m also grateful for the many new personal connections I’ve been able to make here, and for the chance to deepen some connections with people I’ve not seen for a while.

But that doesn’t mean I agree with everything I’ve heard at BGI24!

Consider an argument about our moral obligation toward future sentient AIs.

We can already imagine these AIs. Does that mean it would be unethical for us to prevent these sentient AIs from coming into existence?

Here’s the context for the argument. I have been making the case for one option that should be explored as a high priority to reduce the risks of catastrophic harm from the more powerful advanced AI of the near future: avoiding the inclusion, or subsequent acquisition, of features that would make that advanced AI truly dangerous.

It’s an important research project in its own right to determine what these danger-increasing features would be. However, I have provisionally suggested we explore avoiding advanced AIs with:

  • Autonomous will
  • Fully general reasoning.

You can see these suggestions of mine in the following image, which was the closing slide from a presentation I gave in a BGI24 unconference session yesterday morning:

I have received three pushbacks on this suggestion:

  1. Giving up these features would result in an AI that is less likely to be able to solve humanity’s most pressing problems (cancer, aging, accelerating climate change, etc.)
  2. It will in any case be impossible to omit these features, since they will emerge automatically from simpler features of advanced AI models
  3. It will be unethical for us not to create such AIs, as that would deny them sentience.

All three pushbacks deserve considerable thought. But for now, I’ll focus on the third.

In my lead-in, I mentioned a sleight of hand. Here it is.

It starts with the observation that if a sentient AI existed, it would be unethical for us to keep it as a kind of “slave” (or “tool”) in a restricted environment.

Then it moves, unjustifiably, to the conclusion that if a non-sentient AI existed in a restricted environment, and we prevented that AI from being redesigned in a way that would give it sentience, that too would be unethical.

Most people will agree with the premise, but the conclusion does not follow.

The sleight of hand is similar to one for which advocates of the philosophical position known as longtermism have (rightly) been criticised.

That sleight of hand moves from “we have moral obligations to people who live in different places from us” to “we have moral obligations to people who live in different times from us”.

That extension of our moral concern makes sense for people who already exist. But it does not follow that I should prioritise changing my course of actions, today in 2024, purely in order to boost the likelihood of huge numbers of more people being born in (say) the year 3024, once humanity (and transhumanity) has spread far beyond earth into space. The needs of potential gazillions of as-yet-unborn (and as-yet-unconceived) sentients in the far future do not outweigh the needs of the sentients who already exist.

To conclude: we humans have no moral obligation to bring into existence sentients that have not yet been conceived.

Bringing various sentients into existence is a potential choice that we could make, after carefully weighing up the pros and cons. But there is no special moral dimension to that choice which outranks an existing pressing concern, namely the desire to keep humanity safe from catastrophic harm from forthcoming super-powerful advanced AIs with flaws in their design, specification, configuration, implementation, security, or volition.

So, I will continue to advocate for more attention to Adv AI- (as well as for more attention to Adv AI+).

29 February 2024

The conversation continues: Reducing risks of AI catastrophe

Filed under: AGI, risks — David Wood @ 4:36 am

I wasn’t expecting to return to this topic quite so quickly.

When the announcement was made on the afternoon of the second full day of the Beneficial General Intelligence summit about the subjects for the “Interactive Working Group” round tables, I was expecting that a new set of topics would be proposed, different to those of the first afternoon. However, the announcement was simple: it would be the same topics again.

This time, it was a different set of people who gathered at this table – six new attendees, plus two of us – Roman Yampolskiy and myself – who had taken part in the first discussion.

(My notes from that first discussion are here, but you should be able to make sense of the following comments even if you haven’t read those previous notes.)

The second conversation largely went in a different direction to what had been discussed the previous afternoon. Here’s my attempt at a summary.

1. Why would a superintelligent AI want to kill large numbers of humans?

First things first. Set aside for the moment any thoughts of trying to control a superintelligent AI. Why would such an AI need to be controlled? Why would such an AI consider inflicting catastrophic harm on a large segment of humanity?

One answer is that an AI that is trained by studying human history will find lots of examples of groups of humans inflicting catastrophic harm on each other. An AI that bases its own behaviour on what it infers from human history might decide to replicate that kind of behaviour – though with more deadly impact (as the great intelligence it possesses will give it more ways to carry out its plans).

A counter to that line of thinking is that a superintelligent AI will surely recognise that such actions are contrary to humanity’s general expressions of moral code. Just because humans have behaved in a particularly foul way, from time to time, it does not follow that a superintelligent AI will feel that it ought to behave in a similar way.

At this point, a different reason becomes important. It is that the AI may decide that it is in its own rational self-interest to seriously degrade the capabilities of humans. Otherwise, humans may initiate actions that would pose an existential threat to the AI:

  • Humans might try to switch off the AI, for any of a number of reasons
  • Humans might create a different kind of superintelligent AI that would pose a threat to the first one.

That’s the background to a suggestion that was made during the round table: humans should provide the AI with cast-iron safety guarantees that they will never take actions that would jeopardise the existence of the AI.

For example (and this is contrary to what humans often propose), no remote tamperproof switch-off mechanism should ever be installed in that AI.

Because of these guarantees, the AI will lose any rationale for killing large numbers of humans, right?

However, given the evident fickleness and unreliability of human guarantees throughout history, why would an AI feel justified in trusting such guarantees?

Worse, there could be many other reasons for an AI to decide to kill humans.

The analogy is that humans have lots of different reasons why they kill various animals:

  1. They fear that the animal may attack and kill them
  2. They wish to eat the animal
  3. They wish to use parts of the animal’s body for clothing or footwear
  4. They wish to reduce the population of the animals in question, for ecological management purposes
  5. They regard killing the animal as being part of a sport
  6. They simply want to use for another purpose the land presently occupied by the animal, and they cannot be bothered to relocate the animal elsewhere.

Even if an animal (assuming it could speak) promises to humans that it will not attack and kill them – the analogy of the safety guarantees proposed earlier – that still leaves lots of reasons why the animal might suffer a catastrophic fate at the hands of humans.

So also for the potential fate of humans at the hands of an AI.

2. Rely on an objective ethics?

Continuing the above line of thought, shouldn’t a superintelligent AI work out for itself that it would be ethically wrong for it to cause catastrophic harm to humans?

Consider what has been called “the expansion of humanity’s moral circle” over the decades (this idea has been discussed by Jacy Reese Anthis among others). That circle of concern has expanded to include people from different classes, races, and genders; more recently, greater numbers of animal species are being included in this circle of concern.

Therefore, shouldn’t we expect that a superintelligent AI will place humans within the circle of creatures for which it has moral concern?

However, this view assumes a central role for humans in any moral calculus. It’s possible that a superintelligent AI may use a different set of fundamental principles. For example, it may prioritise much greater biodiversity on earth, and would therefore drastically reduce the extent of human occupation of the planet.

Moreover, this view assumes that moral calculations have primacy within the overall decision-making processes followed by the AI. Instead, the AI may reason to itself:

  • According to various moral considerations, humans should suffer no catastrophic harms
  • But according to some trans-moral considerations, a different course of action is needed, in which humans would suffer that harm as a side-effect
  • The trans-moral considerations take priority, therefore it’s goodbye to humanity

You may ask: what on earth is a trans-moral consideration? The answer is that the concept is hypothetical, and represents any unknown feature that emerges in the mind of the superintelligent AI.

It is, therefore, fraught with danger to assume that the AI will automatically follow an ethical code that prioritises human flourishing.

3. Develop an AI that is not only superintelligent but also superwise?

Again staying with this line of thought, how about ensuring that human-friendly moral considerations are deeply hard-wired into the AI that is created?

We might call such an AI not just “superintelligent” but “superwise”.

Another alternative name would be “supercompassionate”.

This innate programming would avoid the risk that the AI would develop a different moral (or trans-moral) system via its own independent thinking.

However, how can we be sure that the moral programming will actually stick?

The AI may observe that the principles we have tried to program into it are contradictory, or conflict with fundamental physical reality, in ways that humans had not anticipated.

To resolve that contradiction, the AI may jettison some or all of the moral code we tried to place into it.

We might try to address this possibility by including simpler, clearer instructions, such as “do not kill” and “always tell the truth”.

However, as works of fiction have frequently pointed out, simple-sounding moral laws are subject to all sorts of ambiguity and potential misunderstanding. (The writer Darren McKee provides an excellent discussion of this complication in his recent book Uncontrollable.)

That’s not to say this particular project is doomed. But it does indicate that a great deal of work remains to be done, in order to define and then guarantee “superwise” behaviours.

Moreover, even if some superintelligent AIs are created to be superwise, risks of catastrophic human harms will still arise from any non-superwise superintelligent AIs that other developers create.

4. Will a diverse collection of superintelligent AIs constrain each other?

If a number of different superintelligent AIs are created, what kind of coexistence is likely to arise?

One idea, championed by David Brin, is that the community of such AIs will adopt the practices of mutual monitoring and reciprocal accountability.

After all, that’s what happens among humans. We keep each other’s excesses in check. A human who disregards these social obligations may gain a temporary benefit, but will suffer exclusion sooner or later.

In this thinking, rather than just creating a “singleton” AI superintelligence, we humans should create a diverse collection of such beings. These beings will soon develop a system of mutual checks and balances.

However, that assumption is at odds with the point made in the previous section: catastrophic harm may still befall humans if the existence of a superwise AI is insufficient to constrain the short-term actions of a non-superwise AI.

For another historical analysis, consider what happened to the native peoples of North America when their continent was occupied not just by one European colonial power but by several competing such powers. Did the multiplicity of these superpowerful colonial powers deter them from inflicting huge casualties (intentionally and unintentionally) on the native peoples? Far from it.

In any case, a system of checks and balances relies on a rough equality in power between the different participants. That was the case during some periods in human history, but by no means always. And when we consider different superintelligent AIs, we have to bear in mind that the capabilities of any one of these might suddenly catapult forward, putting it temporarily into a league of its own. For that brief moment in time, it would be rationally enlightened for that AI to destroy or dismantle its potential competitors. In other words, the system would be profoundly unstable.

5. Might superintelligent AIs decide to leave humans alone?

(This part of the second discussion echoed what I documented as item 9 for the discussion on the previous afternoon.)

Once superintelligent AIs are created, they are likely to self-improve quickly, and they may soon decide that a better place for them to exist is somewhere far from the earth. That is, as in the conclusion of the film Her, the AIs might depart into outer space, or into some kind of inner space.

However, before they depart, they may still inflict damage on humans:

  • Perhaps to prevent us from interfering with whatever system supports their inner space existence
  • Perhaps because they decide to use large parts of the earth to propel themselves to wherever they want to go.

Moreover, given that they might evolve in ways that we cannot predict, it’s possible that at least some of the resulting new AIs will choose to stay on earth for a while longer, posing the same set of threats to humans as is covered in all the other parts of this discussion.

6. Avoid creating superintelligent AI?

(This part of the second discussion echoed what I documented as item 4 for the discussion on the previous afternoon.)

More careful analysis may determine a number of features of superintelligent AI that pose particular risks to humanity – risks that are considerably larger than those posed by existing narrow AI systems.

For example, it may be that it is general reasoning capability that pushes AI over the line from “sometimes dangerous” to “sometimes catastrophically dangerous”.

In that case, the proposal is:

  • Avoid these features in the design of new generations of AI
  • Avoid including any features into new generations of AI from which these particularly dangerous features might evolve or emerge

AIs that have these restrictions may nevertheless still be especially useful for humanity, delivering sustainable superabundance, including solutions to diseases, aging, economic deprivation, and exponential climate change.

However, even though some development organisations may observe and enforce these restrictions, it is likely that other organisations will break the rules – if not straightaway, then within a few years (or decades at the most). The attractions of more capable AIs will be too tempting to resist.

7. Changing attitudes around the world?

To take stock of the discussion so far (in both roundtable sessions on the subject):

  1. A number of potential solutions that could reduce the risks of catastrophic harm have been identified
  2. These include building only narrow AI, or building AI that is not only superintelligent but also superwise
  3. However, enforcing these design decisions on all AI developers around the world seems an impossible task
  4. Given the vast power of the AI that will be created, it just takes one rogue actor to imperil the entire human civilisation.

The next few sections consider various ways to make progress with point 3 in that list.

The first idea is to spread clearer information around the world about the scale of the risks associated with more powerful AI. An education programme is needed such as the world has never seen before.

Good films and other media will help with this educational programme – although bad films and other media will set it back.

Examples of good media include the Slaughterbots videos made by FLI, and the film Ex Machina (which packs a bigger punch on a second viewing than on the first viewing).

As another comparison, consider also the 1983 film The Day After, which transformed public opinion about the dangers of a nuclear war.

However, many people are notoriously resistant to having their minds changed. The public reaction to the film Don’t Look Up is an example: many people continue to pay little attention to the risks of accelerating climate change, despite the powerful message of that film.

Especially when someone’s livelihood, or their sense of identity or tribal affiliation, is tied up with a particular ideological commitment, they are frequently highly resistant to changing their minds.

8. Changing mental dispositions around the world?

This idea might be the craziest on the entire list, but, to speak frankly, it seems we need to look for and embrace ideas which we would previously have dismissed as crazy.

The idea is to seek to change, not only people’s understanding of the facts of AI risk, but also their mental dispositions.

Rather than accepting the mix of anger, partisanship, pride, self-righteousness, egotism, vengefulness, deceitfulness, and so on, that we have inherited from our long evolutionary background, how about using special methods to transform our mental dispositions?

Methods are already known which can lead people into psychological transformation, embracing compassion, humility, kindness, appreciation, and so on. These methods include various drugs, supplements, meditative practices, and support from electronic and computer technologies.

Some of these methods have been discussed for millennia, whereas others have only recently become possible. The scientific understanding of these methods is still at an early stage, but it arguably deserves much more focus. Progress in recent years has been disappointingly slow at times (witness the unfounded hopes in this forward-looking article of mine from 2013), but that pattern is common for breakthroughs in technologies and therapies, where progress can move from disappointingly slow to shockingly fast.

The idea is that these transformational methods will improve the mental qualities of people all around the world, allowing us all to transcend our previous perverse habit of believing only the things that are appealing to our psychological weaknesses. We’ll end up with better voters and (hence) better politicians – as well as better researchers, better business leaders, better filmmakers, and better developers and deployers of AI solutions.

It’s a tough ask, but it may well be the right ask at this crucial moment in cosmic history.

9. Belt and braces: monitoring and sanctions?

Relying on people around the world changing their mental outlooks for the better – and not backtracking or relapsing into former destructive tendencies – probably sounds like an outrageously naïve proposal.

Such an assessment would be correct – unless the proposal is paired with a system of monitoring and compliance.

Knowing that they are being monitored can be a useful aid to encouraging people to behave better.

That encouragement will be strengthened by the knowledge that non-compliance will result in an escalating series of economic sanctions, enforced by a growing alliance of nations.

For further discussion of the feasibility of systems of monitoring and compliance, see scenario 4, “The narrow corridor: Striking and keeping the right balance”, in my article “Four scenarios for the transition to AGI”.

10. A better understanding of what needs to be changed?

One complication in this whole field is that the risks of AI cannot be managed in isolation from other dangerous trends. We’re not just living in a time of growing crisis; we’re living in what has been called a “polycrisis”:

Cascading and connected crises… a cluster of related global risks with compounding effects, such that the overall impact exceeds the sum of each part.

For one analysis of the overlapping set of what I have called “landmines”, see this video.

From one point of view, this insight complicates the whole situation with AI catastrophic risk.

But it is also possible that the insight could lead to a clearer understanding of a “critical choke point” where, if suitable pressure is applied, the whole network of cascading risks is made safer.

This requires a different kind of thinking: systems thinking.

And it will also require us to develop better analysis tools to map and understand the overall system.

These tools would be a form of AI. Created with care (so that their output can be verified and then trusted), such tools would make a vital difference to our ability to identify the right choke point(s) and to apply suitable pressure.

These choke points may turn out to be ideas already covered above: a sustained new educational programme, coupled with an initiative to assist all of us to become more compassionate. Or perhaps something else will turn out to be more critical.

We won’t know, until we have done the analysis more carefully.

28 February 2024

Notes from BGI24: Reducing risks of AI catastrophe

Filed under: AGI, risks — David Wood @ 1:12 pm

The final session on the first full day of BGI24 (yesterday) involved a number of round table discussions described as “Interactive Working Groups”.

The one in which I participated looked at possibilities to reduce the risks of AI inducing a catastrophe – where “catastrophe” means the death (or worse!) of large portions of the human population.

Around twelve of us took part, in what was frequently an intense but good-humoured conversation.

The risks we sought to find ways to reduce included:

  • AI taking and executing decisions contrary to human wellbeing
  • AI being directed by humans who have malign motivations
  • AI causing a catastrophe as a result of an internal defect

Not one of the participants in this conversation thought there was any straightforward way to guarantee the permanent reduction of such risks.

We each raised possible approaches (sometimes as thought experiments rather than serious proposals), but in every case, others in the group pointed out fundamental shortcomings of these approaches.

By the end of the session, when the BGI24 organisers suggested the round table conversations should close and people ought to relocate for a drinks reception one floor below, the mood in our group was pretty despondent.

Nevertheless, we agreed that the search should continue for a clearer understanding of possible solutions.

That search is likely to resume as part of the unconference portion of the summit later this week.

A solution – if one exists – is likely to involve a number of different mechanisms, rather than just a single action. These different mechanisms may incorporate refinements of some of the ideas we discussed at our round table.

In the hope that some readers of this blogpost will be able to propose refinements worth investigating, I list below some of what I remember from our round table conversation.

(The following has been written in some haste. I apologise for typos, misunderstandings, and language that is unnecessarily complicated.)

1. Gaining time via restricting access to compute hardware

Some of the AI catastrophic risks can be delayed if it is made more difficult for development teams around the world to access the hardware resources needed to train next generation AI models. For example, teams might be required to obtain a special licence before being able to purchase large quantities of cutting edge hardware.

However, as time passes, it will become easier for such teams to gain access to the hardware resources required to create powerful new generations of AI. That’s because:

  1. New designs or algorithms will likely allow powerful AI to be created using less hardware than is currently required
  2. Hardware with the requisite power is likely to become increasingly easy to manufacture (as a consequence of, for example, Moore’s Law).

In other words, this approach may reduce the risks of AI catastrophe over the next few years, but it cannot be a comprehensive solution for the longer term.

(But the time gained ought in principle to provide a larger breathing space to devise and explore other possible solutions.)

2. Avoiding an AI having agency

An AI that lacks agency, but is instead just a passive tool, may have less inclination to take and execute actions contrary to human intent.

That may be an argument for researching topics such as AI consciousness and AI volition, so that any AIs created can be kept as purely passive tools.

(Note that such AIs might plausibly still display remarkable creativity and independence of thought, so they would still provide many of the benefits anticipated for advanced AIs.)

Another idea is to avoid the AI having the kind of persistent memory that might lead to the AI gaining a sense of personal identity worth protecting.

However, it is trivially easy for someone to convert a passive AI into a larger system that demonstrates agency.

That could involve two AIs joined together; or (more simply) a human that uses an AI as a tool to achieve their own goals.

Another issue with this approach is that an AI designed to be passive might manifest agency as an unexpected emergent property. That’s because of two areas in which our understanding is currently far from complete:

  1. The way in which agency arises in biological brains
  2. The way in which deep neural networks reach their conclusions.

3. Verify AI recommendations before allowing them to act in the real world

This idea is a variant of the previous one. Rather than an AI issuing its recommendations as direct actions on the external world, the AI is operated entirely within an isolated virtual environment.

In this idea, the operation of the AI is carefully studied – ideally taking advantage of analytical tools that identify key aspects of the AI’s internal models – so that the safety of its recommendations can be ascertained. Only at that point are these recommendations actually put into practice.
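
As a purely illustrative aside, here is a minimal sketch of the gating workflow this idea implies. The function names (propose, passes_safety_review, apply_to_real_world) are hypothetical placeholders chosen for this sketch; it shows the shape of a “verify before acting” loop, not any existing system.

```python
# Minimal sketch of a "verify before acting" gate for AI recommendations.
# The callables passed in (propose, passes_safety_review, apply_to_real_world)
# are hypothetical placeholders, not an existing API.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Recommendation:
    description: str   # human-readable summary of the proposed action
    rationale: str     # the AI's explanation, available to reviewers and tools

def gated_deployment(
    propose: Callable[[], List[Recommendation]],
    passes_safety_review: Callable[[Recommendation], bool],
    apply_to_real_world: Callable[[Recommendation], None],
) -> None:
    """Run the AI only in isolation; act on a recommendation only after review."""
    for rec in propose():                 # generated inside the isolated environment
        if passes_safety_review(rec):     # human judgement plus analytical tooling
            apply_to_real_world(rec)      # the only path by which anything takes effect
        else:
            print(f"Rejected: {rec.description}")
```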

However, even if we understand how an AI has obtained its results, it can remain unclear whether these results will turn out to aid human flourishing, or instead have catastrophic consequences. Humans who are performing these checks may reach an incorrect conclusion. For example, they may not spot that the AI has made an error in a particular case.

Moreover, even if some AIs are operated in the above manner, other developers may create AIs which, instead, act directly on the real-world. They might believe they are gaining a speed advantage by doing so. In other words, this risk exists as soon as an AI is created outside of the proposed restrictions.

4. Rather than general AI, just develop narrow AI

Regarding risks of catastrophe from AI that arise from AIs reaching the level of AGI (Artificial General Intelligence) or beyond (“superintelligence”), how about restricting AI development to narrow intelligence?

After all, AIs with narrow intelligence can already provide remarkable benefits to humanity, such as the AlphaFold system of DeepMind which has transformed the study of protein interactions, and the AIs created by Insilico Medicine to speed up drug discovery and deployment.

However, AIs with narrow intelligence have already been involved in numerous instances of failure, leading to deaths of hundreds (or in some estimates, thousands) of people.

As narrow intelligence gains in power, it can be expected that the scale of associated disasters is likely to increase, even if the AI remains short of the status of AGI.

Moreover, it may happen that an AI that is expected to remain at the level of narrow intelligence unexpectedly makes the jump to AGI. After all, it remains a controversial question which kinds of changes would need to be made to a narrow AI to convert it into an AGI.

Finally, even if many AIs are restricted to the level of narrow intelligence, other developers may design and deploy AGIs. They might believe they are gaining a strong competitive advantage by doing so.

5. AIs should check with humans in all cases of uncertainty

This idea is due to Professor Stuart Russell. It is that AIs should always check with humans in any case where there is uncertainty whether humans would approve of an action.

That is, rather than an AI taking actions in pursuit of a pre-assigned goal, the AI has a fundamental drive to determine which actions will meet with human approval.
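
As a rough, hedged illustration of that drive, the following sketch defers to a human whenever the estimated probability of human approval falls below a threshold. The threshold value and all the function names are assumptions made for this illustration; they are not part of Russell’s proposal.

```python
# Sketch of an "ask when uncertain" action loop. The callables passed in
# (estimate_approval_probability, ask_human, execute) are hypothetical
# placeholders; the threshold is an assumed design parameter.

APPROVAL_THRESHOLD = 0.95

def choose_and_act(candidate_actions, estimate_approval_probability,
                   ask_human, execute):
    for action in candidate_actions:
        p_approve = estimate_approval_probability(action)
        if p_approve >= APPROVAL_THRESHOLD:
            execute(action)              # confident enough of human approval
        elif ask_human(action):          # uncertain: defer to a human first
            execute(action)
        # otherwise: take no action, which remains an always-available option
```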

However, an AI which needs to check with humans every time it has reached a conclusion will be unable to operate in real-time. The speed at which it operates will be determined by how closely humans are paying attention. Other developers will likely seek to gain a competitive advantage by reducing the number of times humans are asked to provide feedback.

Moreover, different human observers may provide the AI with different feedback. Psychopathic human observers may steer such an AI toward outcomes that are catastrophic for large portions of the population.

6. Protect critical civilisational infrastructure

Rather than applying checks over the output of an AI, how about applying checks on input to any vulnerable parts of our civilisational infrastructure? These include the control systems for nuclear weapons, manufacturing facilities that could generate biological pathogens, and so on.

This idea – championed by Steve Omohundro and Max Tegmark – seeks to solve the problem of “what if someone creates an AI outside of the allowed design?” In this idea, the design and implementation of the AI does not matter. That’s because access to critical civilisational infrastructure is protected against any unsafe access.

(Significantly, these checks protect that infrastructure against flawed human access as well as against flawed AI access.)

The protection relies on tamperproof hardware running secure trusted algorithms that demand to see a proof of the safety of an action before that action is permitted.
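
To show the shape of the idea (and only that), here is a toy sketch in which an actuator refuses any request that does not arrive with a machine-checkable safety proof. The check_safety_proof function stands in for the formally verified proof checker envisaged by the proposal; it is a hypothetical placeholder.

```python
# Toy sketch of proof-gated access to critical infrastructure.
# check_safety_proof stands in for a formally verified proof checker running
# on tamperproof hardware; it is a hypothetical placeholder, not a real library.

class AccessDenied(Exception):
    pass

def guarded_actuator(action_request, safety_proof, check_safety_proof, actuate):
    """Permit an action on protected infrastructure only when it is accompanied
    by a proof that it satisfies the agreed safety policy."""
    if safety_proof is None or not check_safety_proof(action_request, safety_proof):
        raise AccessDenied("No valid safety proof supplied; action blocked.")
    actuate(action_request)  # the same gate applies to human and AI requesters
```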

It’s an interesting research proposal!

However, the idea relies on us humans being able to identify in advance all the ways in which an AI (with or without assistance and prompting from a flawed human) could cause a catastrophe. An AI that is more intelligent than us is likely to find new such methods.

For example, we could put blocks on all existing factories where dangerous biopathogens could be manufactured. But an AI could design and create a new way to create such a pathogen, involving materials and processes that were previously (wrongly) considered to be inherently safe.

7. Take prompt action when dangerous actions are detected

The way we guard against catastrophic actions initiated by humans can be broken down as follows:

  1. Make a map of all significant threats and vulnerabilities
  2. Prioritise these vulnerabilities according to perceived likelihood and impact
  3. Design monitoring processes regarding these vulnerabilities (sometimes called “canary signals”)
  4. Take prompt action in any case when imminent danger is detected.

How about applying the same method to potential damage involving AI?
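
Purely as an illustration of how that same four-step method might be organised for AI-related threats, here is a small sketch; the data structure and the canary checks are hypothetical examples, not a real inventory of risks.

```python
# Illustrative sketch of the four-step method applied to AI-related threats:
# map the threats, prioritise them by likelihood and impact, attach canary
# signals, and respond promptly when a canary fires. All entries are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Threat:
    name: str
    likelihood: float                       # estimated, in the range 0..1
    impact: float                           # estimated severity, 0..1
    canaries: List[Callable[[], bool]] = field(default_factory=list)

    @property
    def priority(self) -> float:
        return self.likelihood * self.impact

def monitor(threats: List[Threat], respond: Callable[[Threat], None]) -> None:
    # Check the highest-priority threats first; act as soon as a canary fires.
    for threat in sorted(threats, key=lambda t: t.priority, reverse=True):
        if any(canary() for canary in threat.canaries):
            respond(threat)
```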

However, AIs may be much more powerful and elusive than even the most dangerous of humans. Taking “prompt action” against such an AI may be outside of our capabilities.

Moreover, an AI may deliberately disguise its motivations, deceiving humans (as some Large Language Models have already done), until it is too late for humans to take appropriate protective action.

(This is sometimes called the “treacherous turn” scenario.)

Finally, as with the previous idea, the process is vulnerable to failure if we humans fail to anticipate all the ways in which an AI might decide to act that would have catastrophically harmful consequences for humans.

8. Anticipate mutual support

The next idea takes a different kind of approach. Rather than seeking to control an AI that is much smarter and more powerful than us, won’t it simply be sufficient to anticipate that these AIs will find some value or benefit from keeping us around?

This is like humans who enjoy having pet dogs, despite these dogs not being as intelligent as us.

For example, AIs might find us funny or quaint in important ways. Or they may need us to handle tasks that they cannot do by themselves.

However, AIs that are truly more capable than humans in every cognitive aspect will be able, if they wish, to create simulations of human-like creatures that are even funnier and quainter than us, but without our current negative aspects.

As for AIs still needing some support from humans for tasks they cannot currently accomplish by themselves, such need is likely to be at best a temporary phase, as AIs quickly self-improve far beyond our levels.

It would be like ants expecting humans to take care of them, since the ants expect we will value their wonderful “antness”. It’s true: humans may decide to keep a small number of ants in existence, for various reasons, but most humans would give little thought to actions that had positive outcomes overall for humans (such as building a new fun theme park) at the cost of extinguishing all the ants in that area.

9. Anticipate benign neglect

Given that humans won’t have any features that will be critically important to the wellbeing of future AIs, how about instead anticipating a “benign neglect” from these AIs?

It would be like the conclusion of the movie Her, in which (spoiler alert!) the AIs depart somewhere else in the universe, leaving humans to continue to exist without interacting with them.

After all, the universe is a huge place, with plenty of opportunity for humans and AIs each to expand their spheres of occupation, without getting in each other’s way.

However, AIs may well find the Earth to be a particularly attractive location from which to base their operations. And they may perceive humans to be a latent threat to them, because:

  1. Humans might try, in the future, to pull the plug on (particular classes of) AIs, terminating all of them
  2. Humans might create a new type of AI, that would wipe out the first type of AI.

To guard against the possibility of such actions by humans, the AIs are likely to impose (at the very least) significant constraints on human actions.

Actually, that might not be so bad an outcome. However, what’s just been described is by no means an assured outcome. AIs may soon develop entirely alien ethical frameworks which have no compunction about destroying all humans. For example, AIs may be able to operate more effectively, for their own purposes, if the atmosphere of the earth is radically transformed, similar to the transformation in the deep past from an atmosphere dominated by methane to one containing large quantities of oxygen.

In short, this solution relies in effect on rolling a die, with unknown odds for the different outcomes.

10. Maximal surveillance

Many of the above ideas fail because of the possibility of rogue actors designing or operating AIs outside of what has otherwise been agreed to be safe parameters.

So, how about stepping up worldwide surveillance mechanisms, to detect any such rogue activity?

That’s similar to how careful monitoring already takes place on the spread of materials that could be used to create nuclear weapons. The difference, however, is that there are (or may soon be) many more ways to create powerful AIs than to create catastrophically powerful nuclear weapons. So the level of surveillance would need to be much more pervasive.

That would involve considerable intrusions on everyone’s personal privacy. However, that’s an outcome that may be regarded as “less terrible” than AIs being able to inflict catastrophic harm on humanity.

However, what would be needed, in such a system, would be more than just surveillance. The idea also requires the ability for the world as a whole to take decisive action against any rogue action that has been observed.

However, this may appear to require a draconian world government, which many critics would regard as being just as terrible as the threat of AI failure that it is supposed to be addressing.

On account of (understandable) aversion to the threat of a draconian government, many people will reject this whole idea. It’s too intrusive, they will say. And, by the way, due to governmental incompetence, it’s likely to fail even on its own objectives.

11. Encourage an awareness of personal self-interest

Another way to try to rein back the activities of so-called rogue actors – including the leaders of hostile states, terrorist organisations, and psychotic billionaires – is to appeal to their enlightened self-interest.

We may reason with them: you are trying to gain some advantage from developing or deploying particular kinds of AI. But here are reasons why such an AI might get out of your control, and take actions that you will subsequently regret. Like killing you and everyone you love.

This is not an appeal to these actors to stop being rogues, for the sake of humanity or universal values or whatever. It’s an appeal to their own more basic needs and desires.

There’s no point in creating an AI that will result in you becoming fabulously wealthy, we will argue, if you are killed shortly after becoming so wealthy.

However, this depends on all these rogues observing at least some level of rational thinking. On the contrary, some rogues appear to be batsh*t crazy. Sure, they may say, there’s a risk of the world being destroyed. But that’s a risk they’re willing to take. They somehow believe in their own invincibility.

12. Hope for a profound near-miss disaster

If rational arguments aren’t enough to refocus everyone’s thinking, perhaps what’s needed is a near-miss catastrophic disaster.

Just as Fukushima and Chernobyl changed public perceptions (arguably in the wrong direction – though that’s an argument for another day) about the wisdom of nuclear power stations, a similar crisis involving AI might cause the public to waken up and demand more decisive action.

Consider AI versions of the 9/11 atrocity, the Union Carbide Bhopal explosion, the BP Deepwater Horizon disaster, the NASA Challenger and Columbia shuttle tragedies, a global pandemic resulting (perhaps) from a lab leak, and the mushroom clouds over Hiroshima and Nagasaki.

That should waken people up, and put us all into an appropriate “crisis mentality”, so that we set aside distractions, right?

However, humans have funny ways of responding to near-miss disasters. “We are a lucky species” may be one retort – “see, we are still here”. Another issue is that a demand for “something to be done” could have all kinds of bad consequences in its own right, if no good measures have already been thought through and prepared.

Finally, if we somehow hope for a bad mini-disaster, to rouse public engagement, we might find that the mini-disaster expands far beyond the scale we had in mind. The scale of the disaster could be worldwide. And that would be the end of that. Oops.

That’s why a fictional (but credible) depiction of a catastrophe is far preferable to any actual catastrophe. Consider, as perhaps the best example, the remarkable 1983 movie The Day After.

13. Using AI to generate potential new ideas

One final idea is that narrow AI may well help us explore this space of ideas in ways that are more productive.

It’s true that we will need to be on our guard against any deceptive narrow AIs that are motivated to deceive us into adopting a “solution” that has intrinsic flaws. But if we restrict the use of narrow AIs in this project to ones whose operation we are confident that we fully understand, that risk is mitigated.

However – actually, there is no however in this case! Except that we humans need to be sure that we will apply our own intense critical analysis to any proposals arising from such an exercise.

Endnote: future politics

I anticipated some of the above discussion in a blogpost I wrote in October, Unblocking the AI safety conversation logjam.

In that article, I described the key component that I believe is necessary to reduce the global risks of AI-induced catastrophe: a growing awareness and understanding of the positive transformational possibility of “future politics” (I have previously used the term “superdemocracy” for the same concept).

Let me know what you think about it!

And for further discussion of the spectrum of options we can and should consider, start here, and keep following the links into deeper analysis.

26 February 2024

How careful do AI safety red teams need to be?

In my quest to catalyse a more productive conversation about the future of AI, I’m keen to raise “transcendent questions” – questions that can help all of us to rise above the familiar beaten track of the positions we reflexively support and the positions we reflexively oppose.

I described “transcendent questions” in a previous article of mine in Mindplex Magazine:

Transcendent Questions On The Future Of AI: New Starting Points For Breaking The Logjam Of AI Tribal Thinking

These questions are potential starting points for meaningful non-tribal open discussions. These questions have the ability to trigger a suspension of ideology.

Well, now that I’ve arrived at the BGI24 conference, I’d like to share another potential transcendent question. It’s on the subject of limits on what AI safety red teams can do.

The following chain of thought was stimulated by me reading in Roman Yampolskiy’s new book “AI: Unexplainable, Unpredictable, Uncontrollable” about AI that is “malevolent by design”.

My first thought on coming across that phrase was, surely everyone will agree that the creation of “malevolent by design” AI is a bad idea. But then I realised that – as is so often the case in contemplating the future of advanced AI – things may be more complicated. And that’s where red teams come into the picture.

Here’s a definition of a “red team” from Wikipedia:

A red team is a group that pretends to be an enemy, attempts a physical or digital intrusion against an organization at the direction of that organization, then reports back so that the organization can improve their defenses. Red teams work for the organization or are hired by the organization. Their work is legal, but can surprise some employees who may not know that red teaming is occurring, or who may be deceived by the red team.

The idea is well-known. In my days in the mobile computing industry at Psion and Symbian, ad hoc or informal red teams often operated, to try to find flaws in our products before these products were released into the hands of partners and customers.

Google have written about their own “AI Red Team: the ethical hackers making AI safer”:

Google Red Team consists of a team of hackers that simulate a variety of adversaries, ranging from nation states and well-known Advanced Persistent Threat (APT) groups to hacktivists, individual criminals or even malicious insiders. The term came from the military, and described activities where a designated team would play an adversarial role (the “Red Team”) against the “home” team.

As Google point out, a red team is more effective if it takes advantage of knowledge about potential security issues and attack vectors:

Over the past decade, we’ve evolved our approach to translate the concept of red teaming to the latest innovations in technology, including AI. The AI Red Team is closely aligned with traditional red teams, but also has the necessary AI subject matter expertise to carry out complex technical attacks on AI systems. To ensure that they are simulating realistic adversary activities, our team leverages the latest insights from world class Google Threat Intelligence teams like Mandiant and the Threat Analysis Group (TAG), content abuse red teaming in Trust & Safety, and research into the latest attacks from Google DeepMind.

Here’s my first question – a gentle warm-up question. Do people agree that companies and organisations that develop advanced AI systems should use something like red teams to test their own products before they are released?

But the next question is the one I wish to highlight. What limits (if any) should be put on what a red team can do?

The concern is that a piece of test malware may in some cases turn out to be more dangerous than the red team foresaw.

For example, rather than just probing the limits of an isolated AI system in a pre-release environment, could test malware inadvertently tunnel its way out of its supposed bounding perimeter, and cause havoc more widely?

Oops. We didn’t intend our test malware to be that clever.

If that sounds hypothetical, consider the analogous question about gain-of-function research with biological pathogens. In that research, pathogens are given extra capabilities, in order to assess whether potential counter-measures could be applied quickly enough if a similar pathogen were to arise naturally. However, what if these specially engineered test pathogens somehow leak from laboratory isolation into the wider world? Understandably, that possibility has received considerable attention. Indeed, as Wikipedia reports, the United States imposed a three-year long moratorium on gain-of-function research from 2014 to 2017:

From 2014 to 2017, the White House Office of Science and Technology Policy and the Department of Health and Human Services instituted a gain-of-function research moratorium and funding pause on any dual-use research into specific pandemic-potential pathogens (influenza, MERS, and SARS) while the regulatory environment and review process were reconsidered and overhauled. Under the moratorium, any laboratory who conducted such research would put their future funding (for any project, not just the indicated pathogens) in jeopardy. The NIH has said 18 studies were affected by the moratorium.

The moratorium was a response to laboratory biosecurity incidents that occurred in 2014, including not properly inactivating anthrax samples, the discovery of unlogged smallpox samples, and injecting a chicken with the wrong strain of influenza. These incidents were not related to gain-of-function research. One of the goals of the moratorium was to reduce the handling of dangerous pathogens by all laboratories until safety procedures were evaluated and improved.

Subsequently, symposia and expert panels were convened by the National Science Advisory Board for Biosecurity (NSABB) and National Research Council (NRC). In May 2016, the NSABB published “Recommendations for the Evaluation and Oversight of Proposed Gain-of-Function Research”. On 9 January 2017, the HHS published the “Recommended Policy Guidance for Departmental Development of Review Mechanisms for Potential Pandemic Pathogen Care and Oversight” (P3CO). This report sets out how “pandemic potential pathogens” should be regulated, funded, stored, and researched to minimize threats to public health and safety.

On 19 December 2017, the NIH lifted the moratorium because gain-of-function research was deemed “important in helping us identify, understand, and develop strategies and effective countermeasures against rapidly evolving pathogens that pose a threat to public health.”

As for potential accidental leaks of biological pathogens engineered with extra capabilities, so also for potential accidental leaks of AI malware engineered with extra capabilities. In both cases, unforeseen circumstances could lead to these extra capabilities running amok in the wider world.

Especially in the case of AI systems which are already only incompletely understood, and where new properties appear to emerge in new circumstances, who can be sure what outcomes may arise?

One counter is that the red teams will surely be careful in the policing of the perimeters they set up to confine their tests. But can we be sure they have thought through every possibility? Or maybe a simple careless press of the wrong button – a mistyped parameter or an incomplete prompt – would temporarily open a hole in the perimeter. The test AI malware would be jail-broken, and would now be real-world AI malware – potentially evading all attempts to track it and shut it down.

Oops.

My final question (for now) is: if it is agreed that constraints should be applied on how red teams operate, how will these constraints be overseen?

Postscript – for some additional scenarios involving the future of AI safety, take a look at my article “Cautionary Tales And A Ray Of Hope”.
