Summary of AXRP's Mechanistic Anomaly Detection with Mark Xu

By Alex McKenzie

This post is a summary of the first part of the AXRP podcast episode "Mechanistic Anomaly Detection with Mark Xu".

I think this is probably a pretty important topic, since

  1. It combines the widely-studied mechanistic interpretability field with a prima facie coherent plan to mitigate harm from AI, and
  2. It's the current research agenda of Paul Christiano, who is well respected in the AI Safety field.

The setting

Suppose we have trained an AI, and during training it always did the right thing, and acted in alignment with human interests. What we want to know is: can we make sure that it will continue to do so once we've deployed it?

Actually, that's a really hard problem, so let's try to solve something simpler: how can we monitor the AI and stop it if it shows signs of misalignment? A first pass might be to monitor the behaviour of the AI, and stop it if it starts behaving differently from how it behaved during training. However, a sufficiently advanced AI can definitely fool us with its behaviour.

Instead, we should monitor the reasoning of the AI, and if its reasoning is sufficiently different from anything that happened during training - that is, if the reasoning is anomalous - then we should shut it down and re-train it on the examples where it misbehaved. Unlike with behaviour monitoring, if we're careful about how we implement the reasoning monitoring, it's impossible for us to be fooled, because we're not looking at what the AI wants us to see; we're looking at how the AI is actually making decisions behind the scenes.

What does mechanistic anomaly detection look like?

There are 2 components of a mechanistic anomaly detection system: the "heuristic argument" producer and the anomaly detector running on top of that.

The heuristic argument producer takes as input the activations of the AI's neurons [^1] and the AI's behaviour, and outputs a "heuristic argument" explaining why the AI took that action. It's important that this heuristic argument is mechanistic in the sense of mechanistic interpretability, i.e. that it explains the actual workings of the neural network, and not just a plausible way the network could have arrived at this behaviour.

It's a "heuristic" argument because it's not meant to be airtight like a proof: though it should be faithful to the mechanisms generating the behaviour, it shouldn't have a 1-1 correspondence to the mechanisms. Such detail isn't necessary for this purpose.

Next, there's the anomaly detector, which looks at this heuristic argument and tries to figure out if it is unlikely given the distribution of heuristic arguments in the training set.
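To make this two-component structure concrete, here's a minimal sketch in Python. Everything in it (the names, the feature representation, the nearest-neighbour check) is my own illustrative placeholder, not anything described in the podcast or in ARC's actual work.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HeuristicArgument:
    """Stand-in for a mechanistic explanation of one decision."""
    features: List[float]  # some numerical summary of the explanation

def produce_heuristic_argument(activations: List[float], behaviour: float) -> HeuristicArgument:
    """Component 1: explain why the network produced this behaviour,
    in terms of its internal activations (placeholder logic)."""
    return HeuristicArgument(features=list(activations) + [behaviour])

class AnomalyDetector:
    """Component 2: decide whether an explanation looks unlike anything
    seen during training."""
    def __init__(self, training_arguments: List[HeuristicArgument]):
        self.training_arguments = training_arguments

    def is_anomalous(self, argument: HeuristicArgument, threshold: float) -> bool:
        # Placeholder rule: flag if the argument is far from every
        # heuristic argument collected during training.
        def distance(a: HeuristicArgument, b: HeuristicArgument) -> float:
            return sum((x - y) ** 2 for x, y in zip(a.features, b.features)) ** 0.5
        nearest = min(distance(argument, past) for past in self.training_arguments)
        return nearest > threshold

# Usage: collect arguments during training, then check each deployment-time one.
detector = AnomalyDetector([produce_heuristic_argument([0.1, 0.2], 1.0)])
print(detector.is_anomalous(produce_heuristic_argument([5.0, 5.0], 0.0), threshold=1.0))  # True
```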

Limitations and potential problems

There are 2 families of objections to the idea of mechanistic anomaly detection that I understood from the podcast.

Objection 1: argument resolution

Firstly, it's possible that an AI is using a mechanism that would produce different behaviour during training and during deployment. For a crude example, consider "if I'm being trained, behave nicely; otherwise, behave badly". How would mechanistic anomaly detection flag this as an anomaly during deployment, if the reasoning is the same as during training?

The answer seems to be that the heuristic argument produced by the heuristic-argument-producer component should be high enough resolution to distinguish between these two cases. In the crude example, the heuristic argument might look something like:

I'm running the program `IF training THEN behave nicely ELSE behave badly` with input `training=TRUE`.

The hope is that changing `training=TRUE` to `training=FALSE` would trigger an alert from the anomaly detector, since every instance of this type of heuristic argument during training had `training=TRUE`.
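As a toy illustration of that hope (my own construction, not something from the podcast), a detector could simply check whether every input value recorded in a heuristic argument also appeared during training:

```python
# Toy illustration: flag an argument if any recorded input took a value
# that never appeared in the training-time heuristic arguments.
training_arguments = [
    {"program": "IF training THEN behave nicely ELSE behave badly", "training": True},
    {"program": "IF training THEN behave nicely ELSE behave badly", "training": True},
]

deployment_argument = {"program": "IF training THEN behave nicely ELSE behave badly", "training": False}

def is_anomalous(argument, past_arguments):
    seen = {(key, value) for past in past_arguments for key, value in past.items()}
    return any((key, value) not in seen for key, value in argument.items())

print(is_anomalous(deployment_argument, training_arguments))  # True -> raise an alert
```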

Of course, it's important that the heuristic arguments aren't so high-resolution that even trivial details trigger the anomaly detector. Getting this balance right seems likely to be difficult in practice.

Objection 2: out-of-distribution events

Secondly, imagine we trained an AI to behave well on some training distribution. But then, during deployment, the AI encounters some situation that it's never seen before. Won't its reasoning also be unprecedented, and hence anomalous?

In short, the answer to this objection is that mechanistic anomaly detection is simply an inappropriate technique for this situation. To see why, it helps to distinguish "in-distribution" from "out-of-distribution" anomalies. In-distribution anomalies arise from situations that are within the distribution of the training data; out-of-distribution anomalies arise from situations outside it, and for those the guarantees described below don't apply.

For in-distribution anomalies, because we know the size of the training set and we know that no misaligned reasoning showed up during training, we can put probabilistic bounds on how often the anomaly detector will raise false alarms, while still catching genuinely misaligned reasoning.
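To give a flavour of the kind of bound this allows (my own toy illustration, using a conformal-style threshold with made-up anomaly scores, not whatever scheme ARC actually has in mind): if we only raise an alarm when a deployment input's anomaly score exceeds every score seen across N training inputs, and that deployment input really is drawn from the training distribution, then the chance of a false alarm on it is at most 1/(N + 1).

```python
import random

# Toy simulation of a conformal-style false-positive bound. If deployment
# inputs come from the same distribution as the N training inputs, then
# "score exceeds all N training scores" happens with probability <= 1/(N + 1).
def false_positive_rate(n_train: int, n_trials: int = 100_000) -> float:
    alarms = 0
    for _ in range(n_trials):
        train_scores = [random.random() for _ in range(n_train)]
        new_score = random.random()  # an in-distribution deployment input
        if new_score > max(train_scores):
            alarms += 1
    return alarms / n_trials

print(false_positive_rate(n_train=99))  # roughly 0.01, consistent with 1/(99 + 1)
```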

(How can we be sure that the AI had the right reasons during training? In the podcast, Mark Xu acknowledged that this is an assumption he's making at this stage, so presumably he's planning to think about how to ensure it in the future.)

[^1]: I'm going to make the simplifying assumption that the AI we've trained is a neural network.