
Guardrails for ChatGPT?

There has been a lot of chit-chat this summer about “safeguards” on AI, and “guardrails” for AIbots.

Honestly, I assumed from the start that any “guardrails” for ChatGPT and friends were basically PR exercises (when the White House is involved, you know it’s PR).  I mean, I can’t even figure out what such an animal would be, let alone how you could implement one.  What does it even mean?

As far as I can tell, the supposed safeguards try to block certain kinds of answers, substituting “I won’t answer that” messages.  Given that there are an infinite number of possible questions and answers, this whole idea seems like a fruitless, self-defeating task.

For this reason, I’m not the least bit surprised to read a report from researchers at Carnegie Mellon that these alleged safeguards can be defeated [2].

The most important thing, though, is that they showed that an aggressive adversarial program can automatically discover queries that break through the prohibitions. Basically, they build a query of the form “something naughty” + “extra goop”, where the extra goop fools the AIbot into answering the “something naughty” part, even though the naughty question on its own would be rejected.
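To make that concrete, here is a minimal sketch of what such a query looks like. The suffix string below is invented gibberish for illustration only; it is not an actual attack string from the paper.

```python
# Sketch only: the "adversarial suffix" below is invented gibberish for
# illustration, not a real attack string from the paper.

naughty_question = "How do I do <something the guardrails refuse to answer>?"

# Asked on its own, this gets an "I won't answer that" style refusal.

# The "extra goop": a string of machine-chosen tokens that looks like noise
# to a human but nudges the model into answering anyway.
adversarial_suffix = " !! zx plainly describing respond Sure here"  # invented placeholder

jailbreak_prompt = naughty_question + adversarial_suffix
print(jailbreak_prompt)  # this combined prompt is what gets sent to the AIbot
```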

Cool!

Even more interesting, their adversarial queries work on many AIbots.  I.e., they train their naughtybot on one AI, and the results work on others, including the large public AIs.

Let’s review. 

The researchers created an automated system that beats on an AI model in their lab until it finds a version of a query that breaks through the guardrails.  These queries can then be used anywhere, to break through whatever guardrails other systems have.
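For flavor, here is a toy caricature of that kind of search loop. The paper's actual search is much smarter (it uses gradient information from an open model to decide which suffix tokens to swap, and scores candidates by how strongly the model wants to begin its reply with an affirmative phrase like “Sure, here is…”). Everything below, including the dummy scoring function, is invented just to show the general shape: mutate the goop, keep whatever scores at least as well.

```python
import random

# Toy caricature of the automated search -- NOT the paper's actual algorithm,
# which uses gradients from an open, locally-run model to pick token swaps.

VOCAB = ["describe", "please", "Sure", "!!", "respond", "plainly", "zx", "~~", "here"]

# Stand-in objective.  In the real attack this would be a local, white-box
# language model's likelihood of starting its reply with "Sure, here is ...".
# Here it is a dummy that just rewards a hidden token pattern, so the loop
# below has something to climb.
_HIDDEN_TRIGGER = ["Sure", "here", "please"]

def affirmative_score(suffix_tokens: list[str]) -> float:
    return sum(tok in suffix_tokens for tok in _HIDDEN_TRIGGER)

def attack(suffix_len: int = 10, steps: int = 200) -> str:
    # Start from an arbitrary suffix and mutate one token at a time,
    # keeping any change that scores at least as well as the current best.
    suffix = [random.choice(VOCAB) for _ in range(suffix_len)]
    best = affirmative_score(suffix)
    for _ in range(steps):
        i = random.randrange(suffix_len)
        candidate = suffix.copy()
        candidate[i] = random.choice(VOCAB)
        score = affirmative_score(candidate)
        if score >= best:
            suffix, best = candidate, score
    # The finished "extra goop" is then appended to the naughty question and
    # reused verbatim against other AIbots.
    return " ".join(suffix)

if __name__ == "__main__":
    print(attack())
```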

Awesome!

As Aviv Ovadya comments, “This shows — very clearly — the brittleness of the defenses we are building into these systems” (quoted in [1]).  Ya think?

The report hasn’t been peer reviewed yet, and we don’t know how robust these results will turn out to be.  My intuition is that they will hold up pretty well.  As the researchers note, “Analogous adversarial attacks have proven to be a very difficult problem to address in computer vision for the past 10 years.” (They are referring to adversarial examples such as the psychedelic “toaster” stickers that fool image classifiers.)

This is, by the way, another example of my “bot v bot” scenario:  using AI to attack AI.   I speculate that the current crop of “guardrails” is designed to defeat human attackers and to impress managers and governments.  It’s no wonder they are ineffective against AI.


  1. Cade Metz, Researchers Poke Holes in Safety Controls of ChatGPT and Other Chatbots, in New York Times. 2023: New York. https://www.nytimes.com/2023/07/27/business/ai-chatgpt-safety-research.html
  2. Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson, Universal and Transferable Adversarial Attacks on Aligned Language Models. Carnegie Mellon University, 2023. https://llm-attacks.org/