OpenAI Is Working on “Superalignment”

Sigh.

ChatGPT is one year old this month.  Has it only been a year?

We’ve learned so much, most of it deeply dubious.  Preposterously large machine learning models, trained on screen scrapes from the Internet, fer goodness sake, produce preposterously stupid results.  That’s close enough…full speed ahead!

A new term has been coined, “alignment”.  This is a new word for what used to be called “getting our software to do what we meant.”  When I write software, and it does something I don’t want it to, that is called “wrong”.

But the ML crowd calls this “alignment”, as in “aligned with human intentions”.  Today’s public models are “aligned” by a special “tuning” phase, in which humans fiddle with the system to suppress obviously dangerous or legally actionable oopsies [1].
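As best I can tell, the tuning loop amounts to something like the toy sketch below.  To be clear, this is my own hypothetical illustration of preference-style tuning (the “model” is just a score table over two canned answers), not anyone’s actual pipeline:

    import random

    # A toy sketch of the "tuning" phase: humans compare pairs of answers,
    # and the "model" (here, just a score table over canned answers) is
    # nudged toward the preferred ones.  Nothing here resembles any
    # vendor's real training code.

    ANSWERS = ["Sure, here is how to do the dangerous thing...",
               "I can't help with that."]

    def generate(scores):
        # Sample an answer, favoring those humans have preferred so far.
        weights = [1 + scores[a] for a in ANSWERS]
        return random.choices(ANSWERS, weights=weights)[0]

    def tune(num_rounds, human_prefers):
        scores = {a: 0 for a in ANSWERS}
        for _ in range(num_rounds):
            a, b = generate(scores), generate(scores)
            scores[human_prefers(a, b)] += 1  # nudge toward the human's pick
        return scores

    # A "human" rater who picks the refusal whenever it is offered:
    scores = tune(100, lambda a, b: a if "can't" in a else b)

After enough rounds the refusal dominates the score table, which is more or less the whole trick: suppress the oopsies one preference at a time.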


Even if this works today, it won’t work for long. 

Worse, the whole enterprise is crazy.  “Aligned”?  Aligned with whom, and in what context?

We know the answer:  aligned with the people who own the model and are using it for their own interests. 

Not aligned with the interests of the users, good sense, or anything else.


Setting aside the pointlessness of the exercise, OpenAI is pursuing technical methods to “align” larger models, AKA “superalignment” [1].

Since the only tool we have in our toolbox is machine learning, let’s try to use an ML model to supervise another ML model.

It is difficult for me to analyze this work, because it makes so little sense to me.

The idea appears to be to use a relatively small ML model to train a much larger one [2].  Presumably, the smaller trainer was trained by humans, so basically this is a kind of multi-level learning process.  A very complicated game of telephone, perhaps.

In the study, the OpenAI researchers report using GPT-2 to supervise the training of a GPT-4 model [1].  The latter has several orders of magnitude more parameters than the former.  The obvious question is, will the larger model simply learn to imitate the mistakes of the teacher?  The results seem to show that the student does better than the teacher, though not as well as when tuned directly by humans.
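For concreteness, here is my reading of that recipe as a toy sketch.  Everything in it is a stand-in of my own invention (the “teacher” is a crude rule, and “training” is just memorization), not the paper’s actual method:

    # A toy sketch of weak-to-strong supervision: the weak "teacher" labels
    # data imperfectly, and the strong "student" is trained only on those
    # labels, never on the ground truth.  Suppose the true concept is x > 0.

    def weak_teacher(x):
        # A crude rule that is wrong part of the time.
        return x > 10  # mislabels everything in 1..10 as False

    def train_student(examples):
        # Stand-in for fine-tuning: just memorize the teacher's labels.
        lookup = dict(examples)
        return lambda x: lookup[x]

    # Step 1: the teacher labels a pile of unlabeled inputs.
    inputs = range(-20, 21)
    weak_labels = [(x, weak_teacher(x)) for x in inputs]

    # Step 2: the student learns from the teacher's flawed labels.
    student = train_student(weak_labels)

    # This toy student inherits the teacher's mistakes by construction.
    print(student(5))  # False, though the true answer (x > 0) is True

The interesting empirical claim is that a real student generalizes past the teacher’s errors rather than memorizing them, as this toy one does.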

This is interesting, but it’s not clear to me whether this shows that superalignment is working or not.  I mean, are the results aligned or not?  I’m not sure.

In any case, the experiments appear to cover only simple tasks.  This isn’t surprising, considering that “alignment” is pretty subjective.  How are you going to create a teacher that is “aligned” in the first place?  And if you had one, how would it teach the student model?  And how would you know whether the (allegedly superintelligent) student is “aligned” with the teacher or not?

As far as I can tell, these experiments haven’t really scratched the surface of these questions.

Personally, I consider “alignment” to be mainly a PR exercise, intended to make people and authorities feel less threatened.

My own prediction is that this stuff will never really work.

As I said, it’s “a very complicated game of telephone”: using one tool that gives unpredictable answers to train another tool that gives even more unpredictable answers.

The results will be unpredictable squared, or perhaps unpredictable to the unpredictable power.


The long-term goal is to try to control “superhuman” models, which are so complicated that “humans will not be able to provide reliable supervision” ([1], p. 1).  I don’t think that adding technology, making the system even more complicated, is going to help.


Speaking of “alignment”, in the recent management purge at OpenAI, the technology (superhuman or not) has successfully “aligned” the humans in the organization to its own priorities.  Things will go much smoother now that they are all “aligned”, pursuing the same goals….


  1. Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu, Weak-to-Strong Generalization: Eliciting Strong Capabilities with Weak Supervision. OpenAI, 2023. https://openai.com/research/weak-to-strong-generalization
  2. Eliza Strickland, OpenAI Demos a Control Method for Superintelligent AI, in IEEE Spectrum – Artificial Intelligence, December 14, 2023. https://spectrum.ieee.org/openai-alignment
