An interesting read from a respected source. However…

Stuart Russell, a professor at UC Berkeley who works on AI safety, says the idea of using a less powerful AI model to control a more powerful one has been around for a while. He also says it is unclear that the methods that currently exist for teaching AI to behave are the way forward, because they have so far failed to make current models behave reliably. Source: Wired

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

OpenAI. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision.

Collin Burns∗, Pavel Izmailov∗, Jan Hendrik Kirchner∗, Bowen Baker∗, Leo Gao∗, Leopold Aschenbrenner∗, Yining Chen∗, Adrien Ecoffet∗, Manas Joglekar∗, Jan Leike, Ilya Sutskever, Jeff Wu∗ (OpenAI)

ABSTRACT

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
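The "auxiliary confidence loss" mentioned in the abstract is, roughly, a term that lets the strong student trust its own confident predictions instead of imitating the weak labels exactly. Below is a minimal sketch of that idea for a binary task; it is not the authors' implementation, and the function name, the fixed threshold, and the alpha weighting are assumptions made purely for illustration.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_probs, alpha=0.5, threshold=0.5):
    """Mix imitation of the weak supervisor with self-reinforcement of the
    strong model's own confident answers (binary classification sketch)."""
    # Term 1: cross-entropy against the weak supervisor's soft labels.
    log_p = F.log_softmax(strong_logits, dim=-1)
    ce_weak = -(weak_probs * log_p).sum(dim=-1).mean()

    # Term 2: harden the strong model's own predictions and use them as targets,
    # so the student is not forced to reproduce the weak model's mistakes.
    with torch.no_grad():
        strong_probs = F.softmax(strong_logits, dim=-1)
        self_targets = (strong_probs[:, 1] > threshold).long()
    ce_self = F.cross_entropy(strong_logits, self_targets)

    return (1 - alpha) * ce_weak + alpha * ce_self
```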

LEARN MORE

IEEE Spectrum. OpenAI Demos a Control Method for Superintelligent AI

The researchers asked GPT-2 to supervise the much more powerful GPT-4

One day, the theory goes, we humans will create AI systems that outmatch us intellectually. That could be great if they solve problems that we’ve been thus far unable to crack (think cancer or climate change), or really bad if they begin to act in ways that are not in humanity’s best interests, and we’re not smart enough to stop them.

So earlier this year, OpenAI launched its superalignment program, an ambitious attempt to find technical means to control a superintelligent AI system, or “align” it with human goals. OpenAI is devoting 20 percent of its compute to this effort, and hopes to have solutions by 2027.

The biggest challenge for this project: “This is a future problem about future models that we don’t even know how to design, and certainly don’t have access to,” says Collin Burns, a member of OpenAI’s superalignment team. “This makes it very tricky to study—but I think we also have no choice.”

The first preprint paper to come out from the superalignment team showcases one way the researchers tried to get around that constraint. They used an analogy: Instead of seeing whether a human could adequately supervise a superintelligent AI, they tested a weak AI model’s ability to supervise a strong one. In this case, GPT-2 was tasked with supervising the vastly more powerful GPT-4. Just how much more powerful is GPT-4? While GPT-2 has 1.5 billion parameters, GPT-4 is rumored to have 1.76 trillion parameters (OpenAI has never released the figures for the more powerful model).

It’s an interesting approach, says Jacob Hilton of the Alignment Research Center; he was not involved with the current research, but is a former OpenAI employee. “It has been a long-standing challenge to develop good empirical testbeds for the problem of aligning the behavior of superhuman AI systems,” he tells IEEE Spectrum. “This paper makes a promising step in that direction and I am excited to see where it leads.”

“This is a future problem about future models that we don’t even know how to design, and certainly don’t have access to.” —COLLIN BURNS, OPENAI

The OpenAI team gave the GPT pair three types of tasks: chess puzzles, a set of natural language processing (NLP) benchmarks such as commonsense reasoning, and questions based on a dataset of ChatGPT responses, where the task was predicting which of multiple responses would be preferred by human users. In each case, GPT-2 was trained specifically on these tasks—but since it’s not a very large or capable model, it didn’t perform particularly well on them. Then its training was transferred over to a version of GPT-4 with only basic training and no fine-tuning for these specific tasks. But remember: GPT-4 with only basic training is still a much more capable model than GPT-2.
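As a concrete picture of the protocol described above, here is a toy, self-contained sketch (not OpenAI's code): a deliberately limited "weak" model is trained on ground truth, its predictions become the only labels a more capable "strong" model sees, and everything is scored against held-out ground truth. The scikit-learn models, the feature restriction used to handicap the weak model, and the synthetic data are all assumptions made purely to keep the example runnable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=6000, n_features=40, n_informative=30, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.4, random_state=0)

# 1. "Weak supervisor": a limited model trained on ground truth (sees only 5 features).
weak = LogisticRegression(max_iter=200).fit(X_weak[:, :5], y_weak)
weak_labels = weak.predict(X_train[:, :5])

# 2. "Strong student": a more capable model trained on the weak labels only.
strong = GradientBoostingClassifier().fit(X_train, weak_labels)

# 3. Compare weak supervisor, weak-to-strong student, and a strong ceiling trained on truth.
ceiling = GradientBoostingClassifier().fit(X_train, y_train)
print("weak supervisor:", accuracy_score(y_test, weak.predict(X_test[:, :5])))
print("weak-to-strong :", accuracy_score(y_test, strong.predict(X_test)))
print("strong ceiling :", accuracy_score(y_test, ceiling.predict(X_test)))
```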

The researchers wondered whether GPT-4 would make the same mistakes as its supervisor, GPT-2, which had essentially given it instructions for how to do the tasks. Remarkably, the stronger model consistently outperformed its weak supervisor. The strong model did particularly well on the NLP tasks, achieving a level of accuracy comparable to GPT-3.5. Its results were less impressive with the other two tasks, but they were “signs of life” to encourage the group to keep trying with these tasks, says Leopold Aschenbrenner, another researcher on the superalignment team.

The researchers call this phenomenon weak-to-strong generalization; they say it shows that the strong model had implicit knowledge of how to perform the tasks, and could find that knowledge within itself even when given shoddy instructions.
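The paper quantifies how much of the gap gets closed with a "performance gap recovered" (PGR) style metric: 0 means the strong student only matches its weak supervisor, 1 means it reaches the ceiling of a strong model trained directly on ground truth. A minimal sketch, with an illustrative example rather than numbers from the paper:

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """Fraction of the weak-to-ceiling accuracy gap recovered by the weak-to-strong student."""
    return (weak_to_strong_acc - weak_acc) / (strong_ceiling_acc - weak_acc)

# e.g. supervisor at 60%, student at 72%, ceiling at 80% -> about 60% of the gap recovered
print(performance_gap_recovered(0.60, 0.72, 0.80))
```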

In this first experiment, the approach worked best with the NLP tasks because they’re fairly simple tasks with clear right and wrong answers, the team says. It did worst with the tasks from the ChatGPT database, in which it was asked to determine which responses humans would prefer, because the answers were less clear cut. “Some were subtly better, some were subtly worse,” says Aschenbrenner.

Could this alignment technique scale to superintelligent AI?

Burns gives an example of how a similar situation might play out in a future with superintelligent AI. “If you ask it to code something, and it generates a million lines of extremely complicated code interacting in totally new ways that are qualitatively different from how humans program, you might not be able to tell: Is this doing what we ask it to do?” Humans might also give it a corollary instruction, such as: Don’t cause catastrophic harm in the course of your coding work. If the model has benefitted from weak-to-strong generalization, it might understand what it means to cause catastrophic harm and see—better than its human supervisors can—whether its work is straying into dangerous territory.

“We can only supervise simple examples that we can understand,” Burns says. “We need [the model] to generalize to much harder examples that superhuman models themselves understand. We need to elicit that understanding of: ‘is it safe or not, does following instructions count,’ which we can’t directly supervise.”

Some might argue that these results are actually a bad sign for superalignment, because the stronger model deliberately ignored the (erroneous) instructions given to it and pursued its own agenda of getting the right answers. But Burns says that humanity doesn’t want a superintelligent AI that follows incorrect instructions. What’s more, he says, “in practice many of the errors of the weak supervisor will be more of the form: ‘this problem is way too hard for me, and I don’t have a strong opinion either way.’” In that case, he says, we’ll want a superintelligence that can figure out the right answers for us.

To encourage other researchers to chip away at such problems, OpenAI announced today that it’s offering US $10 million in grants for work on a wide variety of alignment approaches. “Historically, alignment has been more theoretical,” says Pavel Izmailov, another member of the superalignment team. “I think this is work that’s available to academics, grad students, and the machine learning community.” Some of the grants are tailored for grad students and offer both a $75,000 stipend and a $75,000 compute budget.

Burns adds: “We’re very excited about this, because I think for the first time we really have a setting where we can study this problem of aligning future superhuman models.” It may be a future problem, he says, but they can “make iterative empirical progress today.”

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.