KHAN. “So control super important for the next two years and we’re optimistic we can do it. Sound about right?” (51:07)
BENTON. “Oh. Hope so.”
Anthropic researchers Ethan Perez, Joe Benton, and Akbir Khan discuss AI control, an approach to managing the risks of advanced AI systems. They cover real-world evaluations showing how humans struggle to detect deceptive AI, the three major threat models researchers are working to mitigate, and the overall idea of controlling highly capable AI systems whose goals may differ from our own.

0:00 Introduction
0:33 What is AI control?
2:56 Control evaluations in practice
5:39 Results from evaluations
7:27 Monitoring protocols
13:18 How control differs from alignment
16:09 The challenge of alignment faking
23:10 Ensuring evaluations work for future models
26:09 Open questions in control research
34:15 Lessons learned from control
37:14 Why work on control now?
43:26 Key threat models
48:35 Optimistic signs
