FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT.

The Dilemma of Safe AI Computing and AI Alignment?

At an unknown and unpredictable future moment, failure is catastrophic. Deadly. Extinction of Humanity.

If you want to increase your success rate, double your failure rate. – Thomas J. Watson, Chairman and CEO of IBM 1914–1956

The road to wisdom? Well, it’s plain and simple to express: Err and err and err again but less and less and less. – Piet Hein, Danish polymath

This is a very lethal problem, it has to be solved one way or another… and failing on the first really dangerous try is fatal. — Eliezer Yudkowsky, MIRI

“Unaligned AGI could pose substantial risks to humanity and solving the AGI alignment problem could be so difficult that it will require all of humanity to work together.” – OpenAI, Our approach to AI Alignment

62 INSTITUTIONS and ORGs…

INSTITUTIONS and ORGs Supported [$56,170,000] by the Survival and Flourishing Fund (SFF)

  • AI Safety Support Ltd – Equivalency Determination
  • Alignment Research Center
  • Alliance to Feed the Earth in Disasters (ALLFED)
  • Association for the Advancement of Artificial Intelligence
  • Basis Research Institute
  • Berkeley Existential Risk Initiative
  • Cambridge in America
  • Carnegie Mellon University (CMU)
  • Center for AI Safety, Inc.
  • Center for Applied Rationality
  • Center For Effective Altruism
  • Center for Innovative Governance (d/b/a Charter Cities Institute)
  • Center for Mindful Learning
  • Center for Strategic and International Studies
  • Center on Long-Term Risk
  • Centre for Effective Altruism
  • Centre for Enabling EA Learning & Research
  • Chancellor, Masters and Scholars of the University of Cambridge
  • Children, Families, and Communities
  • Constructive Dialogue Institute Inc.
  • Convergence Analysis
  • Earth Law Center
  • Effective Altruism Foundation, Inc.
  • Effective Ventures Foundation
  • Effektiv Altruisme Norge
  • European Biostasis Foundation
  • FAR AI, Inc.
  • Foresight Institute
  • Founders for Good
  • fp21
  • Future of Humanity Foundation
  • Future of Life Institute
  • Generation Pledge, Inc.
  • Hansjorg Wyss Institute For Biologically Inspired Engineering
  • Idea Foundry
  • Institute for Advanced Consciousness Studies (IACS)
  • Johns Hopkins University
  • Legal Priorities Inc.
  • Leverage Research
  • Longevity Research Institute
  • Machine Intelligence Research Institute
  • Manifold for Charity
  • Median Foundation
  • Median Group
  • Mercatus Center Inc
  • Moonlight Institute
  • New Science Research, Inc.
  • Open Collective Foundation
  • Ought Inc.
  • PARPA, Inc.
  • Players Philanthropy Fund (PPF)
  • Pragmatist Foundation
  • Quantified Uncertainty Research Institute
  • RadicalxChange Foundation Ltd.
  • Redwood Research Group Inc.
  • Rethink Charity
  • Rethink Priorities
  • Ronin Institute for Independent Scholarship Incorporated
  • SaferAI
  • SFF DAF
  • Social and Environmental Entrepreneurs (SEE)
  • Social Good Fund
  • Stanford University
  • The Benjamin Franklin Society Library Inc.
  • The Center for Election Science
  • The Collective Intelligence Project
  • The Future Society
  • The Goodly Institute
  • The Mercatus Center
  • The University of Chicago
  • Topos Institute
  • UC Berkeley Foundation
  • Unite America Institute Inc.
  • University of Louisville Foundation, Inc.
  • University of Oxford
  • University of Wisconsin Foundation
  • Whylome, Inc

INSTITUTIONS and ORGs Supporting the AI Alliance

  • Agency for Science, Technology and Research (A*STAR)
  • Aitomatic
  • AMD
  • Anyscale
  • Cerebras
  • CERN
  • Cleveland Clinic
  • Cornell University
  • Dartmouth
  • Dell Technologies
  • Ecole Polytechnique Federale de Lausanne
  • ETH Zurich
  • Fast.ai
  • Fenrir, Inc.
  • FPT Software
  • Hebrew University of Jerusalem
  • Hugging Face
  • IBM
  • Abdus Salam International Centre for Theoretical Physics (ICTP)
  • Imperial College London
  • Indian Institute of Technology Bombay
  • Institute for Computer Science, Artificial Intelligence and Technology (INSAIT)
  • Intel
  • Keio University
  • LangChain
  • LlamaIndex
  • Linux Foundation
  • Mass Open Cloud Alliance, operated by Boston University and Harvard
  • Meta
  • Mohamed bin Zayed University of Artificial Intelligence
  • MLCommons
  • National Aeronautics and Space Administration
  • National Science Foundation
  • New York University
  • NumFOCUS
  • OpenTeams
  • Oracle
  • Partnership on AI
  • Quansight
  • Red Hat
  • Rensselaer Polytechnic Institute
  • Roadzen
  • Sakana AI
  • SB Intuitions
  • ServiceNow
  • Silo AI
  • Simons Foundation
  • Sony Group
  • Stability AI
  • Together AI
  • TU Munich
  • UC Berkeley College of Computing, Data Science, and Society
  • University of Illinois Urbana-Champaign
  • The University of Notre Dame
  • The University of Texas at Austin
  • The University of Tokyo
  • Yale University

Members of AI Safety Institute Consortium (AISIC) by NIST (as of 07 February 2024)

  • Accel AI Institute
  • Accenture LLP
  • Adobe
  • Advanced Micro Devices (AMD)
  • AFL-CIO Technology Institute (Provisional Member)
  • AI Risk and Vulnerability Alliance
  • AI & Data (part of the Linux Foundation)
  • AIandYou
  • Allen Institute for Artificial Intelligence
  • Alliance for Artificial Intelligence in Healthcare
  • Altana
  • Alteryx
  • Amazon.com
  • American University, Kogod School of Business
  • AmpSight
  • Anika Systems Incorporated
  • Anthropic
  • Apollo Research
  • Apple
  • Ardent Management Consulting
  • Aspect Labs
  • Atlanta University Center Consortium
  • Autodesk, Inc.
  • BABL AI Inc.
  • Backpack Healthcare
  • Bank of America
  • Bank Policy Institute
  • Baylor College of Medicine
  • Beck’s Superior Hybrids
  • Benefits Data Trust
  • Booz Allen Hamilton
  • Boston Scientific
  • BP
  • BSA | The Software Alliance
  • BSI Group America
  • Canva
  • Capitol Technology University
  • Carnegie Mellon University
  • Casepoint
  • Center for a New American Security
  • Center For AI Safety
  • Center for Security and Emerging Technology (Georgetown University)
  • Center for Democracy and Technology
  • Centers for Medicare & Medicaid Services
  • Centre for the Governance of AI
  • Cisco Systems
  • Citadel AI
  • Citigroup
  • CivAI
  • Civic Hacker LLC
  • Cleveland Clinic
  • Coalition for Content Provenance and Authenticity (part of the Linux Foundation)
  • Coalition for Health AI (CHAI) (Provisional Member)
  • Cohere
  • Common Crawl Foundation
  • Cornell University
  • Cranium AI
  • Credo AI
  • CrowdStrike
  • Cyber Risk Institute
  • Dark Wolf Solutions
  • Data & Society Research Institute
  • Databricks
  • Dataiku
  • DataRobot
  • Deere & Company
  • Deloitte
  • Beckman Coulter
  • Digimarc
  • DLA Piper
  • Drexel University
  • Drummond Group
  • Duke University
  • The Carl G Grefenstette Center for Ethics at Duquesne University
  • EBG Advisors
  • EDM Council
  • Eightfold AI
  • Elder Research
  • Electronic Privacy Information Center
  • Elicit
  • EleutherAI Institute
  • Emory University
  • Enveil
  • EqualAI
  • Erika Britt Consulting
  • Ernst & Young, LLP
  • Exponent
  • FAIR Institute
  • FAR AI
  • Federation of American Scientists
  • FISTA
  • ForHumanity
  • Fortanix, Inc.
  • Free Software Foundation
  • Frontier Model Forum
  • Financial Services Information Sharing and Analysis Center (FS-ISAC)
  • Future of Privacy Forum
  • Gate Way Solutions
  • George Mason University
  • Georgia Tech Research Institute
  • GitHub
  • Gladstone AI
  • Google
  • Gryphon Scientific
  • Guidepost Solutions
  • Hewlett Packard Enterprise
  • Hispanic Tech and Telecommunications Partnership (HTTP)
  • Hitachi Vantara Federal
  • HireVue (Provisional Member)
  • Hugging Face
  • Human Factors and Ergonomics Society
  • Humane Intelligence
  • Hypergame AI
  • IBM
  • Imbue
  • Indiana University
  • Inflection AI
  • Information Technology Industry Council
  • Institute for Defense Analyses
  • Institute for Progress
  • Institute of Electrical and Electronics Engineers, Incorporated (IEEE)
  • Institute of International Finance
  • Intel Corporation
  • Intertrust Technologies
  • Iowa State University, Translational AI Center (TrAC)
  • JPMorgan Chase
  • Johns Hopkins University
  • Kaiser Permanente
  • Keysight Technologies
  • Kitware, Inc.
  • Knexus Research
  • KPMG
  • LA Tech4Good
  • Leadership Conference Education Fund, Center for Civil Rights and Technology
  • Leela AI
  • Lucid Privacy Group
  • Lumenova AI
  • Magnit Global Solutions
  • Manatt, Phelps & Phillips
  • MarkovML
  • Massachusetts Institute of Technology, Lincoln Laboratory
  • Mastercard
  • Meta
  • Microsoft
  • MLCommons
  • Model Evaluation and Threat Research (METR, formerly ARC Evals)
  • Modulate
  • MongoDB
  • National Fair Housing Alliance
  • National Retail Federation
  • New York Public Library
  • New York University
  • NewsGuard Technologies
  • Northrop Grumman
  • NVIDIA
  • ObjectSecurity LLC
  • Ohio State University
  • O’Neil Risk Consulting & Algorithmic Auditing, Inc. (ORCAA)
  • OpenAI
  • OpenPolicy
  • Open Source Security Foundation (part of the Linux Foundation)
  • OWASP (AI Exchange & Top 10 for LLM Apps)
  • University of Oklahoma, Data Institute for Societal Challenges (DISC)
  • University of Oklahoma, NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES)
  • Palantir
  • Partnership on AI (PAI)
  • Pfizer
  • Preamble
  • PwC
  • Princeton University
  • Purdue University, Governance and Responsible AI Lab (GRAIL)
  • Qualcomm Incorporated
  • Queer in AI
  • RAND Corporation
  • Redwood Research Group
  • Regions Bank
  • Responsible AI Institute
  • Robust Intelligence
  • RTI International
  • SaferAI
  • Salesforce
  • SAS Institute
  • SandboxAQ
  • Scale AI
  • Science Applications International Corporation
  • Scripps College
  • SecureBio
  • Society of Actuaries Research Institute
  • Software & Information Industry Association
  • Software Package Data Exchange (part of the Linux Foundation)
  • SonarSource
  • SRI International
  • Stability AI (Provisional Member)
  • stackArmor
  • Stanford Institute for Human-Centered AI, Stanford Center for Research on Foundation Models, Stanford Regulation, Evaluation, and Governance Lab
  • State of California, Department of Technology
  • State of Kansas, Office of Information Technology Services
  • StateRAMP
  • Subtextive
  • Syracuse University
  • Taraaz
  • Tenstorrent USA
  • Texas A&M University
  • Thomson Reuters (Provisional Member)
  • Touchstone Evaluations
  • Trustible
  • TrueLaw
  • Trufo
  • UnidosUS
  • UL Research Institutes
  • University at Albany, SUNY Research Foundation
  • University at Buffalo, Institute for Artificial Intelligence and Data Science
  • University at Buffalo, Center for Embodied Autonomy and Robotics
  • University of Texas at San Antonio (UTSA)
  • University of Maryland, College Park
  • University Of Notre Dame Du Lac
  • University of Pittsburgh
  • University of South Carolina, AI Institute
  • University of Southern California
  • U.S. Bank National Association
  • Vanguard
  • Vectice
  • Visa
  • Wells Fargo & Company
  • Wichita State University, National Institute for Aviation Research
  • William Marsh Rice University
  • Wintrust Financial Corporation
  • Workday

IASEAI Affiliates

Africa AI Forum
AI & Democracy Foundation (AIDF)
AI Alignment Network (ALIGN)
AI Forensics
AI Governance Limited
AI Objectives Institute (AOI)
AI Risk and Vulnerability Alliance (AVID)
AI Safety Asia (AISA)
AI Safety Connect (AISC)
AI Safety Initiative at Georgia Tech (AISI)
AI Standards Lab
AI Transparency Institute (AITI)
AI Whistleblower Initiative (AIWI)
AI4ALL
AIAAIC
Algorithmic Alignment Group, MIT
All Tech Is Human
Allen Institute for AI (Ai2)
Association for Long Term Existence and Resilience (ALTER)
Association for the Advancement of Artificial Intelligence (AAAI)
Association Française Contre l’Intelligence Artificielle (AFCIA)
Beeck Center for Social Impact and Innovation, Georgetown
Berkman Klein Center for Internet & Society, Harvard University
Best Practice AI
Brazilian Network Information Center (NIC.br)
Center for AI and Digital Policy (CAIDP)
Center for AI Safety (CAIS)
Center for Applied Artificial Intelligence (CAAI), University of Chicago Booth School of Business
Center for Equitable AI and Machine Learning Systems (CEAMLS), Morgan State University
Center for European Research in Trusted AI (CERTAIN)
Center for Humane Technology
Center for Long-Term Cybersecurity (CLTC), UC Berkeley
Center for Reasoning, Normativity and AI (CERNAI)
Center for Technological Responsibility Reimagination and Redesign (CNTR), Brown University
Centre for Collective Intelligence, Nesta
Centre for Future Generations (CFG)
Centre for Responsible AI (CeRAI), IIT Madras
Centre pour la Sécurité de l’IA (CeSIA)
CheckIT learning
COAI Research
Code for Africa (CfA)
Cognizant AI Lab
Collective Intelligence Project (CIP)
Concordia AI
Connected by Data
Conseil de l’IA et du numérique
Convergence Analysis
Cooperative AI Foundation (CAIF)
CyberPeace Institute
Data & Society
Data and Web Science Lab (Datalab), Aristotle University
Data Friendly Space
Data Privacy Brasil
DataKind
Datasphere Initiative
Digital Rights Foundation (DRF)
Electronic Privacy Information Center (EPIC)
EngageMedia
Equiano Institute
European Network for AI Safety (ENAIS)
European Trustworthy AI Association
FAR.AI
Forum on Information and Democracy (FID)
Fraunhofer Institute for Cognitive Systems IKS
German Research Center for Artificial Intelligence (DFKI)
Global Center on AI Governance (GCG)
Global Partners Digital (GPD)
Globethics
Governance and Responsible AI Lab (GRAIL), Purdue University
Gradient Institute
GuardRailNow
Hacks/Hackers
Halcyon Futures
Human Line Project
Human Technology Institute (HTI), University of Technology Sydney
Humane Intelligence
ICT4Peace Foundation
Impact Academy
Institute for AI International Governance of Tsinghua University (I-AIIG)
Institute for Information Sciences (I2S), University of Kansas
Institute for Research on Internet and Society (IRIS)
Institute for Security and Technology (IST)
Institute for Technology and Society (ITS)
Instituto de Tecnologia e Sociedade (ITS Rio)
International Research Centre on Artificial Intelligence (IRCAI)
Israeli Association for Ethics in Artificial Intelligence (IAEAI)
Johns Hopkins Center for Health Security (CHS)
Kids N Clicks
Knight First Amendment Institute, Columbia University
Krueger AI Safety Lab (KASL)
Leverhulme Centre for the Future of Intelligence (CFI), University of Cambridge
Machine Intelligence and Normative Theory Lab (MINT lab), Australian National University
Machine Intelligence Research Institute
Marshall Neely Center for Ethical Leadership and Decision Making, USC
MATS Research
Meedan
Miami Law & AI Lab (MiLA)
Mila – Quebec AI Institute
Model Evaluation & Threat Research (METR)
Montreal International Center of Expertise in Artificial Intelligence (CEIMIA)
OpenMined Foundation
Oxford China Policy Lab (OCPL)
Oxford Martin AI Governance Initiative (AIGI)
Pause IA
Plurality Institute
Pour Demain
Pranava Institute
RadicalxChange Foundation
Redwood Research
Renaissance Numérique
Responsible AI Collaborative
Responsible Intelligence
SAFE AI Forever Inc.
Safe AI Forum (SAIF)
SAFE AI Laboratory, University of Pavia
Safe Robotics Laboratory, Princeton
SaferAI
School of Cybernetics, Australian National University
Schwartz Reisman Institute for Technology and Society (SRI), University of Toronto
Seismic Foundation
Sentio University AI Research
Shanghai Qi Zhi Institute
SHIELD
Sustainable AI Lab, University of Bonn Institute for Science and Ethics (IWE)
Tech Legality
TechEquity
The Future Society (TFS)
UL Research Institutes
Vector Institute
Windfall Trust
Women in Safety and Ethics (WISE)
Xavier University
Youth for Privacy

Introduction to Pragmatic AI Safety (Thomas Woodside and Dan Hendrycks)

The Pragmatic AI Safety Sequence:

  • In this sequence, we will describe a pragmatic approach for reducing existential risk from AI.
  • In the second post, which will be released alongside this post, we will present a bird’s eye view of the machine learning field. Where is ML research published? What is the relative size of different subfields? How can you evaluate the credibility or predictive power of ML professors and PhD students? Why are evaluation metrics important? What is creative destruction? We will also discuss historical progress in different subfields within ML and paths and timelines towards AGI.
  • The third post will provide a background on complex systems and how complex-systems thinking can be applied both to influencing the AI research field and to researching deep learning. (Edit: the original third post has been split into what will now be the third and fourth posts.)
  • The fourth post will cover problems with certain types of asymptotic reasoning and introduce the concept of capabilities externalities.
  • The fifth post will serve as a supplement to Unsolved Problems in ML Safety. Unlike that paper, we will explicitly discuss the existential risk motivations behind each of the areas we advocate.
  • The sixth and final post will focus on tips for how to conduct good research and navigate the research landscape.
  • A supplement to this sequence is X-Risk Analysis for AI Research.

Mechanistic Interpretability

“Keeping AI under control through mechanistic interpretability” Speaker: Prof. Max Tegmark (MIT)

MIT Department of Physics: The Impact of ChatGPT talks (2023)

Provably Safe AGI

Steve Omohundro's presentation at the MIT Mechanistic Interpretability Conference 2023

Learn More: steveomohundro.com

The slides are available here

ALIGNMENT MAP. The Future of Life Institute (2023)

FLI Value Alignment Research Landscape: Security, Control, Foundations, Governance, Ethics, Verification, Validation.

The project of creating value-aligned AI is perhaps one of the most important things we will ever do. However, there are open and often neglected questions regarding what exactly is entailed by ‘beneficial AI.’ Value alignment is the project of one day creating such beneficial AI, and this landscape has been expanded beyond value alignment’s usual technical context to reflect and model the problem’s truly interdisciplinary nature. For value-aligned AI to become a reality, we need not only to solve intelligence, but also to determine the ends to which intelligence is aimed and the social/political context, rules, and policies in and through which this all happens. This landscape synthesizes a variety of AI safety research agendas along with other papers in AI, machine learning, ethics, governance, and AI safety, robustness, and beneficence research. It lays out what technical research threads can help us to create beneficial AI, and describes how these many topics tie together.

INTELLIGENCE EXPLOSION (1965) and NOTE ON CONFINEMENT (1973)

Speculations Concerning the First Ultraintelligent Machine – I.J. Good

“Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion’, and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control.” I.J. Good

A Note on the Confinement Problem – Butler Lampson

‘…the problem of confining a program during its execution so that it cannot transmit information to any other program except its caller. …We want to be able to confine an arbitrary program…. any program, if confined, will be unable to leak data. A misbehaving program may well be trapped as a result of an attempt to escape’

To address the Confinement Problem, Lampson introduced the Laws of Confinement:
1) Total isolation: A confined program shall make no calls on any other program.
2) Transitivity: If a confined program calls another program which is not trusted, the called program must also be confined.
3) Masking: A program to be confined must allow its caller to determine all its inputs into legitimate and covert channels.
4) Enforcement: The supervisor must ensure that a confined program’s input to covert channels conforms to the caller’s specifications.
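
As an informal, hypothetical sketch (not Lampson's formalism), the code below shows how a supervisor might approximate total isolation (Law 1), caller-determined inputs (Law 3), and enforcement (Law 4) in a modern setting: an untrusted function runs in a separate worker process whose only channel back to the caller is an explicit result queue, and misbehaviour (exceptions, overruns) is trapped. The helper names (run_confined, _confined_worker) are invented, and real confinement would also have to close covert channels such as timing and resource usage, which this toy example does not attempt.

```python
# Informal sketch only: approximates "total isolation" and "enforcement" from
# Lampson's Laws of Confinement with a separate worker process. It does NOT
# close covert channels (timing, resource usage), so it is not real confinement.
import multiprocessing as mp
import queue


def _confined_worker(untrusted_fn, caller_inputs, result_queue):
    """Run the untrusted function with only the inputs the caller specified
    (Law 3) and report back over a single supervised channel (Law 1)."""
    try:
        result = ("ok", untrusted_fn(**caller_inputs))
    except Exception as exc:            # misbehaviour is trapped, not propagated raw
        result = ("error", repr(exc))
    result_queue.put(result)


def run_confined(untrusted_fn, caller_inputs, timeout_s=2.0):
    """Supervisor (Law 4): start the confined program, enforce a time budget,
    and expose only the explicit result channel to the caller."""
    result_queue = mp.Queue(maxsize=1)
    proc = mp.Process(target=_confined_worker,
                      args=(untrusted_fn, caller_inputs, result_queue))
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():                 # attempt to outlive the budget: terminated
        proc.terminate()
        proc.join()
        return ("error", "confined program exceeded its time budget")
    try:
        return result_queue.get(timeout=0.1)
    except queue.Empty:
        return ("error", "confined program produced no result")


def untrusted(x, y):
    """Stand-in for an arbitrary program the caller wants to confine."""
    return x * y


if __name__ == "__main__":
    print(run_confined(untrusted, {"x": 6, "y": 7}))   # ('ok', 42)
```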

OpenAI (2023)

Our Approach to AI Alignment

We are improving our AI systems’ ability to learn from human feedback and to assist humans at evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems.

Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned. Using scientific experiments, we study how alignment techniques scale and where they will break.

We tackle alignment problems both in our most capable AI systems as well as alignment problems that we expect to encounter on our path to AGI. Our main goal is to push current alignment ideas as far as possible, and to understand and document precisely how they can succeed or why they will fail. We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.

Unaligned AGI could pose substantial risks to humanity and solving the AGI alignment problem could be so difficult that it will require all of humanity to work together. Therefore we are committed to openly sharing our alignment research when it’s safe to do so: We want to be transparent about how well our alignment techniques actually work in practice and we want every AGI developer to use the world’s best alignment techniques.

At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent. It has three main pillars:

  1. Training AI systems using human feedback
  2. Training AI systems to assist human evaluation
  3. Training AI systems to do alignment research
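
OpenAI's post contains no code, so the block below is only a hypothetical toy of the first pillar: fitting a reward model to pairwise human preference labels with a Bradley-Terry style objective, the learned reward then serving as the training signal for a policy. Every name here (preference_loss, the synthetic feature vectors, the simulated labels) is invented for illustration and does not reflect OpenAI's implementation.

```python
# Hypothetical toy of pillar 1 (learning a training signal from human feedback):
# fit a linear reward model r(x) = w . phi(x) to pairwise preference labels via
# the Bradley-Terry objective P(A preferred over B) = sigmoid(r(A) - r(B)).
import numpy as np

rng = np.random.default_rng(0)

dim = 8
true_w = rng.normal(size=dim)            # the (unknown) human preference direction
phi_a = rng.normal(size=(500, dim))      # stand-in embeddings of candidate responses A
phi_b = rng.normal(size=(500, dim))      # stand-in embeddings of candidate responses B
labels = (phi_a @ true_w > phi_b @ true_w).astype(float)   # simulated human choices


def preference_loss(w, fa, fb, y):
    """Negative log-likelihood of the observed pairwise preferences."""
    margin = (fa - fb) @ w
    p_a = 1.0 / (1.0 + np.exp(-margin))
    eps = 1e-9
    return -np.mean(y * np.log(p_a + eps) + (1.0 - y) * np.log(1.0 - p_a + eps))


# Plain gradient descent on the reward-model parameters.
w = np.zeros(dim)
learning_rate = 0.5
for _ in range(200):
    margin = (phi_a - phi_b) @ w
    p_a = 1.0 / (1.0 + np.exp(-margin))
    grad = ((p_a - labels)[:, None] * (phi_a - phi_b)).mean(axis=0)
    w -= learning_rate * grad

agreement = np.mean((phi_a @ w > phi_b @ w).astype(float) == labels)
print(f"preference loss: {preference_loss(w, phi_a, phi_b, labels):.4f}")
print(f"agreement with simulated human choices: {agreement:.3f}")
```

In full RLHF the reward model would be a neural network over (prompt, response) pairs and its reward would then be optimized against with a policy-gradient method such as PPO; this sketch keeps only the preference-learning step.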

Aligning AI systems with human values also poses a range of other significant sociotechnical challenges, such as deciding to whom these systems should be aligned. Solving these problems is important to achieving our mission, but we do not discuss them in this post.

ON CONTAINMENT, ALIGNMENT, EXISTENTIAL RISK, BILL OF RIGHTS, REGULATIONS…

“Language models have become more capable and more broadly deployed, but our understanding of how they work internally is still very limited. For example, it might be difficult to detect from their outputs whether they use biased heuristics or engage in deception. Interpretability research aims to uncover additional information by looking inside the model.” – OpenAI, Language models can explain neurons in language models
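
As a hypothetical illustration of what "looking inside the model" can mean in practice, the sketch below attaches a forward hook to a hidden layer of a toy model and records which individual neurons activate on a given input. The model and names (save_activations and the rest) are invented; real mechanistic interpretability work targets transformer language models at far larger scale.

```python
# Hypothetical sketch of the basic interpretability move the quote describes:
# instead of judging a model only by its outputs, record what individual hidden
# "neurons" do on a given input. The toy model below is invented for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy "language model": token embedding -> hidden MLP layer -> vocabulary logits.
vocab_size, d_model, d_hidden = 100, 16, 32
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, vocab_size),
)

captured = {}

def save_activations(module, inputs, output):
    """Forward hook: stash the post-ReLU neuron activations for later inspection."""
    captured["hidden"] = output.detach()

# "Look inside" by hooking the hidden nonlinearity rather than the final output.
hook = model[2].register_forward_hook(save_activations)

tokens = torch.tensor([[3, 17, 42, 7]])           # a toy input sequence
logits = model(tokens)                            # ordinary forward pass
hook.remove()

acts = captured["hidden"][0]                      # shape: (sequence_length, d_hidden)
top_vals, top_idx = acts.max(dim=0).values.topk(5)
print("five most active hidden neurons:", top_idx.tolist())
print("their peak activations:", [round(v, 3) for v in top_vals.tolist()])
```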

Learn more at Future of Life Institute

Featured posts on Artificial Intelligence