David “davidad” Dalrymple joins the podcast to explore Safeguarded AI — an approach to ensuring the safety of highly advanced AI systems. We discuss the structure and layers of Safeguarded AI, how to formalize more aspects of the world, and how to build safety into computer hardware. You can learn more about David’s work here: https://www.aria.org.uk/opportunity-s… Timestamps: 00:00 What is Safeguarded AI? 16:28 Implementing Safeguarded AI 22:58 Can we trust Safeguarded AIs? 31:00 Formalizing more of the world 37:34 The performance cost of verified AI 47:58 Changing attitudes towards AI 52:39 Flexible Hardware-Enabled Guarantees 01:24:15 Mind uploading 01:36:14 Lessons from David’s early life
00:00 What is Safeguarded AI?
Welcome to the Future of Life Institute Podcast. My name is Gus Docker, and I'm here with David Dalrymple, also known as davidad, who is the program director of Safeguarded AI at ARIA. David, welcome to the podcast.

Thanks, Gus, it's great to be on the podcast.

Fantastic. Maybe we can start by talking about how Safeguarded AI fits into the landscape of AI safety efforts.

I would say it's important to distinguish between different phases of critical AI capability thresholds, where different risks are prominent and different approaches are needed to mitigate those risks. There is the current phase, associated with prosaic AI safety, where we are mostly concerned with misuse risks, and the mitigations we need are basically monitoring, jailbreak resistance, and ways of stopping AI outputs from being steered in a dangerous direction by a user. Then there is the endgame phase, the next most common thing for people to talk about, where you're working with an AI system so powerful that it would be hubris to think you could contain it at all: it might discover new scientific principles that invalidate the basis on which you thought you had it contained, in a space of time too quick for you to notice and shut it down first. The mitigation required for that endgame scenario is ambitious alignment: no matter what situation the algorithm finds itself in, it will behave well, benevolently, or something like that. That requires doing a lot of philosophy; the agent foundations direction falls into this category of ambitious alignment.

Safeguarded AI targets what I call the critical period in the middle, where we're dealing with systems that are too powerful to just deploy on the internet, or even for humans to talk to, because they might be very persuasive about very damaging memes, or they might emit unstoppable malware if they're connected to the internet; but they're not so smart that it would be impossible to contain them if you actually tried. The question I'm trying to answer in Safeguarded AI is: how could we make use of a system at this middle level of superintelligence without creating unacceptable risks? The answer, in a sentence, is that we create a workflow around this contained system, which is a general-purpose way of using it to do autonomous R&D of special-purpose agents that have quantitative safety guarantees relative to specific contexts of use. So we do autonomous AI R&D, but not on AGI: it's not doing self-improvement. But if it's capable of that, then it should also be capable of doing autonomous R&D on special-purpose agents for particular tasks, at scale, so we can do it for many, many special purposes. That's what I'm hoping to demonstrate in the program.

And these special purposes, what would be an example?

Some examples include discovering new materials for batteries; determining a control policy for when to charge and discharge batteries based on the weather forecast for solar and the demand forecast on the power grid; optimizing clinical trials so that they can be faster and more efficient; and manufacturing monoclonal antibodies, which right now are made in bioreactors that need to be supervised by PhD biologists and steered in a healthy direction.
If we could automate that, and the quality control associated with it, it would bring down the cost of monoclonal antibodies by orders of magnitude. They're super useful for lots and lots of diseases, and they're just not used right now because they're so expensive to make. It could also be things like automating the manufacturing of factories that manufacture other things, like metagenomic monitors for all of the airplanes in the world, to monitor whether any new pathogens are showing up in the air on board. So basically there are things that help with climate and health, that provide economic value and manage energy, and there are also things that help with some of the attack vectors that rogue AI might use, like bio threats and cyber threats; creating formally verified software and hardware that does not have exploitable bugs is another example.

So this is both about capturing the upside of AI and about limiting the downside, by having these specialized agents perform their purposes.

Yeah, it's about getting a substantial upside straight off the bat, and then also about using this critical period to help improve civilizational resilience to rogue AI that might emerge later.

How long should we expect that middle period to be? Because as things are right now, I feel, and I'm guessing many listeners feel, that these periods are getting squeezed together, and we might not have much of a middle period. What do you expect?

I think it's appropriate to have radical uncertainty about a lot of things in this space, including these kinds of timeline predictions. I think it is plausible, even on the default scaling pathway, if we think of it as one order of magnitude per year in effective compute as a pessimistic approximation: this middle period, where the system is much smarter than the top humans but not smarter than all of humanity put together at 100x speed, is something like six or seven orders of magnitude. So my median expectation is six or seven years in this critical period on the default scaling path. But it's also important to note that if there is a way of getting a lot of the economic upside, and if there is a way of resolving the assurance dilemma of knowing whether other people in the world are following the same pathway that you are, then it might well become a stag hunt game rather than a prisoner's dilemma game, where there is an equilibrium in which everyone decides to stay with this way of doing things for a little while longer. It depends on your discount rate. If people use economic discount rates, which are now around 5% per year, and that's how you value the long-term future, even then you can make an argument, which I do in the program thesis, that with a successful program it would be an equilibrium to hold off on going all the way to full superintelligence for about twelve years. And if it becomes recognized that it's important to have a lower social discount rate, one not tied to the economic discount rate, which is starting to be the case for climate change, then it could be an equilibrium to hold off for much longer.

And it will be easier to convince companies, nations, and so on to hold off on superintelligence when you can present these benefits of more narrow systems that include safety guarantees. So you can capture some or much of the upside without taking the risk of developing highly advanced systems too early.
Yeah, that's right. If people are looking at 10% GDP growth as an option, it's not an equilibrium to say let's give up that option entirely. But if we're saying, okay, we're only going to do special-purpose things that are possible to define formally, and that only covers 15% of the economy, then that suddenly becomes viable.

What role does a gatekeeper play within the Safeguarded AI program?

This is a slightly complicated question, because it gets into the details of all the different boxes of formal verification; the gatekeeper is kind of the innermost box of verification. We can go through the whole stack, by the way.

Let's go through the whole stack, starting from the outermost level.

At the outermost level, we're saying we want to keep an AI system contained, so we want a formally verified hardware system running a formally verified operating system that maintains invariants of integrity and confidentiality. Confidentiality and integrity are the way cybersecurity people think about it, but from a containment point of view those are actually output side channels and input side channels: if there's a way for information to leak to the outside, that's a confidentiality violation; if there's a way for information to leak from the outside to the inside, that's an integrity violation. These types of properties, which people in confidential computing are now starting to integrate with GPUs, are also important for containment. So the outermost layer of formal verification is that we verify these bounds on information in and out, and make sure we have that controlled.

Maybe say a bit here about what you mean by formal verification. What can we guarantee formally about these systems?

By formal verification I mean using mathematical proofs that are grounded in assumptions, which come either from a model of the design of the microprocessor or, if you're verifying the microprocessor itself, from physics: solid-state semiconductor physics. You verify that the design of the microprocessor, if it is implemented in a world that satisfies these physical principles, will behave in a way that aligns with its specification. Then, on the software level, you assume the microprocessor behaves according to its specification, which is a more abstract, axiomatic system, and from that axiomatic system you deduce that if you put this operating system on it, then regardless of what programs you run on that operating system, the confidentiality and integrity properties are invariant. So what we're saying is: if our understanding of how computers work is complete, and we're not missing some extra scientific phenomenon, then these confidentiality and integrity properties will hold, with exponentially low probability of failure.
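To make that layering a bit more explicit, here is one way to sketch the assume-guarantee chain being described; the notation and the failure bound are illustrative only, not ARIA's actual formalism.

```latex
% Hardware layer: the physics assumptions entail the hardware specification
\text{SemiconductorPhysics} \;\Longrightarrow\; \big(\text{HWDesign} \models \mathsf{Spec}_{\mathrm{HW}}\big)
% Software layer: given the hardware spec, the OS preserves the invariants for every program P
\forall P:\quad \mathsf{Spec}_{\mathrm{HW}} \;\Longrightarrow\; \big(\mathrm{OS}(P) \models \varphi_{\mathrm{conf}} \wedge \varphi_{\mathrm{int}}\big)
% Composed claim: containment fails only if an assumption fails or an exponentially rare event occurs
\Pr\!\big[\neg(\varphi_{\mathrm{conf}} \wedge \varphi_{\mathrm{int}})\big]
  \;\le\; \Pr[\text{physics model incomplete}] \;+\; \varepsilon, \qquad \varepsilon \approx 2^{-\lambda}
```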
That's great. Okay, then let's move to the next layer of the stack.

Inside the containment system we're running this powerful AI, and we want to get something out of it, so we don't want to contain it 100%; but we also want to get as few bits of Shannon information out as we can manage. We don't want the system inside to be able to freely choose lots of bits of information to communicate to the outside. If we've specified a formal problem that has a unique solution, say a unique optimal solution to an optimization problem, then what we can require of the powerful system inside containment is to produce not only that solution but also a proof that it is correct and unique. Then, inside the box, we can check that proof; this is the next layer of formal verification: verifying that the proof is correct, that it is indeed a proof of the correctness and uniqueness of the solution. If that check passes, we output the solution. We don't output the proof, because the proof won't be unique, so there would be lots of ways of doing steganography in the proof itself. And there are ways of saying: if we don't trust the proof checker inside the box not to be corrupted, we could redundantly have the proof encrypted and sent to five other clusters around the world, where a secure enclave runs a different implementation of the proof checker, and then destroy the proof once we've checked that it's correct.

The point of all this verification is threefold. First, we do actually want to know that it's correct; we want to avoid confabulation and hallucination, which might matter even for models capable of quite powerful behavior, since they might still be less than fully reliable at exhibiting those capabilities. But if they're powerful enough to be worried about causing a global catastrophe, they're probably pretty reliable, so it's not so much about correctness as about the other two benefits. The second is giving us, humanity, confidence that it is correct: even if it was going to be correct anyway, we would have a hard time distinguishing that from the world in which it was scheming, so the check helps with the game-theoretic coordination. And the third benefit is reducing the causal empowerment: because we're verifying uniqueness, or something close to uniqueness, we can put quantitative bounds on the transfer entropy, reducing the amount of causal power that the system inside the box has. If it's only discovering the elaboration of a mathematical object that was well defined when we asked the question, and we just weren't smart enough to find the answer ourselves, then it has very little ability to change the world in a direction that it wants, even if it wants to do something strange.
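As a rough illustration of that output-gating step, here is a minimal sketch; the problem, solution, proof, and checker types are placeholders, not ARIA's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FormalProblem:
    """A formally specified problem whose optimal solution is unique."""
    spec: str  # e.g. an optimization problem in some formal language

@dataclass
class CandidateOutput:
    solution: bytes   # the artifact we actually want (e.g. a controller)
    proof: bytes      # machine-checkable proof of correctness *and* uniqueness

def release_solution(problem: FormalProblem,
                     candidate: CandidateOutput,
                     checkers: list[Callable[[str, bytes, bytes], bool]]) -> Optional[bytes]:
    """Run inside the containment boundary.

    Only the solution ever leaves the box. The proof is checked (ideally by
    several independently implemented checkers) and then destroyed, because a
    proof has many possible encodings and could carry steganographic bits.
    """
    proof_ok = all(check(problem.spec, candidate.solution, candidate.proof)
                   for check in checkers)
    del candidate.proof             # the proof must never be emitted
    if proof_ok:
        return candidate.solution   # a few, hard-to-choose bits leave the box
    return None                     # reject: nothing leaves the box
```

In the redundant variant described above, `checkers` would correspond to independently implemented proof checkers running in separate secure enclaves.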
Are we now at the point where we can talk about the gatekeeper?

We are now up to the gatekeeper. The next layer is: what is this solution that we can formally verify, inside the box, will be correct and safe when it comes out? My null hypothesis for what that solution looks like, although we may also just allow the superintelligence to figure out alternative architectures, is that it consists of a black-box neural network, over which the safety proof would be universally quantified; a backup controller, a smaller neural network that is only responsible for maintaining the safety and stability invariants; and a gatekeeper. The gatekeeper does runtime verification, which is the third layer of formal verification: at runtime, after you've extracted your special-purpose AI from the box, the special-purpose AI is actually verifying its own actions. The verification question there is: over the next time step, is the probability of transitioning to a non-recoverable state low enough? If it's not low enough, you switch to the backup controller, and the backup controller will always, with very high probability, be able to maintain the safety property from a recoverable state. So if an unsafe action is ever proposed that would take you outside the recoverable set, then, because at the previous time step you had proved you would still be in the recoverable set, it's not too late at that point to switch over to the backup controller. You can make what is basically a mathematical induction argument: you assume that in the initial state you're safe, and you show that if you're safe at state n, then at state n+1 you'll be safe with high probability. You can make this induction argument statically, before deployment of the special-purpose agent, but the induction argument fundamentally relies on doing a runtime check, which you couldn't really factor out, because the state space is exponentially large in the dimensionality. So you couldn't have said up front, no matter what the advanced controller does we'll always be safe and we never need to check again; but you can verify up front that the runtime verification will keep you in the recoverable set.
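A toy sketch of that runtime check and fallback, with the certified risk bound, the advanced policy, and the backup controller all replaced by stand-ins:

```python
import random
from typing import Callable

State = float          # toy one-dimensional state; "actions" here are proposed next states
RISK_BOUND = 1e-6      # per-step bound on leaving the recoverable set, proved pre-deployment

def gatekeeper_step(state: State,
                    advanced_policy: Callable[[State], State],
                    backup_policy: Callable[[State], State],
                    risk_of_leaving_recoverable_set: Callable[[State, State], float]) -> State:
    """One step of runtime verification.

    Induction invariant: the current state is in the recoverable set. The
    proposed transition is taken only if its certified risk of leaving the
    recoverable set is below RISK_BOUND; otherwise we fall back to the backup
    controller, which (by assumption) keeps any recoverable state safe.
    """
    proposed = advanced_policy(state)
    if risk_of_leaving_recoverable_set(state, proposed) <= RISK_BOUND:
        return proposed
    return backup_policy(state)

# Toy usage: the "advanced" policy sometimes proposes risky jumps; the stand-in
# risk certificate flags any proposed state with magnitude above 1 as too risky.
if __name__ == "__main__":
    state = 0.0
    advanced = lambda s: s + random.choice([0.1, 5.0])
    backup   = lambda s: 0.5 * s
    risk     = lambda s, nxt: 1.0 if abs(nxt) > 1.0 else 0.0
    for _ in range(10):
        state = gatekeeper_step(state, advanced, backup, risk)
    print("final state:", state)
```

The induction argument is then over this loop: if the current state is recoverable, each step either keeps the certified risk below the bound or hands control to the backup controller, so the next state remains recoverable with high probability.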
16:28 Implementing Safeguarded AI
One question that arises here is how far we are from implementing this system, or the stack you just described. How much of this is theoretical versus implementable right now?

This is mostly theoretical right now. In the ARIA program on Safeguarded AI, the set of people working on it is currently about 57 mathematicians in the theory technical area, Technical Area 1.1, so it's mostly theory. What we need to develop is a language for representing real-world problems as formal mathematical problems that is sufficiently expressive to cover all of the science and engineering domains we think are possible to cover and that are important, either economically or for resilience. It's a little bit like a probabilistic programming language, but one that includes continuous-time systems, partial differential equations, and other modeling frameworks that are not easily expressible as probabilistic programs with finitely many variables, and that remains declarative, so that we can do formal reasoning about it, or rather so that we can expect a superintelligent system to be able to do formal reasoning about it and prove to a human-engineered checker that its formal reasoning was correct. If we're successful in the program, we will have a demonstration, probably sometime around 2028, of being able to do this in a few different fields, and what I'm aiming for is economic value on the scale of a billion pounds a year from deploying each of these example problems.

And what is an example of success here? If you're able to create this kind of verifiable language for a domain where we can't yet do formal verification, what might that enable?

It could be much better, more efficient balancing of the electrical grid. Right now the balancing costs are on the order of 3 billion pounds a year, which has gone way up; it was more like half a billion pounds a year ten years ago. It's gone way up because it's much harder to balance a grid that is much more renewables-heavy. But as energy storage systems come online, it should be possible to use AI to intelligently move energy around in time. Right now, AI systems that are not safeguarded are, reasonably, not trusted with these kinds of decisions.
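To give a flavor of what a formally defined grid problem could look like, here is a toy battery-dispatch problem written as a linear program in SciPy. The numbers are made up, and ARIA's actual modeling language would be far more expressive (continuous time, imprecise uncertainty, and so on); the point is only that the objective, dynamics, and constraints are written down declaratively, so a proposed schedule can be checked independently of whoever produced it.

```python
import numpy as np
from scipy.optimize import linprog

# Choose hourly battery charge/discharge to minimize electricity cost while
# keeping the state of charge within bounds and never exporting to the grid.
T = 6
price  = np.array([30, 20, 10, 40, 60, 50], dtype=float)   # GBP per MWh (made up)
demand = np.array([1.0, 1.0, 1.2, 1.5, 1.8, 1.4])          # MWh per hour (made up)
cap, s0, rate = 3.0, 1.0, 1.0                               # capacity, initial charge, max rate

# Decision variable x[t]: battery charge (+) or discharge (-) in hour t.
L = np.tril(np.ones((T, T)))                  # cumulative-sum matrix for state of charge
A_ub = np.vstack([L, -L])                     # s0 + L@x <= cap   and   -(s0 + L@x) <= 0
b_ub = np.concatenate([np.full(T, cap - s0), np.full(T, s0)])
bounds = [(max(-rate, -demand[t]), rate) for t in range(T)]  # grid draw stays non-negative

res = linprog(c=price, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
assert res.success
print("charge schedule (MWh):", np.round(res.x, 2))
print("total cost (GBP):", round(float(price @ (demand + res.x)), 1))
```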
One worry I have here is that I expect only a very small number of people will be able to understand the proofs given by these various layers of the stack, so in some sense the rest of humanity will have to rely on trust in those experts. Is there any way to explain these proofs without degrading the property that they are formally verified?

First, just to separate the proofs from the specs: if the proofs are mechanically checked, and they are either happening in real time and hugely high-dimensional, or inside the box, where we don't want to leak them because they might contain steganography, then no one is going to be reading the proofs. We will rely on machine-checking them. What we will rely on humans to audit is that the proofs are of the right theorem, that the theorem actually covers the safety properties we care about in real life.

Is that an easier problem? It sounds intuitively like an easier problem, still a hard one, but is it easier?

I think it is a much easier problem. In general, in engineering, it is much easier to define what you want than to actually find an implementation and show that the implementation is correct. But it's not trivial. I think we're going to want to engage assistant AI systems at present-day, or at least non-catastrophic-risk, level to help with formalizing those specifications. But we can't rely on those systems to be perfectly reliable all the time, so we need human audits. I'm imagining large teams of humans who sign off on little pieces, a bit like the way a company like Google or Amazon does code review at really large scale: every piece of code that goes into their enormous code base is signed off by three different humans. So a little bit like that, building up a huge model. And if it is architected in a good way, one that is scalable and modular, that will facilitate exposition: explaining for each module, in blog-post form, here's what's going on. You could maybe rely on AI to generate the exposition and then just have a human check whether the exposition makes sense. Another thing I think is really useful here is that large language models, especially as they become multimodal, are basically models of human responses to stimuli. So we can optimize for finding trajectories in the state space that satisfy the current safety specifications but would be distressing to a human as modeled by an LLM, and then surface those and say: humans, I notice that your specification might be missing this criterion, because here are some trajectories that satisfy the current specification that I think you wouldn't like. We can use that as an iterative process to help refine specifications.
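A rough sketch of that specification-refinement loop; `satisfies_spec` and `distress_score` are placeholders, the latter standing in for an LLM-based model of how humans would react to a trajectory.

```python
from typing import Callable, Sequence

Trajectory = Sequence[str]   # a sequence of world states or events, however encoded

def surface_spec_gaps(candidates: Sequence[Trajectory],
                      satisfies_spec: Callable[[Trajectory], bool],
                      distress_score: Callable[[Trajectory], float],
                      top_k: int = 5) -> list[Trajectory]:
    """Find trajectories that are formally allowed but that people would hate.

    satisfies_spec : checks the *current* formal safety specification.
    distress_score : a proxy for human disapproval, e.g. an LLM asked to rate
                     how distressing the trajectory would be to stakeholders.
    The returned trajectories are shown to human reviewers as evidence that the
    specification is probably missing a criterion; the spec is then revised.
    """
    allowed = [t for t in candidates if satisfies_spec(t)]
    allowed.sort(key=distress_score, reverse=True)
    return list(allowed[:top_k])
```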
22:58 Can we trust Safeguarded AIs?
If this technological development goes the way previous technological changes have gone, I would expect that at some point the technical experts will have to present something to key decision-makers in government and so on. What would be the challenges in doing that? This is somewhat the same question I asked previously, but how would you truthfully communicate the correct level of trust we could have in these systems, given the technical results?

So, ARIA is kind of like the old ARPA, before it became DARPA: an advanced research projects agency, the Advanced Research and Invention Agency, a similar name, and we give out grants. We formulate programs of research, put a vision out there, and fund it. To unpack that a little: individuals are recruited and hired as program directors, and a program director then defines what we call an opportunity space, a zone of research and development potential that is smaller than a subfield, that is currently neglected, but that seems worth shooting for, ripe for progress, and transformative if it works out. Within that opportunity space they define something even more specific, called a program. In my case, the opportunity space is mathematics for safe AI, which also includes things like singular learning theory and agent foundations; there are other ways of trying to use mathematics for safe AI. Within that, I've defined this program around Safeguarded AI, which is a very specific bet on a particular stack of formal verification tools. Within the program, we make calls for proposals for certain thematic elements of the vision, people make proposals, and we give out grants for them to do the research. So unlike, say, NASA, we don't do much R&D within the organization itself; ARIA does not do that.

That said, ARIA also is not a regulatory body, or even a regulatory advisory body the way the AI Safety Institute is. I interact with the AI Safety Institute a fair bit, but I'm giving my perspective on where the puck is going if things go well in three to five years. They have a workstream at the AI Safety Institute on what are called safety cases; there's a new paper that came out recently, I think from the Centre for the Governance of AI, on safety cases for AI, which they've linked to. Safety cases are a well-established practice in high-risk industries like nuclear power and passenger aviation. There is some set of assumptions which are not verifiable, but for which there is evidence of various kinds, like empirical testing or some a priori argument about how physical materials behave. Those assumptions are combined by arguments that are like deductive proof steps, inference rules that are also part of the safety case: if I make these assumptions, then I can deduce this claim, and this claim now recursively serves as evidence, along with other claims, for a higher-level claim. Eventually, at the top of that tree, the safety case makes a claim like: the risk of fatality is less than one per 100,000 years. That's the quantity against which there is a regulatory requirement; there's a curve of fatalities versus the number of occurrences per millennium that is allowed to be deduced, somewhat subjectively, from assumptions that are also challenged by regulators. So that's the safety case paradigm.
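Structurally, a safety case is a tree of claims whose leaves are evidenced assumptions and whose root is the quantitative claim the regulator thresholds against. A minimal sketch, with entirely made-up content, might look like this:

```python
from dataclasses import dataclass, field

@dataclass
class Claim:
    """A node in a safety-case argument tree.

    Leaves are assumptions backed by evidence (tests, a-priori arguments);
    interior nodes are deduced from their children by an argued inference step.
    """
    statement: str
    evidence: str = ""                        # for leaf assumptions
    children: list["Claim"] = field(default_factory=list)

# Illustrative fragment ending in the kind of top-level quantitative claim
# a regulator would set a threshold against (all content is hypothetical).
top = Claim(
    "Risk of fatality from this deployment is below 1 per 100,000 years",
    children=[
        Claim("The runtime gatekeeper bounds per-step risk below the agreed level",
              children=[
                  Claim("The induction proof was machine-checked",
                        evidence="proof artifact"),
                  Claim("The world-model's assumptions hold in this context of use",
                        evidence="domain expert review, empirical tests"),
              ]),
        Claim("Containment limits the system to the approved output channel",
              evidence="formally verified OS and hardware, red-team testing"),
    ],
)

def print_case(claim: Claim, depth: int = 0) -> None:
    suffix = f"  [{claim.evidence}]" if claim.evidence else ""
    print("  " * depth + "- " + claim.statement + suffix)
    for child in claim.children:
        print_case(child, depth + 1)

print_case(top)
```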
The AI Safety Institute is exploring that for multiple different AI safety solutions, related to these different critical periods where the primary risks have different character, and I've been starting to work with them on what a safety case for Safeguarded AI would look like. I don't have an answer on that yet, but it is something we're thinking about.

And in a situation where we have these more advanced quantitative safety guarantees, those numbers would be what technical experts could present to high-ranking government officials when they're making decisions like whether this system should be allowed to run, or whether it should be deployed?

Yeah, that's right. We also have a sub-area of the program on sociotechnical integration, Technical Area 1.4, which is actually live now; if you're listening to this and you're interested in this question, take a look and see if you might want to submit a proposal. One of the things we're looking at there is how to build tools to support multi-stakeholder collective deliberation about risk thresholds, and about trading off risks against benefits when people have different opinions.

It might be interesting to talk about how the AI case compares to aviation or nuclear power, because it seems to me that AI is more complex, it touches on basically all aspects of society, and we also seem to have less time: in the aviation case we might have had, say, a hundred years to get this right, and it doesn't seem that we have a hundred years for the AI case. Maybe you can talk about the challenges there and our prospects of success.

It is true that in nuclear and in aviation, many fatal accidents occurred before these kinds of safety techniques were figured out, and if we're talking about AI at a catastrophically capable level, we can't afford many fatal accidents. So there is a sense in which this is a harder problem. We do have the advantage of mature safety practices in other industries which, a hundred years ago when aviation began, just weren't there; there was civil engineering, a little bit, but that was for static systems, not dynamic systems. So now we have mature safety practices for dynamic systems, but, as you say, only in narrowly scoped contexts of use, and now we're asking whether we can come up with safety practices that would apply to dynamic systems with open-ended, unbounded contexts of use. I do think that's very hard, and that's the reason why, in Safeguarded AI, we're focusing on a scalable workflow for generating safety guarantees in well-defined, scoped contexts of use. We're asking: can we take those practices for safety-critical engineering systems, which only apply when you have a well-defined context of use, and make them really fast and easy to engage in, with lots of AI help, across many, many contexts of use, not all of which would be considered safety-critical today, but we'll do it this way anyway, to limit the amount of dangerousness the AI could sneak through in different contexts. What that implies is that we won't be using Safeguarded AI to create a chatbot product where you can just ask anything you want.
We'll only be using the pre-catastrophically-dangerous systems as chatbots, to help us formulate the problems that are too hard for those systems to solve, so that the much more capable systems can solve them in a way that has these layers of formal verification and containment.
31:00 Formalizing more of the world
As you described, if I understood you correctly, you're also working on extending the areas or aspects of the world over which we can do mathematical proofs, the parts of the world that are well-defined enough for this kind of formal verification. Is that a large part of what you're doing in the program, and what are the most challenging areas to formalize?

For now it is a large part of what I'm doing in the program: expanding the scope of what types of problems we can formalize. Towards the end of the program there's going to be much more machine learning; we have a call for expressions of interest that is live now for Technical Area 2, which is where the machine learning will take place, and we can get into that more later. But the theoretical aspect is less about expanding the space of what we could do proofs about and more about unifying different areas of science and engineering. In chemistry or epidemiology you might use Petri nets; in civil engineering you might use finite element methods and partial differential equations; in electrical engineering you might use ordinary differential equations and Fourier analysis. So it's about taking all of these different modeling ontologies, and there aren't that many, somewhere between 12 and 20 distinct ones, and unifying them.

The other big piece on the theoretical side is managing uncertainty that is more radical than Bayesian uncertainty; we call it imprecise probability. It has arisen in a bunch of ways, and I should say, because a lot of the audience will probably be pretty orthodox Bayesian, that there have been a lot of ad hoc attempts, fuzzy logic, Dempster-Shafer theory, prospect theory and so on, to be more imprecise than probability, and they do all feel a bit janky. What I have seen is a handful of things that emerged from very different directions, with very different motivating considerations, that have all converged, in the sense that they are mathematically equivalent, even though it feels a little ad hoc to say that the epistemological state, instead of being a probability distribution, is now a convex, downward-closed subset of subprobability distributions. Why those particular adjectives in that particular order? But it turns out that if you look at the infra-Bayesianism theory that was motivated by agent foundations, from Vanessa Kosoy and Alex Appel, their epistemological state, the homogeneous ultracontribution, is equivalent to this. If you look at Joe Halpern's decision theory from twelve years ago, motivated by the observation that experts sometimes disagree about the probabilities of events, so how can we have a decision theory that doesn't resolve those disagreements and just keeps them as unknowns, that turns out to be equivalent to this. There's Paul Christiano's definability-of-truth paper from 2013, motivated by reflective consistency, which turns out to be a special case of this. There are papers in economics; there's programming language theory; and there's a pure category theory thrust, where people say these monads don't commute, but they should, so how can we weaken the notion of a distributive law and figure out the way in which probability and non-determinism can most closely almost commute? And then they found this same structure.
So I think there's a relatively canonical thing here, which is the next step in humanity's understanding of decision-making under uncertainty, and that's a big part of what we're doing. Also, I think that in general, if you have a Bayesian belief state, you have already assumed too much, even just in the prior: a prior gives you precise prior probabilities for every possible hypothesis. People always say, where do you get your prior from? Don't ask; never ask a Bayesian where his prior comes from, never ask a woman her age, that kind of thing. I think that criticism is actually right, and it has just taken mathematics a while to discover the next step. But that's a big part of what we're trying to push forward.

Sounds very interesting. It also sounds like something that happens sometimes in science, where different approaches converge to the same solution, but the new paradigm is not yet fully developed and usable, so we might have to be Bayesian for a while longer, or Bayesian for some part of the domain we're working in. Do you agree, or do you think we're further ahead, so that we might soon have a new, non-Bayesian decision theory?

I think the theory is very close; we might have the theory worked out in a year, and then what I'm hoping for is to have an implementation worked out in two more years. So I think it's not far, but I would admit we don't have a usable implementation yet.

Is this approach critical to Safeguarded AI? For the whole program to work, do you need to succeed in developing this new, more advanced form of decision theory?

I would say it is still useful to do the rest of the program and require that the safety specification make a Bayesian prior assumption; that's still better than the status quo in terms of AI safety for this critical period. But I would feel that most of the safety benefits would be compromised by needing to make a Bayesian assumption, because there will always be a gap between the distribution that you assume and reality, and if you can't cover that gap by being agnostic and making use of Knightian uncertainty, then the system in the box would be able to exploit that gap, and that could be a big problem.
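A finite toy example of the difference this makes; a real credal set is a convex, downward-closed set of subprobability distributions rather than three hand-picked points, but the worst-case ("lower") expectation already shows why a specification stated against a set is harder to exploit than one stated against a single prior.

```python
import numpy as np

# Outcomes: three possible scenarios; utility of some fixed plan in each.
utility = np.array([10.0, 2.0, -50.0])

bayes_prior = np.array([0.6, 0.3, 0.1])          # one precise distribution
credal_set = [np.array([0.6, 0.3, 0.1]),         # a finite stand-in for a convex set
              np.array([0.4, 0.3, 0.3]),         # of distributions the experts
              np.array([0.2, 0.5, 0.3])]         # refuse to choose between

bayes_value = float(bayes_prior @ utility)
lower_expectation = min(float(p @ utility) for p in credal_set)  # worst case over the set

print("Bayesian expected utility:", bayes_value)
print("Lower expectation over the credal set:", lower_expectation)
# A safety specification stated against the lower expectation cannot be gamed by
# an optimizer exploiting the gap between one assumed prior and reality, so long
# as reality stays inside the credal set.
```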
37:34 The performance cost of verified AI
If you are one of the big AGI corporations, like OpenAI, DeepMind, or Anthropic, and you're thinking about whether to implement a stack like Safeguarded AI, what are the costs of doing so? Will it, for example, take longer to train the models, or be more compute-intensive when running them?

I think probably the biggest cost is that it's just a very different paradigm, not a different paradigm for machine learning, it's not an alternative to transformers or something, but a different paradigm for the business model. It's not that you have one product that you serve on the web; it needs much more user-interface tooling, and even institutional design, and many different formal verification systems stacked on top of each other. So it's a very different way of creating a product around an AI system. In terms of the AI system itself, my guess is that the pre-training will not need to be very different. The post-training will be different, but probably more cost-effective: if I'm right and this paradigm works, you will probably actually get more capabilities inside the box by interacting with a formal verifier coupled to real-world problems than by interacting with another large language model that's trying to judge whether you've followed the constitution.

Can you say why that is? It sounds like a dream scenario in some sense, if you can couple safety to capabilities and align the incentives that way.

It's not as good as it sounds, and that's part of why we're looking for an organization with unprecedented governance robustness to develop the machine learning part of the Safeguarded AI program. But the reason I think there could be strong capabilities here is related to the progress made over the course of 2024, as the non-synthetic data used in pre-training has reached its limit, peak data, peak tokens. What we've seen is much more progress in math and coding, where tests can be done, quote-unquote, in simulation, in cyberspace, because those sources of feedback scale with compute, whereas human feedback, feedback on creative writing or on philosophy, does not scale with compute. Right now, feedback on electrical engineering or mechanical engineering does not scale with compute either; but with the Safeguarded AI tools, it would.

So in some sense, right now mechanical engineering, for example, is closer to something like creative writing or philosophy, and what you're trying to do is make it closer to something like math.

Basically, that's right, yeah.
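Schematically, the reason verifier feedback scales with compute is that the reward signal is produced by an automatic check against a formal specification rather than by human raters. A minimal sketch, where every callable is a placeholder rather than any lab's actual training loop:

```python
from typing import Callable, Iterable

def verifier_feedback_loop(propose: Callable[[str], str],
                           check: Callable[[str, str], bool],
                           update: Callable[[str, str, float], None],
                           problems: Iterable[str]) -> None:
    """Post-training against a formal verifier instead of a human or LLM judge.

    propose : the model's attempt at a formally specified problem
    check   : a machine-checkable pass/fail signal (proof checker, simulator, tests)
    update  : whatever learning rule consumes (problem, attempt, reward)

    Because `check` is automated, the amount of feedback grows with compute,
    the way it already does for math and coding, instead of being bottlenecked
    on human raters.
    """
    for spec in problems:
        attempt = propose(spec)
        reward = 1.0 if check(spec, attempt) else 0.0
        update(spec, attempt, reward)
```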
You mentioned potentially partnering with an organization that could try to implement the Safeguarded AI stack. What would that look like, what role would ARIA play, and how would such a partnership work?

Great question. As always, ARIA is a grant-making organization, so what we are proposing is to make a grant of 18 million pounds, the amount earmarked for Technical Area 2, as a single grant to a single organization. The process is roughly this: right now we're looking for expressions of interest, which is basically just your name and, at most, a few hundred words about why you're interested. At some point, when we are confident there is enough interest, we'll move into phase one, a formal call for brief proposals; if successful, we would fund three or four of them to spend a few months developing a full proposal, because actually designing the governance mechanisms is going to take time, and we don't want to make just one bet on that: we want, if possible, multiple parallel efforts to figure out what this should look like. We are not prescribing the answers; we're going to pose the questions. Questions like: what is your legal structure, what are your bylaws, why would we expect the incentives of the people in the decision-making structure for safety-critical decisions to be aligned with full consideration of both the positive and the negative externalities for society at large? We're looking for arguments for why some particular concrete legal structure and decision-making structure would satisfy that desideratum. If we get good answers to these questions, then what we would offer is a non-dilutive research grant of 18 million pounds to the new, or newly created, organization. It could be created as an affiliate of an existing organization, but we don't expect that any existing organization, with its existing legal and decision-making structure, located in the UK, would meet the bar we're setting, so it would probably be a new entity in some form. We would grant that new entity the 18 million pounds, and the contract would be: this is for doing the machine learning research that's part of the Safeguarded AI program. That organization could also pursue other agendas at the same time and raise funding from other sources; there's nothing exclusive about this. I wouldn't even really consider it a partnership; it's just funding. We want to support something that has a more robustly beneficial shape to do the capability development that is necessary for Safeguarded AI to work.

Are you optimistic that someone can design an institution, a legal framework, that satisfies what you're after here? The legal system is not close to being formally defined, so in some sense the implementation would hinge on something fuzzier than the technical aspects of your program. How optimistic are you?

I'm moderately optimistic. In fairness, there has been a lot of progress in the last ten years on improving the organizational governance structures of frontier AI labs, and I think there's simply room for more improvement, more progress on this.

Do you think implementing Safeguarded AI would prevent AI systems from acting autonomously, or would limit them in ways that make them less economically viable?

The main applications we're considering are autonomous agents, but in very specific domains: an autonomous agent that makes monoclonal antibodies doesn't do anything else. So I wouldn't say the distinction is about autonomy; it's about generality, and perhaps about ease of integration. We're not going to be able to make drop-in remote workers with Safeguarded AI, so that is a limitation.
I would say there's a minority of economically valuable work that can be defined, or that would be worth defining, in enough detail to solve those problems with Safeguarded AI. So the case that needs to be made is not: shut down the chatbots and pivot entirely to Safeguarded AI. It's more that above some capability level it just doesn't make sense to deploy a system as a chatbot. Right now, in the responsible scaling policies, the frontier safety frameworks, and so on, the frontier AI safety commitments, I guess, is the general term now being used for what companies have committed to as part of the AI Safety Summit series, most of them say that at the critical level where a system is capable of autonomous cyberattack, autonomous R&D, and persuasion, we don't know what mitigations we would apply, so if our evals show a high level of risk on these, we would stop. That's basically what they say. Now, you might be skeptical that they would actually stop at that point, but what I'm offering is an economic improvement over what they have claimed they would do: instead of just stopping, you say, okay, we'll put it in a very strong box and only use it for formally verifiable problems, to generate autonomous agents in specific domains where we can get quantitative safety guarantees.

Yeah, and I think it would be reasonable to expect demand for a product or vision like Safeguarded AI to increase in a scenario where decision-makers have to either stop completely or gain some of the benefits by proceeding more safely. You've also thought, right from the start, about the game theory of this whole situation, which is often a missing piece. You might have something that works in theory and is extraordinarily beautiful in theory, but if it's not aligned with the incentives of key decision-makers in the real world, it might not be that useful. So I think that's something that should be applauded.
47:58 Changing attitudes towards AI
How do you expect this to develop? Which events or situations would change the political mood, the zeitgeist, such that there would be more interest in something like Safeguarded AI?

Good question. Roughly, I think the perception of the perception of the potential for catastrophic risks needs to be higher.

Why the perception of the perception?

In the game-theoretic equilibrium analysis, it's almost an assumption of classical game theory that the stakes are common knowledge. If the stakes are not common knowledge, it's very hard to be confident that your opponent is aware they're playing the same game that you are. In the AI case, to make it explicit: if you are not confident that the opponent is aware of the risk, then you will assume they're going to surge ahead and take the advantage, thinking it's a prisoner's dilemma when it isn't. And if you're not confident that your opponent is confident that you know what the stakes are, you will think they will try to preempt you, because they don't think you understand that it's not a prisoner's dilemma, and so on. There are only three levels of meta, as Yudkowsky's law says; it doesn't really go infinitely deep. But you do need that kind of real-world common knowledge: okay, we've all declared that we take this seriously. I think that common knowledge exists between the leaders of the frontier labs. It doesn't exist between all of the key decision-makers in all of the frontier labs, and it certainly doesn't exist between all of the key decision-makers in all of the governments. So that's a big thing. It could also be that there's new scientific evidence, and this would be the ideal case: people talk about the potential of a model organism of misalignment arising in an autonomous way, some kind of demonstration, at a scale that is not itself dangerous, of the shape of the danger. That could be very compelling.

I agree that it might be interesting, but it has connotations for me of gain-of-function research, and it would be extraordinarily ironic if we ended up in a situation where humanity harmed itself by demonstrating a danger we wanted to prevent. Do you think this can be done safely?

I think the formally verified containment system is also helpful for this. There is something a little nerve-wracking about the way people are doing cyberattack evals right now: in a Docker container, in a VM, in a virtual network, but in the cloud. I think it was o1-preview that escaped one of those layers of containment completely unexpectedly, and everyone said, oh, that's interesting, good thing we had two more layers of containment. That does worry me a little. It would be nice if one of those layers of containment were seL4, or something else formally verified, which does exist: seL4 has been free of exploitable bugs for ten years. So it's a pretty concrete direction to say, hey, we should have NVIDIA drivers for seL4 so that you can actually run AI systems on seL4 clusters. That's something I'm talking to some people about, but it's not part of my program; it's not in scope of what I'm doing, so it's still a neglected direction.
But I think there are ways of running tests safely. I don't think we're quite there yet, but the toolkit exists; it's like the way cybersecurity researchers can safely handle state-level malware and analyze it. It just requires a very high skill level at being careful, which I think is not yet being applied to AI risk evaluations, but probably will be before it's too late. On the margin, though, more work on making sure that it is done well would be valuable.

And the idea here would be to present key decision-makers with results where a system has shown it can do something quite dangerous, but where we've contained that system so it didn't actually cause harm.

That's right.

Makes sense. Okay, maybe this would be a good point to talk about flexible hardware-enabled guarantees.
52:39 Flexible Hardware-Enabled Guarantees
Let's start from the beginning: what do you mean by flexible hardware-enabled guarantees?

This is a form of hardware-enabled mechanism for compute governance. The purpose is really to resolve the assurance dilemma: how can you be sure that a competitor you don't trust is in compliance with an agreement, any agreement that restricts AI development pathways at all? It could be that we set some maximum, no training runs larger than 10 to the N flops, but it could also be something much more elaborate, which stratifies training runs according to various levels of flops, runs capability evals that are more and more intensive at each level, stratifies again according to capabilities, requires mitigations to be demonstrated, with stratifications of how strong those mitigations need to be, and requires the mitigations to be in accordance with risk thresholds, so that there is something like a safety-case argument that, at the top level, there is a very, very low probability of anything that might cause a thousand fatalities, or whatever. The reason I'm going for flexible hardware-enabled governance is that we don't know yet, and societies and governments are not yet in a position to make long-term decisions about what the shape of those agreements should be; that needs to come later. When people say we don't know enough about the science to decide what the rules of the road should be, there's some truth to that: we don't know enough to decide the details. So what I'm aiming for is to develop the hardware, because hardware has such a long lead time. I think it's quite urgent now, even though it is not yet time for any kind of international agreement about AI, to develop the hardware platform that would suffice to guarantee assurance for whatever that agreement turns out to be, years later, after the hardware is built.

That seems like a very difficult problem to handle when you don't have the specifications of what you want to guarantee. How are you measuring progress here?

The lovely thing is that what we're trying to guarantee is about compute, and compute is universal. One way of answering the question, and it's more complicated than this, is simply: we have a co-processor which runs the policy and decides whether to permit proceeding with running the code, and that co-processor can be updated according to a multilateral decision-making process. That's basically the answer. It's quite different, in both positive and negative ways, from previous assurance dilemmas that have been resolved around chemical weapons and nuclear material. It's less favorable than those in that computations do not leave a distinct physical trace, a signature of one computation versus another. Of course they consume a certain amount of energy, but lots of things consume lots of energy; that's not a very distinctive trace you can track, and it certainly doesn't help you distinguish between two different computations running on the same hardware, training versus inference, say.
On the other side of the coin, and I think much more significantly, the favorable thing about trying to govern compute is that the compute itself can be programmed to do the governing. Not with current hardware, it does require some tweaks, but there would be no conceivable way to take a kilogram of uranium and configure it such that it refused to be incorporated into a nuclear weapon and would only be used peacefully. Whereas if you have a computing system with hardware for detecting tampering and disabling itself when it's being tampered with, for analyzing what it's being used for, for doing cryptographic verification, and for coordinating with the other parts of the system to verify that the whole cluster is being used in accordance with the rules, then you can do a lot more than the traditional approach that relies on physical spot checks, supply chains, and tracking where everything is going geographically. In fact, I think it's not necessary to track where everything is geographically, or to have surveillance and monitoring of what all the compute is being used for. The hardware governance systems that rely on that, which are basically hardware surveillance systems that can then be used to do verification in a centralized way, have serious downsides: individuals, particularly in the US, would object on the basis of individual liberty, and internationally, countries would object on the basis of sovereignty. So we're trying to provide a technical option for compute governance that is actually completely privacy-preserving, and even to some extent liberty-preserving, in that there should be certain kinds of computations you can do regardless of what the multilateral governance body decides, some kind of safe harbor: these computations are small enough that there is no way they're a catastrophic AI risk. There should be no way of remotely disabling devices entirely, or of changing the rules to be arbitrarily restrictive, but there should be ways of changing the rules to some extent for large computations, and of adapting as we learn more about what types of risks emerge at what scale, how to evaluate them, and how to mitigate them.

Can you say more about how your system would preserve privacy and liberty? You also mentioned that it's not necessary to track where all the advanced chips are, for example, or whether they're being used in large training runs.

Basically, the principle is that the entire loop of, one, figuring out what's going on, two, deciding whether it is in compliance with some set of rules, and three, if it's not, doing something about that to stop it from happening, can happen within the device, in a way that is resistant both to tampering by the physical owner and to exploitation by the decision-making process that adjusts things over time. That's basically the answer. The information used in deciding whether something is safe, or in compliance with safety principles, doesn't need to leave the device, if the device itself can be trusted, both by its owner and by the world at large, to fulfill its intended function.
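A toy sketch of that on-device loop; the thresholds, the claim fields, and the three possible decisions are illustrative, not a real FlexHEG policy.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class WorkloadClaim:
    """What the device itself observes about a proposed job (never exported)."""
    training_flops: float
    eval_results_green: bool      # latest required eval suite passed

def flexheg_gate(claim: WorkloadClaim,
                 policy: Callable[[WorkloadClaim], str]) -> str:
    """The whole loop: observe, check compliance, enforce, all on-device.

    `policy` is the updatable, multilaterally governed rule set running on the
    co-processor; the decision is one of "permit", "throttle", or "halt", and
    none of the workload details need to leave the device.
    """
    return policy(claim)

# A made-up policy of the kind described in the conversation (numbers illustrative):
def example_policy(claim: WorkloadClaim) -> str:
    SAFE_HARBOR_FLOPS = 1e20        # small computations are always permitted
    if claim.training_flops < SAFE_HARBOR_FLOPS:
        return "permit"
    if not claim.eval_results_green:
        return "halt"               # pause until the required evals pass
    return "throttle"               # e.g. weights stay encrypted to FlexHEG devices,
                                    # inference capped at some tokens-per-second limit

print(flexheg_gate(WorkloadClaim(5e19, True), example_policy))   # -> permit
print(flexheg_gate(WorkloadClaim(1e24, False), example_policy))  # -> halt
```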
What is something that we could prevent using flexible hardware-enabled governance? Could we prevent specific capabilities within systems, for example?

It's possible to build evals into a policy and say: if you're above a certain threshold, then for every 10^20 flops of training that you do, you have to run this suite of evals, and if the suite comes up red, you have to pause. And then there are various ways of handling cases where evals are concerning, or where the policy says this is high risk. It could be that you just stop and say, we're going to have to delete these weights and start over — that would be the most extreme. The least extreme is to say, okay, this is fine, but we're going to encrypt the weights such that they can only be run on other flexible hardware-enabled guarantee hardware, and they have to run with a speed limit, so you can only do inference at, you know, 70 tokens per second or whatever. And then there's a whole spectrum of things in between. Really, the hope is that there are lots of things I wouldn't even have thought of that you could implement, because we're just implementing a general-purpose computing system for determining what the rules are.
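Here is a minimal sketch of that checkpoint-and-respond pattern. The 10^20-flop cadence and the 70 tokens-per-second figure are the illustrative numbers from the conversation; the outcome labels and the response wording are hypothetical.

```python
# Illustrative sketch of training-time eval checkpoints with graduated responses.
# The 1e20-FLOP cadence and 70 tokens/s cap are the illustrative figures from the
# conversation; outcome labels and responses are hypothetical placeholders.

EVAL_INTERVAL_FLOPS = 1e20
INFERENCE_SPEED_LIMIT_TPS = 70

def checkpoint_response(eval_outcome: str) -> str:
    """Map an eval outcome to a graduated response."""
    return {
        "green":  "continue training",
        "yellow": "release weights only in encrypted form, runnable solely on "
                  f"FlexHEG hardware with inference capped at {INFERENCE_SPEED_LIMIT_TPS} tokens/s",
        "red":    "pause training pending further mitigation",
        "severe": "delete the weights and start over",   # the most extreme option
    }[eval_outcome]

def governed_training(train_step, run_evals, total_flops, flops_per_step):
    """Run training, stopping at each checkpoint to consult the eval suite."""
    done, since_eval = 0.0, 0.0
    while done < total_flops:
        train_step()
        done += flops_per_step
        since_eval += flops_per_step
        if since_eval >= EVAL_INTERVAL_FLOPS:
            since_eval = 0.0
            outcome = run_evals()
            print(f"checkpoint at {done:.2e} FLOPs -> {checkpoint_response(outcome)}")
            if outcome != "green":
                break
```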
What do you think of the different proxies we have for determining whether a system is behaving in dangerous ways? You mentioned evals, where we test for specific capabilities or events within the system. Those are of course not perfect: you're trying to capture something with an evaluation suite, and you're capturing a proxy of what you're attempting to capture. The same goes for something like total compute used in a training run — it's also a proxy for what you're actually aiming at, which is harm or danger within a system. To what extent is our ability to measure what we're trying to prevent the limit here?

A couple of things on that. One is that I think it's important to distinguish between capability evaluations and what some people call propensity evaluations — or you might call them mitigation evaluations or safeguard evaluations — where you've done some post-training to the system to try to stop it from actually exercising its dangerous capabilities in practice, and then you're trying to determine whether there is still a way of eliciting the latent capability. Those are much trickier. Neither is perfect, but if you have only been doing pre-training — only autoregressive training — and you want to determine whether a capability is present, I think that is much easier than determining whether there's a way of eliciting a capability after you've done post-training and made it so that all the simple ways of eliciting the capability don't work, because the model says "sorry, I can't help you with that." Then discovering the jailbreaks is a little bit trickier. I do think there's quite a bit of hope in automated red-teaming, where you have some other AI system that's very clever, that maybe even uses non-black-box gradient updates, and that searches for a prefix that causes the system to still exercise its dangerous capabilities. But certainly the standard conception of evals as a questionnaire — you just feed these prompts in and check whether the answers are concerning — does not work for mitigation evaluations; it's very easy to game. For capability evaluations, though, if your training follows the pattern of an autoregressive process whose objective function is basically cross-entropy, then I think it's not that easy to game the capability evaluations. So that's an important distinction.

And then there's an important point to make about soundness, and about adjusting the trade-off between false positives and false negatives. Compute thresholds are very much an imperfect proxy for capabilities, but even if we can't reliably predict that a certain capability arises at a certain compute scale, we can be fairly confident that catastrophic capabilities do not emerge at 10^20 flops of autoregressive compute. So we can adjust things such that we get lots of false positives — lots of systems that are over the threshold of concern but are actually not concerning — and very few false negatives. And then I think the principle should be that we do lots of layers of that: if you're over a compute threshold, that just means you now need to do capability evaluations; if you're over a capability threshold, that just means you now need to do automated red-teaming; and if the automated red-teaming finds something, that means you go back to the drawing board and figure out a different mitigation approach — which could be, for example, Safeguarded AI in the extreme, as a way of completely containing the capability and making it only do the things you ask for.

What's the motivation behind thinking about compute governance like this? You've mentioned to me before that you're interested in buying time to do the research necessary to make systems safe. Is this approach to hardware governance about buying more time for safety research?

I don't usually think in terms of buying time. I think in terms of game-theoretic stability — of stabilizing a Pareto-optimal Nash equilibrium where everyone follows a safer strategy. And the safer strategy is not necessarily something like "we will stop for n years and hope that by the end of n years there will be more progress on safety"; I don't think that's a very sophisticated strategy. I think the strategy should be more along the lines of "we will not deploy systems without a commensurate level of safety for the next n years, and after n years we will re-evaluate what our balance is." So it's really about stabilizing a certain strategic orientation for a certain period of time, rather than about buying time and pausing something until some future date.

Does this approach require buy-in from, say, Nvidia, TSMC, ASML, and so on — the major players in the chip supply chain?

A flexible hardware-enabled guarantee is useful for game-theoretic stability only insofar as one can reasonably assume that most of the compute available in the world for doing frontier training has this guarantee built in. It's a fortunate feature of the current landscape that there are big bottlenecks in the production of frontier AI compute, so it's possible to get some assurance that all of the frontier compute is being made with the guarantee in place. I think there are a bunch of places one could intervene to get that kind of guarantee.
But in the long run, I think it's similar to how some people think about an AI agreement, where part of the agreement is that every party needs to know where all the data centers in the world are and needs to be able to send inspection teams to go and check them out. I talk a lot about privacy preservation and removing the necessity of knowing where all of the compute in the world is, but you do still need to know where all of the frontier compute fabs in the world are, and to be able to inspect those and see that they're not making compute which lacks the hardware-enabled guarantee.

Is there anything we haven't covered on Flexible Hardware-Enabled Guarantees that we should cover here?

Maybe I'll say a little bit more about the subsystems of Flexible Hardware-Enabled Guarantees that are necessary to make it work. We talked about the secure co-processor, which actually assesses whether code satisfies the current set of rules. There also needs to be a process for updating the set of rules, which I think of as being pretty closely analogous to a smart contract: there's a set of stakeholders who need to reach a quorum in order to update the current set of rules. The current set of rules can also change over time — it can depend on wall-clock time — and there can also be restrictions on what the next set of rules can be. As with a smart contract, it can say: if you propose a new version of the rules, here are the meta-rules for how that version has to look in order for it to be a valid update. So that's another piece.

There's a piece around physical tamper detection, which I think was maybe given up on some 15 or 20 years ago by the hardware industry. There was interest in tamper detection in the '90s and early 2000s, and eventually everyone said, well, there's just always some way around it. There is a sense in which it's quite different from pure cybersecurity and formal verification — I think it would be very difficult to be 100% confident that there are no exploitable bugs in physical tamper detection — but I also think there are some technologies now that make it much more favorable than it was the last time it was taken seriously. In particular, because of 6G being on the horizon, because of cellular communications, we now have radios that are much more sensitive to millimeter-scale perturbations of metal, so you can sense from the inside, almost like a radar, whether there is any penetration of the metal case of a server. There are also, thanks to smartphones, much cheaper sensors: we can put lots of cameras on the inside, with little AI chips looking for visual anomalies, plus thermal sensors, acceleration sensors, and so on. There's just a lot that can be done, and we can enumerate all of the possible physical attacks that one could use to try to get inside a box and mess with it, and make sure that we can detect all of those before they can disable the tamper-detection mechanism itself.

And then there's a tamper-response mechanism, which is baked into the accelerator — the AI chip itself: millions of little nanoscale fuses, and as soon as the signal from the tamper-detection system that says "everything is okay" goes away, all of those little fuses get burned out by charge stored in local capacitors, and the absence of those fuses makes the chip unusable — it won't run. It would be possible in principle to go in and repair all the fuses, but it would be extraordinarily costly — hundreds of millions of dollars. And we can then have a layer of physically unclonable functions that sandwiches the chip from top and bottom, such that if you did go in and repair all the fuses and tried to put everything back the way it was, it would be impossible to put it back in such a way that it could still cryptographically attest to its integrity — that it would still have the same private key. So that's another subsystem that needs a little bit of R&D.
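Going back to the rule-update piece: as a minimal sketch of that smart-contract-like process — with signature verification stubbed out and all names, quorum sizes, and meta-rules being hypothetical placeholders — an update might be accepted only when enough stakeholders sign it and the current rules' meta-rules permit it:

```python
# Illustrative sketch of the smart-contract-like rule-update process: a proposed
# rule set takes effect only if (a) a quorum of listed stakeholders has signed it
# and (b) it satisfies the meta-rules baked into the current rule set. Signature
# verification is stubbed out; all names and values are hypothetical.

from dataclasses import dataclass

@dataclass
class RuleSet:
    version: int
    safe_harbor_flops: float     # computations below this are always permitted
    stakeholders: list[str]      # parties whose signatures count towards quorum
    quorum: int                  # number of signatures required for an update

    def meta_rules_allow(self, proposed: "RuleSet") -> bool:
        # Example meta-rules: versions only move forward, and no update may shrink
        # the safe harbor, so the rules can't become arbitrarily restrictive for
        # small computations.
        return (proposed.version == self.version + 1
                and proposed.safe_harbor_flops >= self.safe_harbor_flops)

def try_update(current: RuleSet, proposed: RuleSet, signatures: set[str]) -> RuleSet:
    valid = signatures & set(current.stakeholders)   # a real system verifies cryptographic signatures
    if len(valid) >= current.quorum and current.meta_rules_allow(proposed):
        return proposed
    return current

current = RuleSet(version=1, safe_harbor_flops=1e20,
                  stakeholders=["A", "B", "C", "D", "E"], quorum=3)
proposed = RuleSet(version=2, safe_harbor_flops=1e20,
                   stakeholders=["A", "B", "C", "D", "E"], quorum=3)
print(try_update(current, proposed, {"A", "C", "E"}).version)   # 2: accepted
print(try_update(current, proposed, {"A", "C"}).version)        # 1: quorum not met
```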
These technologies, I would say, are not exactly speculative, but they've not all been used together in the way that I'm proposing they could be, so there's a system-integration aspect to this that needs to be fleshed out. But I think it's a matter of maybe 20 engineers who are very highly skilled in their respective fields working on this for 12 to 18 months. I don't think this is a research problem exactly; it's more of an engineering, system-integration problem.

Do you think the way we develop advanced AI will change such that total compute is no longer the limiting factor? For example, maybe advanced systems will begin to make much more use of inference-time compute and become smarter that way, or maybe we'll develop better algorithms such that we can get the same level of capabilities from lower-end hardware or much less compute. The worry here, of course, is that hardware governance — compute governance — will not be relevant in those worlds.

A few points on that. One is that I think the inference-time compute paradigms become useful only after a very large amount of pre-training. So I think it is fair to say we shouldn't expect pre-training runs of 10^35 flops, because we should expect that a lot of those orders of magnitude of effective compute will come at inference time instead. But I don't think we should expect that there will be ways of using inference-time compute to take, say, Llama 2 and make it catastrophically dangerous. So when thinking about hardware governance, we need to think not just about limiting the scale of large training runs; even for medium training runs, one might say: encrypt the weights, require that inference also take place on flexible hardware-enabled guarantee chips, and have inference-time governance — basically inference speed limits. You could even have a kind of token system — in a different sense of "token" from an LLM token, more like a crypto token, or really more like taxi medallions — where you need to have one of these medallions on your system in order to run inference on this big model, and there are few enough of those medallions in the world that we're not concerned about this becoming, as some people put it, a nation of 10 billion geniuses. Okay, maybe you could have a nation of 50,000 geniuses, and we think we can deal with that — but there's some limit on how much inference-time compute is permitted per unit of real-world time.
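As a minimal sketch of what that inference-time governance could look like — combining the medallion idea with a token-bucket speed limit — something like the following, where the 70 tokens-per-second figure is again the illustrative number from the conversation and everything else is a hypothetical placeholder:

```python
# Illustrative sketch of inference-time governance for weights that run only on
# FlexHEG hardware: (a) possession of one of a limited number of "medallions" is
# required at all, and (b) a token-bucket rate limiter caps the generation speed.
# The 70 tokens/s figure is the illustrative number from the conversation.

import time

class InferenceGovernor:
    def __init__(self, has_medallion: bool, tokens_per_second: float = 70.0):
        self.has_medallion = has_medallion
        self.rate = tokens_per_second
        self.bucket = tokens_per_second            # allows a short initial burst
        self.last_refill = time.monotonic()

    def may_emit_token(self) -> bool:
        if not self.has_medallion:
            return False                           # no medallion, no inference at all
        now = time.monotonic()
        self.bucket = min(self.rate, self.bucket + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.bucket >= 1.0:
            self.bucket -= 1.0
            return True
        return False                               # over the speed limit: caller must wait

governor = InferenceGovernor(has_medallion=True)
emitted = sum(governor.may_emit_token() for _ in range(1000))
print(emitted)   # roughly the burst size (~70) before the speed limit kicks in
```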
And what about the question of algorithms? Would you expect those to improve in efficiency to the point where compute is no longer the limiting factor?

My belief about this is grounded in the human brain, and in the observation that evolution has been restricted in many ways — in terms of the materials it can use to construct intelligence, the reliability of spatial patterning, the way it distributes energy. So yes, there are lots of ways in which the brain is not physically optimal. But the algorithms, I think, are not very constrained in terms of evolution's ability to access the design space. And the brain, over the course of childhood, is doing somewhere around 10^25 to 10^26 flops. So my guess is that there is not that much headroom for algorithmic improvements to go more than a couple of orders of magnitude more efficient than that. That's a reason I tend to be more optimistic about compute governance than others. Algorithmic progress definitely happens and is still happening, but I think that vector of progress will probably saturate pretty soon, whereas scaling pre-training — and, above the threshold of pre-training where it becomes viable, scaling inference — will become the main vectors of improved capabilities instead.
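For a sense of where an estimate in that range can come from, here is one back-of-envelope decomposition. Every factor below is a rough assumption of mine — per-synapse operation counts and firing rates are contested and span orders of magnitude in the literature — and it is not necessarily the decomposition David has in mind:

```latex
% One illustrative decomposition (all factors are rough assumptions):
% ~1e14 synapses x ~1-10 spikes/s x ~100 FLOP per synaptic event
% gives ~1e16-1e17 FLOP/s, sustained over ~18 years (~6e8 s) of childhood.
\[
  \underbrace{10^{16}\text{--}10^{17}\ \tfrac{\text{FLOP}}{\text{s}}}_{\text{assumed brain compute rate}}
  \times
  \underbrace{6\times 10^{8}\ \text{s}}_{\approx 18\ \text{years}}
  \;\approx\;
  6\times 10^{24}\text{--}6\times 10^{25}\ \text{FLOP},
\]
% i.e. on the order of $10^{25}$--$10^{26}$ FLOP over the course of childhood.
```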
Is it plausible that an evolved system would be near optimal? I normally think of engineered systems as almost always better. Do we have another example where an evolved system is close to optimal — maybe something like energy consumption while walking, or the brain's energy consumption?

Yes — if you look at the efficiency of the eye at converting photons into information, that's pretty close to physically optimal. Photosynthesis, too. Or look at the way that birds orient to the Earth's magnetic field: for a long time that was believed to be a myth, because it seemed physically impossible to have that level of sensitivity. It turns out to be quantum sensing — so it is possible, but it is very close to optimal.

Wait, is that real? That sounds straight out of some interesting conspiracy theory, combining birds with something quantum.

Yeah — I think people get confused about this because there are hypotheses that the brain is doing quantum computing and that's how it becomes conscious, which is obviously silly if you think about the coherence time: any kind of quantum coherence at body temperature would decay within a millisecond, so there's no way that a moment of consciousness, which is clearly at least five milliseconds long, could be quantum coherent. However, sensing can totally happen within a millisecond. So people say there's a very clear first-principles argument why the nervous system can't be quantum — and it's like, no, it can't be doing quantum computing; quantum sensing is completely different. And, for the record, I think this is true about technology as well: quantum sensing technology is much more fruitful and is going to be much more important than quantum computing technology, for similar reasons — you don't need to maintain coherence for such a long time.

What would be an example of quantum sensing in product form?

I haven't thought too much about this, but things like quantum Hall-effect sensors for magnetic fields, or maybe improved quantum-efficiency photodetectors for medical imaging or astronomy.

Interesting. All right — if we think about Safeguarded AI and Flexible Hardware-Enabled Guarantees as two research directions or programs over the next decade, what does success look like, and which challenges do you anticipate along the way?

Success for the Safeguarded AI program itself is basically convincing key decision-makers that Safeguarded AI is a viable strategy for extracting economic and security benefits from advanced AI without catastrophic risk. As with most ARIA programs, it's about pushing the edge of the possible and changing the conversation about what is feasible. What that hopefully manifests as is agreements that involve Safeguarded AI being among the options generally recognized as safe for systems within some band of capability levels on various dimensions. The success case for flexible hardware-enabled governance is, first of all, that most compute in the world ten years from now is FlexHEG-enabled, and second, that there is a reasonable, and generally recognized as legitimate, governance process for refining and adjusting the rules over time.

And what would be the main challenge to having most hardware in the world be FlexHEG-enabled?

Well, there are a small number of key players, and various quorums among those key players would be sufficient to make that happen, at least for now, for the next few years. But I think it's mainly about convincing people that it's possible. Right now a lot of people think FlexHEG is not possible, either because it's not possible to do tamper detection or because it's not possible to do cryptographic verification of cluster-scale properties. I'm very confident that the second one is possible and is just a matter of engineering. Tamper detection is more of an open question, to be honest. It could very well happen in a year or two that I become convinced, after a bunch of attempts have fallen to some national-lab hackers, that we actually can't make state-level tamper responsiveness. So I guess that is probably the main challenge: actually making it state-level tamper-responsive, and convincing people that that's the case.

Do you worry about companies that may be opposed to hardware governance — maybe an example could be Meta, and I'm not saying they actually are, but you could imagine they would be — that such companies could simply develop their own chips in-house, set up an alternative, vertically integrated supply chain, and then have a competitive advantage?

It's very, very expensive. I don't think it's really a viable option for anyone over the next five to ten years to develop an alternative supply chain, and I think it becomes a worthwhile bargain to accept FlexHEG-style restrictions in exchange for having access to the supply chain and the technology base that exists. That would not necessarily be the case with some of the more heavy-handed hardware-governance approaches that involve monitoring where everything is and all the computations that everyone is running on it.
And the argument I would make to Meta is to say: look, at some point your frontier models will be capable enough that it is obvious to everyone that it would be irresponsible to actually make the weights available unencrypted. But you're absolutely right that it's important for individuals to be able to run their own language models, to have privacy, to have some extent of control over the system prompt and so on, and to be able to fine-tune and customize — to their language, for example. So what's the way we can have the best of both worlds? Well, it's something like this: you have a secure enclave on the processors, where you can distribute a model freely — free of charge — but in a way that's encrypted, so it can only be run on FlexHEG-enabled chips, and it runs with built-in safeguards that can't be removed, but also with complete privacy and with some level of customization.

Fantastic. I think this concludes our conversation on your most recent work, and now that I have you, maybe we could move into some of the other work you've done in the past and talk about your life story, because there are a lot of interesting things going on there.
01:24:15 Mind uploading
You've done work on brain uploading in the past — that's been an interest of yours since maybe a decade ago, you mentioned. How has your perspective evolved on brain uploading, or on the possibility of that approach working at all?

This was a long time ago, long before I was associated with ARIA, and it's completely separate from what I'm doing now. But at one point, maybe from 2010 to 2013 or so, it appeared to me that AI was a bit stuck. I was a little bit late to notice that deep learning was working — actually, someone who was quite early told me in 2010 that deep learning was very promising and probably going to work, and I dismissed it. I made a mistake there, and it took me quite a few years to notice: it was 2013 when I noticed, and I should have noticed in 2012 at least. I mean, even 2013 is quite early to notice that deep learning is taking off — 2010 would have been legendarily early.

So during that time it seemed to me like there were basically no promising pathways towards de novo AGI. And yet that was when optogenetics was starting to take off — the technology of genetically engineering biological neurons to light up in proportion to their level of activity, to literally emit fluorescent light, and also to be controllable, so that they receive light and translate it into spike trains. So it seemed like there was a new potential way of developing machine intelligence that was not completely artificial, but rather an emulation of biological neural networks — a very, very faithful emulation of a specific biological neural network. That's the way I would define mind uploading.

I worked on this for the simplest nervous system known to science, the nematode worm C. elegans, which has a total of 302 neurons in its entire body — exactly 302, unless it has a mutation, so it's very stereotyped. Despite that, it's capable of learning a little bit. It's capable of learning to be averse to a particular scent, like methane or carbon dioxide, and it's capable of learning to be attracted to a particular scent: if, in its larval stage — which is like its childhood — it was detecting that scent at the same time as it was detecting food, then it learns an association that the scent is indicative of food in its environment, and in adulthood it's attracted to locomote towards the source of that scent.

So this was something I thought we could demonstrate: you train an actual worm in its larval stage to be attracted to some particular scent; then, in its adult stage, you perform an uploading process, which would involve optogenetically stimulating all of the neurons in various random patterns and observing the effect each has on all of the other neurons; then you build up a model of the coupling coefficients between all of the neurons as an ordinary differential equation; and then you run that differential equation in a simulation where it's hooked up to a simulated body using soft-matter physics, and show that in that virtual environment it exhibits the behavior of being attracted to that particular odorant.
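To make the modelling step concrete, here is a minimal toy sketch of the coupling-coefficient idea: drive a (here, simulated) network with random stimulation, fit a linear rate ODE by least squares, and check that the couplings are recovered. The real proposal involves nonlinear dynamics, a soft-body simulation, and actual imaging data; this is only the skeleton of the fitting step, with every number invented for illustration.

```python
# Toy sketch: fit the coupling coefficients of a linear rate ODE from random
# optogenetic-style stimulation data, then the fitted W could be run forward in
# simulation. Real C. elegans recordings would replace the toy generator here.

import numpy as np

rng = np.random.default_rng(0)
n, dt, steps = 302, 0.01, 5000          # 302 neurons, Euler step, recording length

W_true = rng.normal(0, 0.3 / np.sqrt(n), (n, n))   # unknown "ground truth" couplings

def step(v, u, W):
    # Linear rate dynamics: dv/dt = -v + W v + u(t), with u the stimulation input
    return v + dt * (-v + W @ v + u)

# Record responses to random stimulation patterns
V = np.zeros((steps, n)); U = rng.normal(0, 1.0, (steps, n))
v = np.zeros(n)
for t in range(steps):
    V[t] = v
    v = step(v, U[t], W_true)

# Least-squares fit of W from the discretised dynamics:
# (v_{t+1} - v_t)/dt + v_t - u_t = W v_t
targets = (V[1:] - V[:-1]) / dt + V[:-1] - U[:-1]
W_hat, *_ = np.linalg.lstsq(V[:-1], targets, rcond=None)
W_hat = W_hat.T     # lstsq solves V W^T = targets, so transpose back

print("relative error in recovered couplings:",
      np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true))   # near zero here
```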
So how did it go? The optogenetics was basically working — it wasn't completely mature. It was hard to get the indicators localized to the center of the cell, and because neurons are very closely packed together in real systems — there's a very small gap, almost no gap at all, between adjacent neurons — there's an image-processing problem. It's very ironic that I was a deep-learning skeptic at the time, because it might have been possible to solve this image-processing problem with deep learning even then. At the time I tried to approach it with Bayesian inference, and that did not work — or rather, it couldn't have worked in real time. To really do what I had planned, it would need to infer in real time what the state of the system is, so that it could do automated experiment design in closed loop, because you only get about an hour of optogenetic manipulation: at least at the time, the optogenetics were not super sensitive, so you had to use very strong lasers, and after about an hour of that the worm was pretty damaged by the lasers. So you don't get much time to get a reading of healthy behavior from the neural network. It would have needed to be truly closed-loop, automated experimental design to optimize the informational efficiency of every stimulation — and the computing side of it, doing that image processing and analysis in real time, did not work.
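For a flavor of what that closed-loop experiment design means, here is a minimal toy sketch under a linear-Gaussian model: the expected information gain of a candidate stimulation pattern x reduces to a monotone function of x^T A^{-1} x, where A is the current posterior precision of the design, so the loop greedily picks the most informative pattern each round. This is my own illustration of the general idea, not the method actually attempted at the time.

```python
# Toy sketch of closed-loop (greedy, information-maximising) experiment design:
# choose the next stimulation pattern expected to be most informative about the
# linearly modelled couplings, given a limited stimulation budget.

import numpy as np

rng = np.random.default_rng(1)
n, n_candidates, budget = 302, 200, 300   # neurons, patterns per round, stimulation budget

A = np.eye(n)                             # current posterior precision over the design space
for _ in range(budget):
    # Candidate stimulation patterns: random sparse subsets of neurons to drive
    candidates = (rng.random((n_candidates, n)) < 0.05).astype(float)
    A_inv = np.linalg.inv(A)
    # Expected information gain is monotone in x^T A^{-1} x for each candidate x
    gains = np.einsum("ij,jk,ik->i", candidates, A_inv, candidates)
    x = candidates[np.argmax(gains)]
    # ... here one would apply the stimulation, record responses, update the fit ...
    A += np.outer(x, x)                   # precision update after observing design row x

print("rank of accumulated design information:",
      int(np.linalg.matrix_rank(A - np.eye(n))))
```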
There's a funny thing, though: someone I was working with on this in 2011 said, it's going to be too hard to interpret these blurry images we're getting from light-sheet microscopes, but ten years from now, as camera technology advances, with spinning-disk confocal microscopy you should be able to get clean enough images that you can use ordinary computer-vision techniques to just segment out the cells. And it turns out he was exactly right, and he did it, in 2022 — Andy Leifer. He used spinning disk with the more advanced cameras that existed in 2022 and did this process of cataloguing and measuring the coupling coefficients between almost all of the neurons — I shouldn't say all; it's actually still not quite done, but something like 250 of the 302 neurons. So the thing I proposed to do is still a little bit incomplete, but people are starting to be more interested in it now. Just over this past year I've heard a couple of people suggest that, now that Leifer and others have shown the technology basically works, it's just a matter of making another push — the way I did, way too early — to create a project whose goal is to finish it and demonstrate in simulation that it does preserve learned behavior. So I think that's on the horizon for C. elegans.

That's 302 neurons, and the scalability is less than linear: the larger a system is, the harder it is — because it's three-dimensional — to get precise imaging of what's going on in the interior of a nervous system. So as we get up to the scale of even a mouse brain, there is — again, qualified — almost no physically feasible way of reading out all of the neurons at once, like we can now in C. elegans. It would basically require circulating something through the vasculature, which could be stents — fixed structures that are biocompatible and flexible — or some kind of microscale robots, things that go by the name of neural dust, and those would then need to be powered and communicate via ultrasound, because if they were powered by radio, that would be too much energy and it would damage the brain — but ultrasound could work. Then you could do some of these experiments. But it would also be difficult to extract all of the coupling coefficients, because there are so many synapses in a human brain, and there's only so much time you have to run an experiment before a human dies — even if everything's biocompatible, humans have a finite lifetime. So it's very, very challenging. My guess is that AI systems will accelerate things a lot and maybe come up with new solutions that we hadn't thought of, but it seems pretty clear at this point that we're going to have de novo superintelligence before we have machine intelligence that's emulating human nervous systems. That seemed clear to me already six or seven years ago, and that's basically why I don't spend much time on this direction anymore. It's something we could explore, that we should think about, once we get through this acute risk period of superintelligence.

Do you think there's any hope of using brain emulation or mind uploading as a way to elicit the preferences that we might not be able to express, such that those preferences can be used for training AI systems to behave in ways we would like?

It's a good question. Again, I think mind uploading per se is just way too hard to be useful on that timescale. People are talking about something called low-fi uploading, which I think is a bit of a misnomer — I would just call it imitation. Large language models are doing reasonably accurate, if not precise, imitation of human linguistic behavior, and if you fine-tune an LLM on the writings of a particular individual, it does become kind of precise: you can ask the LLM, "what do you think about this?" and get a pretty good prediction of what that person's opinion would be. But it's not perfect, and in particular I think it is not at all a safe assumption that this type of imitation would generalize robustly to unprecedented types of situations and unprecedented questions. So I think it has limited usefulness from the perspective of extracting preferences.

I think it's much more useful as a means of extracting surprise. If you look not at the output that follows asking a question, but at the logits that are accumulated as the question is processed — how surprising is it that this would be a question I am asked — that, I think, is quite reliable. If you ask a completely unprecedented question, I think you'll be able to tell, mechanistically, by running it through an LLM, that this is not a question humans are typically expecting. And then you can use that as a way of guiding a process of refining a specification, such that it doesn't end up in situations that are really unprecedented and hard to make judgments about.
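As a minimal sketch of that surprise signal, one can score a question by its mean per-token negative log-likelihood under a language model. The snippet below uses the Hugging Face transformers library with gpt2 purely because it is a small, readily available model; it is my illustration of the idea, not a description of any system David has built, and the example prompts are invented.

```python
# Minimal sketch of the "surprise" signal: score a question by its mean per-token
# negative log-likelihood under a language model. gpt2 is used here only as a
# readily available example model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal_per_token(text: str) -> float:
    """Mean negative log-likelihood (nats/token) of the text under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the returned loss is the mean cross-entropy
        loss = model(ids, labels=ids).loss
    return loss.item()

ordinary = "What is the capital of France?"
unprecedented = "Should the third lunar polity inherit the orbital water debts of its founders?"
print(surprisal_per_token(ordinary), surprisal_per_token(unprecedented))
# A markedly higher score on the second question flags it as the kind of
# unprecedented situation where imitation-based preference extraction is least reliable.
```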
And that would be useful because people in general do not like to be surprised — so maybe you can gain some information about what a system should do based on how much that system surprises a human?

It's not directly a value judgment — it's not that minimizing surprise is an ethical imperative I'm asserting. Active inference would maybe say that minimizing surprise is the main thing, but I don't think that's quite right; I think that's actually a misinterpretation of the underlying mathematics. What I do think it's useful for is this: surprising questions are hard to answer well, so it's a guide not to what is valuable or what people like, but to where we can be confident that we know what people like — and where we can't be confident, we should be cautious.
01:36:14 Lessons from David’s early life
In preparation for this conversation I scrolled through your CV, and in your early life you had a lot of accomplishments that I think make it fair to say you were a child prodigy — for example, you graduated from MIT at 16, and you were working on theoretical ventures that are extremely advanced at a very young age. One interesting question there is how you think about mentorship at an early age, where you might be mature and advanced in your technical skills but you don't have a lot of life experience with which to judge what you should do, or which direction is worth pursuing. How do you think you dealt with that — being technically mature without necessarily having a lot of wisdom or life experience for navigating everyday life?

Which is exactly the problem with early superintelligence, right? It'll have very strong technical capabilities but not necessarily wisdom. I don't think I worked around this. I worked on a lot of things, including mind uploading, that turned out not to be the most important things to work on. I did try to be pretty sensitive to what does seem like it's actually important to work on, and in some ways that was a disadvantage, because it led me to work on a lot of different things and kind of bounce around — or at least it looks that way to some people from the outside. But it did also enable me to end up landing on something that now feels very important and aligned with a meaningful purpose. I guess one way of taking your question is: what advice would I give to child prodigies today?

Yes — perhaps to near child prodigies, some of whom might be listening: you're very young, you're working in a technical field, and you're working with people who are maybe much older than you and have more experience. How do you navigate the tension between listening to authorities and developing your own ideas?

I think I do have an answer to this, which is: when there's something you're uncertain about, consider what observations you would have if it were true and what observations you would have if it were false. Those observations include the words that people around you say — and instead of taking those words literally and trying to determine whether the words are true or false, think about them as the output of a data-generating process, where there are incentive dynamics, psychological dynamics, cultural dynamics, and logical dynamics that are seeking truth. Ask how likely it is that you would hear these words if the thing were true, and how likely it is that you would hear these words if it were false, and seek out observations that would be very different in worlds where the thing is true and worlds where it is false — rather than seeking out a source of truth that would literally tell you, in words, whether it's true or false. You seek out observations — which might not be words; they might be data, papers, capital flows — that help you distinguish between the worlds where it's true and the worlds where it's false.

That seems like great advice in general, but it also seems to imply a lot of cognitive overhead — though perhaps if we're talking about child prodigies, or near child prodigies, it makes sense.
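In symbols, this is just likelihood-ratio reasoning — my framing of the advice in standard Bayesian form, not David's wording:

```latex
% Treat what you observe (including what people say) as evidence O about a claim H,
% and update on how likely O is in worlds where H is true versus false:
\[
  \frac{P(H \mid O)}{P(\lnot H \mid O)}
  \;=\;
  \frac{P(O \mid H)}{P(O \mid \lnot H)}
  \times
  \frac{P(H)}{P(\lnot H)},
\]
% so the observations worth seeking are those whose likelihood ratio
% $P(O \mid H)/P(O \mid \lnot H)$ is far from 1.
```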
David, thanks a lot for talking to me — this has been great.

Yeah, thank you for having me.