“There’s an exam called ‘Humanity’s Last Exam’ and boy that is a dark name for a benchmark.” — Matthew Berman

Humanity’s Last Exam:

Intelligence Explosion: 3.3% to 26.6%

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

OpenAI just dropped Deep Research. This is their second agent, right after Operator, and it is a combination of the powerful o3 reasoning model with tools, most importantly the ability to search the web. Deep Research, given a topic, will, well, go do deep research on it. It will explore the web and come back with a fully cited research paper, similar to what a PhD researcher would provide, except it takes just minutes. And that’s not even the craziest part. Sam Altman just said that Deep Research is already capable of doing a single-digit percentage of all economically valuable work. While that may not sound like a lot, it actually equates to trillions of dollars of value. And if the term “economically valuable work” sounds familiar, that’s because that’s OpenAI’s own definition of AGI.

Now, before I get ahead of myself, let me break down Deep Research for you. Let’s start with a clip from the announcement. Mark Chen, head of research at OpenAI, makes a stunning proclamation: their ultimate vision for agents is to uncover new knowledge. Let’s watch.

“We think it’s important for our models to start doing autonomous tasks for much longer in an unsupervised way, and this is core to our AGI roadmap as well. I think our ultimate aspiration is a model that can uncover and discover new knowledge for itself, and the first step here is a model that can go and synthesize and understand the information on the web.”

Do you remember the Situational Awareness paper from Leopold Aschenbrenner? In it, he describes what he calls the intelligence explosion. It is the point at which we reach AGI and it is able to discover new knowledge and then apply that new knowledge to itself. This is also known as recursive self-improvement, and once that happens, shortly after we will reach artificial superintelligence. That’s because if an AI is able to discover ways to improve itself, and you can spin up tens, hundreds, thousands, millions of these agents at the same time, imagine how quickly, how
exponentially it is able to improve itself. And if this sounds like science fiction, if this sounds like something that is so far in the future, it isn’t. It’s here right now. In fact, I put out a tweet a few days ago showing that DeepSeek itself was able to discover a way to make itself two times faster. By the way, if you’re not following me on X, please do: Matthew Berman. Simon Willison made a post about how he basically just prompted DeepSeek to find a way to make itself two times faster, and it did. Pretty insane.

All right, now back to Deep Research. Let me show you it in action.

Deep Research is accessible from a button right here at the beginning of ChatGPT, and from here you can immediately put in any query and it’s going to send it off to Deep Research: “Help me find iOS and Android adoption rates, the percent of folks who want to learn another language, and the change in mobile penetration over the past couple of years, and give me that difference between the top developed countries and developing countries. I also really want this information in a formatted report with some tables and a clear recommendation on what the best emerging opportunities are for ChatGPT.” So this is a query that would have taken me hours to put together, but with Deep Research I can just immediately kick it off.

What you’ll first see is that Deep Research comes back with a set of clarifying questions. This is super important, because if Deep Research is going on for 5 to 30 minutes, you really want to get those requirements right. There are a couple of questions that it’s giving to us right now. These are really good questions that you’d expect an analyst to want to ask you when you’re giving them a really tough prompt, so it’s really important that you can capture these up front. The model is really good at taking information that’s sometimes underspecified and a little bit more open-ended and using that to go off on a mission and get all the information that you need. So you can see right
now Deep Research has taken all of that and synthesized it and started kicking off its own research process. What you’ll see over here is that Deep Research pops open a little sidebar, and it shows you all the reasoning that it’s doing. You can see right now it’s identifying the top countries, it’s gathering information, and it’s starting its process of searching for different information. Zooming in over here, you can see that Deep Research is searching for information, opening pages, and reasoning about what it’s seeing. Under the hood, what’s actually happening is that the model is conducting searches, quite literally opening and browsing the pages, looking through all the different components, including images, tables, and PDFs, pulling out all of that information, and using that to determine what it does next.

So imagine you have a PhD-plus-level researcher that you can send out on the web to do deep research on your behalf, and it comes back after a few minutes, or up to 30 minutes, I’ve heard, with a fully cited research paper. That’s what we just saw.

I want to thank the sponsor of this video, Chatbase. Chatbase is the ultimate platform for creating and managing intelligent AI agents for your business. They are designed to streamline customer support, generate leads, and enhance business operations. With integrations across websites, social platforms, and other digital tools, Chatbase helps businesses provide fast, consistent, responsive, and personalized interactions at scale, and it is powered by advanced AI models from OpenAI, Anthropic, and others. You can train these agents on your own unique data, so they’re customized for your business and know to give the right answers at the right time. They just launched a brand-new feature, AI Actions, and if you watch this channel at all, you know I’m all about agents taking actions as the future of agentic workflows. So now Chatbase allows these agents to not only respond to your customers on your behalf but actually take real-time
actions based on what they’re looking for. Some of the actions include getting real-time data from their accounts, or even allowing them to update things in their account based on the conversation that they’re having with that agent. So check out Chatbase today. Thank you again to Chatbase for sponsoring this video, and now back to the video.

Now let me skip ahead a little bit, because there was about a 15-minute delay after they kicked off this Deep Research agent until it finally came back with the results.

It actually looks like the task is still going on right now, but in the meantime, while we’ve kicked it off, it’s already looked at 29 different sources and gone through a lot of different information. Oh wow, okay, perfect. Great timing, incredible timing. So Deep Research just put together its full analysis. It took us 11 minutes, and in that process it looked at 29 different sites really in depth, and as you can see live on this livestream, it gave us a perfectly formatted report. We got a nice introduction and our different adoption trends, everything put together in a really great report style where you can see mobile penetration over time and a ton of different data. As you go down, you can see it not only has information over here but also different table formats and ways that it has presented the data that are super digestible. One of the other things that’s really cool about this model is that you’re able to click in and see all the different sources that it’s able to cite. Over here you can see every citation that the model encountered, and also different sites that it might have encountered that it didn’t necessarily put into the final output but wants to let you know it found along the way.

And I know I’ve said this already in the video, but that’s not even the craziest part. Deep Research is already changing people’s lives for the better. This is Felipe Millan, and he works on government go-to-market at OpenAI, and he
details a story in which his wife got cancer and he sent Deep Research out to basically figure out what the best chemotherapy is for that specific cancer, given her specific information: her age, her health, and other details. It came back with incredible results.

“Today we at OpenAI launched Deep Research, and I wanted to share a deeply personal story about how amazing this tool is and how it will change the world. Trigger warning: related to cancer. At the end of October, my wife was diagnosed with bilateral breast cancer. Overnight, our world was turned upside down. She had a double mastectomy in early December and started chemo later in the month. Our recent challenge was whether she could do radiation after chemo. For her specific case, it’s completely in the gray area. Even the specialists we consulted gave mixed opinions. No definitive answer. We felt stuck.”

Can you imagine the smartest doctors in the world giving you different opinions, and it’s up to you, somebody who is not an expert in cancer, to make that decision? How daunting a task, how stressful that must have been for them. But that’s where Deep Research came in.

“Because I had preview access to Deep Research, I decided to give it a shot. We uploaded her surgical pathology report and asked for guidance on whether radiation would be beneficial. What happened next was mind-blowing. It didn’t just confirm what our oncologist mentioned, it went deeper. It cited studies I’d never heard of and adapted when we added details like her age and genetic factors. We fact-checked each study. They were spot on.”

Now if that isn’t worth $200 a month (and by the way, it’s only on the Pro plan right now), I don’t know what is. Then he lists out the exact prompt to use, and it’s really only a few sentences long.

“I am still in awe of the report Deep Research gave us. We’re seeing another specialist soon, but we already feel more confident about our decision. This wasn’t just a tech demo, it was personal. It gave us peace of mind when we needed it most. We often talk internally at OpenAI about the moments when you feel the AGI, and this was one of them. This is going to change the world.”

So I wish the best for Felipe and his family, his wife, and this is just such an incredible story about how AI is already changing lives. If you aren’t already convinced about how quickly AI is changing the world right now, I hope this helps.

Now let me get into the nitty-gritty a little bit. Let me tell you about the benchmarks for Deep Research, because that in itself is pretty stunning. So how does it actually work? Remember, Deep Research is the o3 model, which isn’t even out yet. We have o3-mini, but this is the o3 model that has the ability to use tools, including going out and searching the web. It was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains. It learned to plan and execute a multi-step trajectory to find the data it needs, backtracking and reacting to real-time information when necessary. The model is also able to browse over user-uploaded files, plot and iterate on graphs using the Python tool, embed both generated graphs and images from websites in its response, and cite specific sentences or passages from its sources. And because of this training, it reaches new highs on a number of public evaluations. Unbelievable. Let’s take a look at that.

First, there’s an exam called Humanity’s Last Exam, and boy, that is a dark name for a benchmark, but here we are. Humanity’s Last Exam is an evaluation that tests AI across a broad range of subjects on expert-level questions. It consists of over 3,000 multiple-choice and short-answer questions across more than 100 subjects, from linguistics to rocket science, classics to ecology. GPT-4o scored 3.3%; that means it got 3.3% right on this test. Now let’s move up. Claude 3.5 Sonnet scored 4.3, pretty good. OpenAI o1, there we go, 9.1, really good. DeepSeek R1, actually really good, 9.4. Then we have o3-mini (medium) and o3-mini (high) getting 10.5 and 13. But here’s the big jump: OpenAI Deep Research, 26.6%, which is insane. That is roughly double what o3-mini (high) is able to achieve. That’s because it has the ability to go search the web. That is the key. That is why agents are so powerful: it has tools. The raw intelligence of the model, and even the reasoning ability of these models, is awesome, but it really becomes powerful when it can look up real-time information and reason about that information.

Now here is the craziest chart. This is the pass rate on expert-level tasks by estimated economic value. This is the measure of AGI according to OpenAI. Here’s the pass rate: on low estimated economic value it’s scoring just under 20%, at 19%; medium economic value, 17.7%; high economic value, 15%; very high, 9.1. So on very high economically valuable tasks it gets 9.1% right, which, yeah, that’s still low, but that is an incredibly important metric, and 9% is still crazy good.

All right, so who gets it? Well, Pro users first. That’s $200 a month, and they say right here it is very compute intensive. That’s because it’s mixing the biggest model that they’ve probably ever created plus reasoning, which again uses a ton of tokens, plus they’re probably storing all of this somewhere. So it’s very compute intensive. Pro users are limited to 100 queries per month. Plus and Team users will get access next, followed by Enterprise.

What an exciting time to be alive. Deep Research is so cool, and I think this is really the DeepSeek effect. DeepSeek drops something, all of a sudden Sam Altman gets a fire lit under him, and now they are releasing like crazy. They released o3-mini, and just a couple of days later they released Deep Research. And the thing is, that’s not even the “one more thing,” according to Sam Altman. We’ll see what that is soon. If you enjoyed this video, please consider giving a like and subscribe, and I’ll see you in the next one.
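
Editor's note: the Humanity's Last Exam comparison in the video is simple arithmetic, and a short sketch makes the size of the jump easier to check. The scores below are the ones quoted in the transcript; the model labels are normalized spellings and the helper names `ratio` and `point_gain` are illustrative assumptions, not anything published by OpenAI or the benchmark authors.

```python
# Humanity's Last Exam accuracy (%) as quoted in the transcript.
# Model labels are normalized spellings of the names spoken in the video.
HLE_SCORES = {
    "GPT-4o": 3.3,
    "Claude 3.5 Sonnet": 4.3,
    "OpenAI o1": 9.1,
    "DeepSeek R1": 9.4,
    "o3-mini (medium)": 10.5,
    "o3-mini (high)": 13.0,
    "OpenAI Deep Research": 26.6,
}


def ratio(model, baseline, scores=HLE_SCORES):
    """Return how many times higher `model` scores than `baseline`."""
    return scores[model] / scores[baseline]


def point_gain(model, baseline, scores=HLE_SCORES):
    """Return the difference between the two scores in percentage points."""
    return scores[model] - scores[baseline]


# Compare Deep Research against every other model in the list.
TOP = "OpenAI Deep Research"
for name in HLE_SCORES:
    if name == TOP:
        continue
    print(
        f"{TOP} vs {name}: "
        f"{ratio(TOP, name):.2f}x, +{point_gain(TOP, name):.1f} points"
    )
```

Run as-is, this shows Deep Research at a little over twice the o3-mini (high) score and roughly eight times the GPT-4o score, which is the "3.3% to 26.6%" jump in the title. These are ratios of quoted accuracy figures only; they say nothing about how the models were evaluated or whether the settings were comparable.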
