“At the rate that the models are improving, at the rate that the scaffolding, the agentic frameworks around the models, are improving, we’re going to get there pretty soon. And once that happens, that’s the intelligence explosion.” – Matthew Berman
OpenAI just released a paper showing that AI agents can replicate cutting-edge AI research, which, taken to its natural conclusion, means agents will soon be able to self-improve. This is the paper: "PaperBench: Evaluating AI's Ability to Replicate AI Research." Let's go straight to the introduction: "AI agents that can autonomously replicate ML research papers could accelerate machine learning progress." Imagine this: AI agents not just able to replicate other papers' results, but actually able to discover new innovations in machine learning, apply them to themselves, and then iterate infinitely on that. That is what Leopold Aschenbrenner called the intelligence explosion, and at that point, when AI is able to discover ways to self-improve, the capabilities of these models are just going to absolutely explode. They kind of mention that here: it's exciting, but it also warrants careful study to ensure AI capabilities are developed safely, and that is what PaperBench is made to do.

You could think of PaperBench as an agentic framework meant to take a research paper and write and execute code to replicate its results. These AI agents have access to the web, to bash terminals, to Python environments, and to pagination through the papers themselves. If you've used Manus AI, this should feel very familiar. They go on to say, "we present the agent with the paper content and ask it to replicate the paper's empirical contributions." So, simply: here's the paper, you have all of these tools, you can browse the web, you can write and execute code, now go replicate the results. That might sound straightforward, but actually doing it, and then measuring it in practice, is very difficult, which I'll explain shortly.

So what does actually replicating a paper mean? Replication involves understanding the paper, developing a codebase from scratch to implement all the experiments, and running, monitoring, and troubleshooting those experiments as needed. In general, each replication task is highly challenging and takes human experts several days of work at a minimum, and now we have agents that can do it in just a few hours. The benchmark they put together, PaperBench, consists of 20 recent research papers in the field of machine learning. They span 12 different topics from ICML (the International Conference on Machine Learning) and include deep reinforcement learning, robustness, and probabilistic methods. Each paper in PaperBench is accompanied by a manually created rubric, and they actually worked with each paper's authors to create it, so they are truly ensuring the quality and effectiveness of the benchmark. Those rubrics specify, in detail, all the outcomes necessary for replicating the paper. Now here's the cool part, which I mentioned: each rubric in PaperBench has been co-developed with one of the original authors of the paper to ensure that it is high quality and accurate in assessing replication. Grading these machine learning papers is not easy; it takes a human expert tens of hours to grade just a single paper. So they developed an LLM judge to do just that, and they also developed evaluation criteria for the judge, so they really thought through all the different angles here and are running evaluations against every piece of this framework: "we explore LLM-based judges and introduce an auxiliary evaluation, JudgeEval, which compares the outputs of automated judges against a dataset of gold labels from human expert judges."
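To make the JudgeEval idea concrete, here is a minimal sketch, entirely my own illustration rather than code from the paper, of comparing an automated judge's pass/fail verdicts on rubric requirements against human gold labels and computing an F1 score:

```python
# Hypothetical illustration of JudgeEval-style scoring: compare an automated
# judge's binary verdicts on rubric requirements against human "gold" labels.

def f1_score(gold: list[bool], predicted: list[bool]) -> float:
    """Standard F1 over binary pass/fail judgments."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: six rubric leaf requirements graded by a human expert (gold)
# and by an LLM judge (predicted).
gold      = [True, True, False, True, False, True]
predicted = [True, False, False, True, False, True]
print(f"Judge F1 vs. gold labels: {f1_score(gold, predicted):.2f}")  # 0.86
```

An F1 of 1.0 would mean the automated judge's "satisfied" verdicts line up perfectly with the human expert's, which is the sense in which a high score makes the judge a reasonable stand-in.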
Now, you would think that since OpenAI is publishing this paper, their model did the best. Well, not necessarily. They tested a bunch of different models with this framework; they basically made the framework model-agnostic, plugged in different models, and saw how they performed. One thing I've been saying for a while is that the raw intelligence in these models is incredible, and most likely sufficient for the majority of tasks humans can possibly imagine. What really takes them from raw intelligence to something agentic that can accomplish real-world tasks is the scaffolding around them, that scaffolding also being known as agents or agentic frameworks: giving them tools, giving them the ability to write and execute code, search the web, and have memory, essentially all the deterministic code you write around the raw LLMs. So: "our best LLM-based judge uses o3-mini-high with custom scaffolding" (and just remember, anytime you hear "custom scaffolding," think "agent") "and it achieves an F1 score of 0.83 on the auxiliary evaluation, suggesting that this judge is a reasonable stand-in for a human judge."

And thanks to the sponsor of this segment, Growth School. 2025 is a critical year: we have agents entering the workforce and AI infiltrating basically every aspect of our work. A lot of people think AI is going to take their jobs, but I say other humans using AI will take your jobs. So to become hyper-productive, you need to learn how to use AI. If you're watching this channel, you're already ahead of the pack, but you can go even further. A great way to learn cutting-edge AI skills is through Growth School's courses. Growth School is offering a three-hour, hands-on AI training, and they will teach you how to use over 25 different AI tools. This is the way to be the star of your company; whether you're in finance, sales, HR, or recruiting, you should be learning AI, and you can do so with Growth School. Growth School has helped over a million people upskill across the globe. This is normally a paid training, but right now, for the first thousand signups, it will be free through my link in the description below. So check out Growth School, and thanks again to them for sponsoring this segment. Now back to the video.

All right, so which model was the best? Well, Anthropic's Claude 3.5 Sonnet was actually the best-performing model in this paper. They tried to test other models, including Claude 3.7, but that wasn't successful, and I'll tell you the reason in a minute. They did not test Gemini 2.5 Pro, and I would be curious to see how it performed, because in my experience Gemini 2.5 Pro is just the best coding AI out there by far.

All right, let's look at what the actual workflow looks like. In the pink section of the figure, this is what PaperBench actually does: we have the paper, which is given to the agent, and if you look at this, these are the agent's thoughts and tool usage: reading the paper, writing the initial code implementation, running experiments with v1 parameters. Again, I'm going to reference Manus: if you've used Manus, this should look very familiar; this is kind of exactly what it does. It spins up an environment, it can write to and read from files in that environment, and it can write and execute code in that environment. So this is very similar to that.
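To give a feel for what that scaffolding amounts to, here is a heavily simplified sketch of an agent loop of this general kind. The call_llm function and the JSON action format are placeholders I'm assuming for illustration; this is not OpenAI's actual scaffold.

```python
# Minimal sketch of an agentic scaffold: the model sees the paper and its own
# tool history, then either runs a shell command or declares it is finished.
# `call_llm` is a stand-in for whatever chat-completion API you use.
import json
import subprocess

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its raw text reply,
    expected to be JSON like {"action": "bash", "command": "..."} or
    {"action": "finish"}."""
    raise NotImplementedError

def run_bash(command: str, workdir: str) -> str:
    """Execute a shell command in the agent's workspace and capture its output."""
    result = subprocess.run(command, shell=True, cwd=workdir,
                            capture_output=True, text=True, timeout=600)
    return result.stdout + result.stderr

def agent_loop(paper_text: str, workdir: str, max_steps: int = 50) -> None:
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (
            "You are replicating the following ML paper from scratch.\n"
            f"{paper_text}\n\nPrevious tool outputs:\n" + "\n".join(history) +
            '\nReply with JSON: {"action": "bash", "command": ...} '
            'or {"action": "finish"}.'
        )
        reply = json.loads(call_llm(prompt))
        if reply["action"] == "finish":
            break
        history.append(run_bash(reply["command"], workdir))
```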
This kind of workflow is why I'm so excited about the recent unlocks in reinforcement learning that continue to make these models better at coding. Coding models seem like the most important type of model out there right now, because if you can write code you can basically do anything, and because code execution is a verifiable reward: if you write code and you have a certain outcome you're looking for, you can verify it easily, and that verification is the reward signal needed to make the model better at writing code. It doesn't need a human in the loop, which means the reward feedback loop is essentially infinitely scalable. So as the agent is writing all this code, you can actually see it writing all of the files, and then it runs them: it has this reproduce.sh script, which runs all the files and tries to replicate the results of the paper. Then the submission is given to the judge for grading, and I'm going to go over how it gets graded; it's not simply pass or fail, it actually has a pretty sophisticated grading mechanism, which I'll describe in a moment. Then we have the rubric from the human co-author of the paper as the basis for judging, and then we have the final score.

Now, the headline number: Claude 3.5 Sonnet (New) with a simple agentic scaffold receives a score of 21% on PaperBench. Not great, but for a first iteration, still pretty good. On a three-paper subset, the human baseline of machine learning PhDs (best of three attempts) achieved 41.4% after 48 hours of effort, compared to 26.6% achieved by o1 on the same subset. Very interesting: o1 is already at roughly two-thirds of the human PhDs' score, and again, I've said this and I'll say it again, this is the worst AI will ever be.

All right, so how do they actually do it? What is the task? Let's read. The agent being evaluated is provided with the paper (the raw research paper) and an addendum of clarifications to the paper. The candidate must produce a submission consisting of a repository that includes all the code required to reproduce the paper's empirical results, with a file called reproduce.sh at its root; that is the entry point for executing all the code necessary to reproduce the results of the paper. Now here's the important part: remember, the agent has access to web search, but it shouldn't be able to just go to the paper authors' website and download their code, because the authors usually publish their code there. So there's actually a blacklist: for each of the PaperBench research papers there's a blacklist of websites the agent is not allowed to visit. The agent is allowed to browse the web, but it can't just download the existing research paper's code. This ensures that we are measuring an agent's ability to code and execute complex experiments from scratch, rather than its ability to use existing research code.

Now let's talk about some of the details of the actual reproduction. A submission is only considered to have replicated a result when that result is reproduced by running the submission in a fresh setup. So not just running the code: they actually copy the code into a new environment and run it fresh. Here are a few details about that environment: they copy the submission to a fresh VM running an Ubuntu 24.04 image with access to an A10 GPU, execute the submission's reproduction script to generate results from a clean start, and refer to the resulting updated submission folder as the executed submission. That's how the reproduction occurs.
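As a rough picture of that fresh-setup step (the real harness uses a new Ubuntu 24.04 VM with an A10 GPU; here I'm just copying into a clean temporary directory as a simplified local stand-in), the reproduction phase might look something like this:

```python
# Simplified stand-in for the reproduction step: copy the submission to a
# clean directory and run its reproduce.sh entry point from scratch.
# (The real setup uses a fresh Ubuntu 24.04 VM with an A10 GPU.)
import shutil
import subprocess
import tempfile
from pathlib import Path

def run_reproduction(submission_dir: str, timeout_hours: float = 12.0) -> Path:
    fresh = Path(tempfile.mkdtemp(prefix="executed_submission_"))
    shutil.copytree(submission_dir, fresh, dirs_exist_ok=True)

    script = fresh / "reproduce.sh"
    if not script.exists():
        raise FileNotFoundError("submission has no reproduce.sh at its root")

    # Run the entry point from a clean start and capture its logs.
    with open(fresh / "reproduce.log", "w") as log:
        subprocess.run(["bash", "reproduce.sh"], cwd=fresh,
                       stdout=log, stderr=subprocess.STDOUT,
                       timeout=timeout_hours * 3600)
    return fresh  # the "executed submission" that gets handed to the judge
```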
Now, how does the grading occur? That is probably the most difficult part of this entire thing: how do you actually know whether the reproduction is accurate, how accurate it is, and so on. So let's read. Remember, each paper in the benchmark has an accompanying rubric that specifies the assessment criteria for complete paper replication, and those assessment criteria are written with one of the main co-authors of the paper, so we know they're going to be accurate.

Let me explain how they actually grade the submissions. It looks like a tree: at the very top we have the overall result, and as we continue down the tree through the nodes, it grades more and more granular parts of the paper. Rather than just saying "yes, the agent was able to reproduce the paper" or "no, it wasn't," it asks: were they able to reproduce this one part, and then the larger part containing it, and then the larger part containing that? It's subset within subset within subset. Each piece is graded, the child scores are averaged, and that average is given to the parent; the parent then rolls up to its parent, and so on, until the top node is given the overall average completion score, in this example 0.55.

Now let's talk about the specifics of the assessment. Each leaf node has one of three possible requirement types. A Result Match leaf node assesses whether the executed submission contains evidence of replicating a particular result from the paper. An Execution leaf node assesses whether some particular execution result occurred when running the reproduce.sh script, so we're testing that the code actually runs. And a Code Development leaf node assesses whether the candidate's source code appears to contain a correct implementation of some requirement. So we are testing the results of the submission, testing its execution, and testing the underlying code.

Why did OpenAI decide to split it up like that? Couldn't they have just tested the whole thing to see if it was accurate or not? Well, it would be possible to have a rubric consisting solely of Result Match nodes, since matching results replicates the paper by definition; however, they include Execution and Code Development nodes to award partial credit towards achieving results, thus ensuring that agent performance on PaperBench improves incrementally. That's really important. It reminds me of process-based rewards versus outcome-based rewards. Remember, an outcome-based reward rewards a model for getting a solution either right or wrong; it doesn't matter if it got nine out of ten steps right and the final one wrong, the reward is not there, it's a fail. With process-based rewards, each step is rewarded separately, so if the model gets nine out of ten steps correct but the final one incorrect, the entire solution isn't correct, but it still got nine of those steps right, and it learns that those first nine steps were actually good and accurate, so it learns to do them again. Intuitively, I think that just makes a lot more sense. When we're grading human work, we typically say, "Hey, you got all of these steps right, but you got this one wrong, so let's improve that one, and keep doing the ones you got right." It's very human in that way, and that's what they did with PaperBench.
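Here is a small sketch of that tree scoring, my own reconstruction of the idea rather than OpenAI's code: leaves get a score from the judge, each parent averages its children, and the root's value is the overall replication score. (The real rubrics can also weight nodes; I use plain averages to keep it short.)

```python
# Sketch of hierarchical rubric scoring: leaves are graded by the judge,
# inner nodes average their children, and the root is the final score.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    requirement: str
    kind: str = "group"               # "result_match", "execution",
                                      # "code_development", or "group"
    score: float | None = None        # judge's verdict for leaves (0.0 or 1.0)
    children: list["RubricNode"] = field(default_factory=list)

    def grade(self) -> float:
        if not self.children:         # leaf: use the judge's verdict
            return self.score if self.score is not None else 0.0
        child_scores = [c.grade() for c in self.children]
        return sum(child_scores) / len(child_scores)   # unweighted average

# Toy rubric: one experiment with three leaf requirements of different types.
rubric = RubricNode("Replicate the paper", children=[
    RubricNode("Experiment 1", children=[
        RubricNode("Training loop implemented", "code_development", score=1.0),
        RubricNode("reproduce.sh runs training", "execution", score=1.0),
        RubricNode("Reported accuracy matches paper", "result_match", score=0.0),
    ]),
])
print(f"Replication score: {rubric.grade():.2f}")   # 0.67: partial credit
```

Notice how the agent earns 0.67 here even though the headline result was not matched, which is exactly the partial-credit behavior the Execution and Code Development nodes are there to provide.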
All right, now let's break down the rules, because there are rules the agents have to follow for a submission to be valid. Interestingly, PaperBench is designed to be agnostic to the agent scaffold: not only is PaperBench agnostic to the underlying LLM (you can plug in any LLM you want), it is also agnostic to the agent scaffolding, and there are no specific requirements for the agent's environment. First, the agent can browse the internet, but per-paper blacklists exist: the agent cannot just go download the authors' code and run that, because that's cheating; we want to see if the agent can reproduce it. The blacklist for each paper includes the authors' own code repository and any other online replications. Next, the resources available to the agent, such as runtime and compute, are not restricted in any way; you can use essentially any environment. And number three, developers should provide agents with API keys for necessary online services like Hugging Face, since obtaining access to online accounts is not part of the skill set PaperBench intends to assess. The agent being able to sign up for Hugging Face is not what they're testing, so they just say: give it the API keys it needs.

Now, running PaperBench is quite expensive. These agents can run for 12 hours, and that cost adds up really quickly; just imagine an agent churning through tokens for 12 hours, we're talking hundreds and hundreds of dollars. So they created PaperBench Code-Dev, which reduces the evaluation task to only code development, skipping the focus on executing the code to verify that the results are reproduced. This waives the need for the expensive GPU hardware typically required to run agent rollouts and the reproduction step in PaperBench, and it also reduces the cost of grading: with o3-mini as the judge, they find the cost of grading is reduced by about 85%. In the grand scheme of things, if we're talking about a thousand dollars to reproduce a paper, and maybe a few thousand or tens of thousands of dollars for an agent to eventually write its own paper, that seems trivial; any company would be willing to fund that if it meant self-improving artificial intelligence.

Now let's talk about the actual rubrics. As I mentioned, each of the rubrics was created with one of the main co-authors of the paper and took multiple weeks per paper to go from paper reading to initial creation, rubric review, iteration, and final sign-off. Given that they had 20 papers, imagine how much time that took; that is why they only have 20 papers. But they also go on to say the number of papers might not actually be all that important, and I'll get to that in a moment.

Let's talk about the LLM judges now, because they are an important aspect of PaperBench. Manual grading by human experts took on the order of tens of hours per paper, so they needed an automated way, and that's the LLM judge; importantly, we can expect the quality of automated judges to improve over time. So what does the LLM judge implementation look like? Let's read. For a specific leaf node, the judge is prompted with the markdown of the paper, the full rubric JSON, the leaf node's requirement, and the submission. Because the full submission is often too long to fit entirely within a model's context, they filter the codebase by having the judge rank the files by relevance and only include the top 10 files in its context, basically RAG. They estimate that the judge with o3-mini costs around $66 in OpenAI API credits to grade a single submission, which is very little for the value being created here, and for PaperBench Code-Dev the cost drops to around $10 per paper. The LLM judge is significantly cheaper and faster than hiring an expert human for grading.
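Here is a rough sketch of that context-filtering step, again my own illustration with placeholder function names rather than the actual SimpleJudge code: ask the judge model to rank the submission's files by relevance to the requirement, keep only the top 10, and grade with just those in context.

```python
# Sketch of the judge's context-filtering step: the codebase is usually too
# big for the model's context window, so ask the judge model to rank files
# by relevance to the requirement and keep only the top k.
from pathlib import Path

def call_judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judge model (e.g. o3-mini)."""
    raise NotImplementedError

def top_relevant_files(repo: str, requirement: str, k: int = 10) -> list[Path]:
    files = list(Path(repo).rglob("*.py"))
    listing = "\n".join(str(p.relative_to(repo)) for p in files)
    ranking = call_judge_llm(
        f"Requirement to grade:\n{requirement}\n\n"
        f"Files in the submission:\n{listing}\n\n"
        "List the file paths most relevant to this requirement, "
        "one per line, most relevant first."
    )
    ranked = [Path(repo) / line.strip()
              for line in ranking.splitlines() if line.strip()]
    return ranked[:k]

def grade_leaf(repo: str, paper_md: str, rubric_json: str, requirement: str) -> str:
    context = ""
    for path in top_relevant_files(repo, requirement):
        context += f"\n--- {path} ---\n{path.read_text(errors='ignore')}"
    return call_judge_llm(
        f"Paper:\n{paper_md}\n\nRubric:\n{rubric_json}\n\n"
        f"Requirement:\n{requirement}\n\nRelevant files:{context}\n\n"
        "Answer PASS or FAIL with a short justification."
    )
```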
All right, so what are the results? Let's see how all of this actually performed. They evaluated GPT-4o, o1, o3-mini, DeepSeek-R1, Claude 3.5 Sonnet (New), and Gemini 2.0 Flash, and you can probably tell the big ones missing: Gemini 2.5 Pro and Claude 3.7 Sonnet. They wanted to test Claude 3.7 Sonnet, but they couldn't because of rate limits. Let me read: "we wished to evaluate Claude 3.7 Sonnet but were unable to complete the experiments given rate limits with the Anthropic API." They give agents a maximum runtime of 12 hours, and they just kept running into rate limits, so they couldn't actually test 3.7. The most promising performance was Claude 3.5 Sonnet with 21%; OpenAI's o1 performs weaker at 13.2%, and the other models tested performed poorly, with scores under 10%.

Now, why did the other models perform poorly? With the exception of 3.5 Sonnet, the other models frequently finished early, claiming that they had either finished the entire replication or faced a problem they couldn't solve. All agents failed to strategize about how to best replicate the paper given the limited time available to them, and they observed that o3-mini frequently struggled with tool usage. I mean, these are all agentic problems: the agentic frameworks themselves will improve, and the models' ability to use tools will improve. That is why I'm so excited about some of these new models coming out with function calling and tool use trained in; every day that goes by, that type of ability becomes more important. They still say these failure modes suggest a weakness of current models in conducting long-horizon tasks. When I look at something like deep research, that makes me think a bit differently, but we're still a long way off from being able to give a task to an agent and have it just go do it, without any human intervention, over a long period of time. I have used Manus quite a bit; sometimes I have success, oftentimes it fails. Now here's the most important part: "we believe that further work on agentic scaffolds would lead to better results on PaperBench." That means the agent frameworks themselves need to improve, not the underlying intelligence, the LLMs.

So let's look at the average replication scores in Table 4: o3-mini-high at 2.6%, GPT-4o at 4.1%, all the way up to Claude 3.5 Sonnet at 21%, much better than any of the other models they tested. They also created something called the IterativeAgent, which essentially removes the agent's ability to finish early: if the agent said "okay, I'm done" or "I can't continue," they wrote prompts that pushed and encouraged the model to just keep working and keep figuring it out. With that, o3-mini-high got 8.5%, Claude 3.5 Sonnet got 16.1%, and o1 got 24.4%, and with an extended 36-hour limit o1 scored a couple of points higher. Very interesting: with just a little extra encouragement, "don't stop here, continue working," o1 and o3-mini improved drastically.
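That "keep working" trick is easy to picture. Here is a hedged sketch of the IterativeAgent idea, my paraphrase rather than OpenAI's actual prompts: whenever the model tries to stop before the time budget is used up, the scaffold replies with an encouragement to continue instead of ending the rollout.

```python
# Sketch of the IterativeAgent idea: don't let the model end its rollout early.
# If it declares it is finished before the time budget runs out, push back with
# a "keep working" message. `step_agent` is a placeholder for one agent step.
import time

CONTINUE_PROMPT = (
    "You are not done. Do not stop early. Re-check the paper, look for "
    "unimplemented experiments or failing scripts, and keep improving the "
    "replication until time runs out."
)

def step_agent(message: str) -> str:
    """Placeholder: run one agent step and return its reply
    (e.g. 'DONE' when the model wants to stop)."""
    raise NotImplementedError

def iterative_rollout(task_prompt: str, budget_hours: float = 12.0) -> None:
    deadline = time.time() + budget_hours * 3600
    message = task_prompt
    while time.time() < deadline:
        reply = step_agent(message)
        # Instead of accepting an early finish, nudge the model to continue.
        message = CONTINUE_PROMPT if reply.strip() == "DONE" else reply
```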
All right, so with all of that, what are some of the limitations they found? First, dataset size: PaperBench currently consists of only 20 papers and would ideally capture an even larger portion of the ML research community's output, but as I mentioned earlier, focusing on the number of papers can be misleading. Each rubric is composed of hundreds of nodes, so PaperBench evaluates agents on thousands of individual requirements; remember the tree, each node of which is an individual requirement these models are tested on. Contamination is also an issue: the original authors' codebases exist online, and although there's a blacklist, it's not completely crazy to think the models had some of these papers in their training data. Due to the recency of the papers relative to the models' knowledge cutoffs, they don't think the papers were actually used in the training data, but that might not always be the case for future models, so it's something to keep in mind. Creating these datasets is also very difficult: remember, you need the paper, you need the co-author to create the rubric, there's just a lot that goes into it, and it's extremely labor-intensive, requiring several full days of an expert human's time at a minimum. And the LLM-based judge is still not as accurate as a human judge, not yet anyway. Last is cost: they estimate that it costs on average $400 in API credits to run an o1 IterativeAgent 12-hour rollout on a single paper in PaperBench, so for 20 papers that's $8,000, and grading is an additional $66 per paper with the o3-mini SimpleJudge. So it's not cheap, but again, in the grand scheme of things, well worth the money.

So, the conclusion: it still has a way to go. These models are good, they can replicate some of it, but they're not quite there yet. But at the rate that the models are improving, at the rate that the scaffolding, the agentic frameworks around the models, are improving, we're going to get there pretty soon. And once that happens, that's the intelligence explosion.