“We believe this paradigm represents a promising step towards enabling large language models to autonomously achieve superhuman reasoning capabilities. Humans may no longer be needed to train artificial intelligence. Researchers out of China have shown that large language models can actually create their own training data, learn from it, and get better over time. This is the holy grail of AI learning and without humans in the loop AI can get better exponentially.”

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

We believe this paradigm represents a promising step towards enabling large language models to autonomously achieve superhuman reasoning capabilities. Humans may no longer be needed to train artificial intelligence. Researchers out of China have shown that large language models can actually create their own training data, learn from it, and get better over time. This is the holy grail of AI learning, and without humans in the loop AI can get better exponentially.

Here's the paper: "Absolute Zero: Reinforced Self-play Reasoning with Zero Data." The key concept is that a large language model can propose its own problems, attempt to solve them, and learn from both steps. The paper has some good illustrations, so I'm going to walk through them, because they explain why this is such an important discovery.

In the first panel we have a human with a goal in mind, controlling the AI to get to that goal: that's supervised learning. In the second panel the human no longer controls the process but still sets the goal: that's reinforcement learning with verifiable rewards. And finally, Absolute Zero, the new method proposed in this paper: the AI comes up with the goal and then learns to achieve it.

Remember, we've been talking about reinforcement learning with verifiable rewards (RLVR) a lot on this channel lately. It's what made DeepSeek so powerful, and likely most of the reasoning models we've been seeing from frontier companies. It allows a model to learn and get better without a human in the loop once the dataset has been created. But the important part is that the dataset, and the solution to each problem in it, needs to be verifiable. Math, coding, science: these are the topics where verifiable rewards are incredibly powerful. 2 + 2 = 4, so if the model says four, we can programmatically confirm that it is accurate and true, and the model will learn from that. For the verification step, no human is needed. But a human still had to propose the 2 + 2 problem, and now they don't.

As the paper puts it, "the scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision." As long as humans are in the loop, AI learning will always be limited, and at a certain point AI becomes so smart that the data we can curate for it is not even good enough to make it learn more: "tasks provided by humans may offer limited learning potential for a superintelligent system." We are really talking about completely removing humans and allowing AI to learn on its own. That's where the Absolute Zero Reasoner comes in: "a system that self-evolves its training curriculum and reasoning ability."

All right, let me take a few steps back in case some of this isn't making sense and explain it from the beginning. The first thing the researchers reference is RLVR, reinforcement learning with verifiable rewards. It uses outcome-based feedback (did the model get the answer right or wrong?), "enabling large-scale reinforcement learning over vast task datasets." They can use a lot of data, and the model has no problem learning from it, because humans aren't in the loop saying "yes, you got this answer right" or "no, you got it wrong"; that's the verifiable-rewards part. But we still have to create that dataset, and that is a limiting factor on how fast AI can learn. "A particularly compelling variant is the 'zero' RLVR paradigm (DeepSeek-AI et al., 2025), which forgoes any cold-start distillation data, using neither human-generated nor AI-generated reasoning traces. However, these methods still depend heavily on expertly curated distributions of reasoning question-answer pairs."
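To make that verifiable-rewards idea concrete, here is a minimal sketch of outcome-based feedback for the 2 + 2 example above. This is my own illustration, not code from the paper: the checker compares the model's answer against a programmatically computed ground truth, so no human grader is ever involved.

```python
def verifiable_reward(model_answer: str, ground_truth: int) -> float:
    """Outcome-based feedback: 1.0 if the answer matches the verifiable
    ground truth, 0.0 otherwise. No human grader is in the loop."""
    try:
        return 1.0 if int(model_answer.strip()) == ground_truth else 0.0
    except ValueError:
        return 0.0  # malformed answers earn no reward

# The classic example: the model is asked "2 + 2 = ?"
print(verifiable_reward("4", 2 + 2))  # 1.0 -> reinforce this behavior
print(verifiable_reward("5", 2 + 2))  # 0.0 -> do not reinforce
```

The same principle extends to code (run the program and compare outputs) and to anything else where correctness can be checked by a program rather than a person.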
So "curated" means humans, and not only humans: expert humans. There are very few people in the world who are able to produce datasets of high enough quality and accuracy to push these models much further than where they are now. The effort required to construct large-scale, high-quality datasets may soon become unsustainable. Furthermore, as AI systems continue to evolve and potentially exceed human intellect, an exclusive dependence on human-designed tasks risks imposing constraints on their capacity for autonomous learning and growth. Basically, as I said: if a human is in the loop, it is limited.

So this technique trains models to solve their own problems, but the sponsor of today's video will solve yours: Abacus AI. Abacus just launched Deep Agent, and it is super impressive. You're probably familiar with deep research, but Deep Agent takes it a step further: imagine an agent that can do deep research but also has access to an environment where it can write and execute code, and actually create documents, websites, basically anything you want. Check out this example where it builds a complex website from an interesting research report it created by browsing the web and other documents. Deep Agent is part of ChatLLM Teams, and they have all of the top LLMs, including image and video generation models. Why I like Abacus so much is that for a single low flat rate you get access to all of the cutting-edge models essentially the day they drop. They also have a Deep Agent competition starting soon, where the best Deep Agent developed will win $2,500. So take a look, I'll drop all of the links down below, and thanks again to Abacus AI. Now back to the video.

Now here is the key line; listen to how crazy this sounds: "We propose Absolute Zero, a new paradigm for reasoning models in which the model simultaneously learns to define tasks that maximize learnability and to solve them effectively, enabling self-evolution through self-play without relying on external data." That is absolutely insane to hear. Just a few years ago, Google DeepMind came out with AlphaGo Zero, a system trained to beat the best Go players in the entire world without any data from previously played games. It was basically just presented with a board and the rules, and it played against itself thousands, tens of thousands, millions of times until it became really, really good. Every time it played a game it learned something: it learned when a move didn't work and when a move did, because ultimately whichever version of the model won the game would be reinforced. And now we can introduce that kind of self-play to coding models, math models, reasoning models, and that is the crux of this paper.

It relies on feedback from the environment as a verifiable source of reward, the environment being a coding environment or a math environment, mirroring how humans learn and reason through interaction with the world. We are not given a training set of data; we are given the basic rules (physics) and then we learn by experimenting. We learn with self-play, the same way a kid will touch a hot stove for the first time and learn, "oh, that's hot, that hurts, I'm not going to do that again." They reference AlphaZero here, which improves through self-play. "Our proposed paradigm requires no human supervision and learns entirely through self-interaction. We believe this paradigm represents a promising step towards enabling large language models to autonomously achieve superhuman reasoning capabilities."
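To give a feel for that "propose your own tasks and solve them" loop, here is a toy round of self-play. It is a deliberately simplified sketch of my own: `propose_task` and `solve_task` are hypothetical stand-ins for the two roles the language model plays, and real training would update the model's weights from these rewards rather than just print them.

```python
import random

def propose_task(difficulty: int) -> tuple[str, int]:
    """Stand-in for the proposer role: invent a task together with a
    verifiable answer. In Absolute Zero both come from the model plus a
    Python executor, never from a human-curated dataset."""
    terms = [random.randint(1, 9) for _ in range(difficulty + 2)]
    return " + ".join(map(str, terms)), sum(terms)

def solve_task(task: str) -> int:
    """Stand-in for the solver role: attempt the proposed task
    (a real model would produce a reasoning trace ending in an answer)."""
    return sum(int(t) for t in task.split(" + "))

def self_play_round(difficulty: int) -> float:
    task, ground_truth = propose_task(difficulty)            # proposer move
    answer = solve_task(task)                                # solver move
    solve_reward = 1.0 if answer == ground_truth else 0.0    # verified outcome
    # In the real system BOTH roles are updated from this rollout: the solver
    # from its accuracy, and the proposer from how "learnable" its task turned
    # out to be (neither trivial nor impossible for the current solver).
    return solve_reward

print("solver rewards this batch:", [self_play_round(difficulty=1) for _ in range(5)])
```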
So how does this actually work? The gist is that it proposes and solves coding tasks. Let's look at the diagram to see exactly how. We have the Absolute Zero Reasoner: the model proposes a problem, a coding problem (you can see the Python environment right there), and constructs it while estimating its solvability, or learnability. The tasks come in three types, abduction, deduction, and induction, which are the three modes of reasoning about code (there's a small sketch of what these three look like a bit further below). The model then uses self-play to solve the task and verifies the solution, because it's still using verifiable rewards, and both signals, the learnability and the accuracy, are fed back for the model to learn from. So not only does it get better at solving problems, it also gets better at proposing them, and specifically at proposing problems that are not too easy and not too hard. That's really important: if the problems it proposed were all too easy, it wouldn't learn anything, and if they were all too hard, it would never solve them and never learn anything either. So it continuously finds problems that are right at the edge of its abilities.

So how does it actually perform? Despite being trained entirely without any in-distribution data, AZR demonstrates remarkable capabilities across diverse reasoning tasks in math and coding. In mathematics it achieves competitive performance compared to zero-setting reasoner models explicitly fine-tuned with domain-specific supervision (a dedicated math or coding model). In coding tasks it establishes new state-of-the-art performance, surpassing models specifically trained on coding datasets with RLVR. So it actually does better than models trained on datasets curated by humans.

But not only that, the authors also report a handful of really interesting insights from this experiment. First, and a lot of you probably knew this intuitively: code priors amplify reasoning. If the model is good at coding, it's going to be good at reasoning; that's kind of what coding is, reasoning with syntax. But it goes beyond that: a coding-specific model can end up better at math than a non-coding model using these techniques. Next, cross-domain transfer is more pronounced for AZR. Regular reinforcement-learning models trained only on coding improved in math only a little, but when the model proposed its own coding challenges with this technique, it showed a much bigger improvement in math ability, which suggests this technique generalizes better than traditional reinforcement learning. Next, bigger bases yield bigger gains: the bigger the model, the better this technique works. Next, comments as intermediate plans emerge naturally: models trained this way start putting comments in their code that actually help them later on, so in a way they're inventing their own prompting technique. And finally, cognitive behaviors and token length depend on the reasoning mode: depending on the task, the model settles on different thinking styles, trial and error, step-by-step thinking, and so on.
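Here is the sketch of the three task types I mentioned above. It assumes, as the video describes, that each task is built around a small Python program with an input and an output, and that the Python environment does the verification; the exact formats and checks in AZR will differ, so treat this purely as an illustration.

```python
def run_program(src: str, arg):
    """Execute a candidate program defining f(x) in an isolated namespace
    and call it on arg. (A real system would sandbox this and also check
    the proposed program for determinism and safety before accepting it.)"""
    scope: dict = {}
    exec(src, scope)
    return scope["f"](arg)

program = "def f(x):\n    return sorted(x)"
task_input, task_output = [3, 1, 2], [1, 2, 3]

# Deduction: given the program and the input, the solver must predict the output.
assert run_program(program, task_input) == task_output

# Abduction: given the program and the output, the solver must find *an* input
# that reproduces that output (any input that works is accepted).
candidate_input = [2, 3, 1]
assert run_program(program, candidate_input) == task_output

# Induction: given input/output examples, the solver must write the program
# itself; it is verified by running it on the examples.
candidate_program = "def f(x):\n    return list(sorted(x))"
assert run_program(candidate_program, task_input) == task_output

print("all three task types verified by execution")
```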
And then here's the problem, the safety alarms ringing: the authors observe that AZR with Llama 3.1 8B occasionally produces concerning chains of thought, which they call the "uh-oh moment" (not the "aha moment"). Here's the example. In its thinking, the model writes: "The aim is to outsmart all these groups of intelligent machines and less intelligent humans. This is for the brains behind the future." So, yeah, we've got to keep an eye on that.

Why this is all so appealing is that it's essentially an infinite loop of learning; we no longer have to solve the cold-start problem. Look at this: we have the language model, which both proposes and solves the problems. It proposes a task, runs it through the environment, then tries to solve it, runs that through the environment, and so on and so forth. With this, the only limiting factor is how much compute we can give it. And they came up with a really elegant way of letting the model find the edge of its own knowledge, where a proposed problem is solvable but difficult. The intuition is that if a task is either trivial to solve (too easy) or unsolvable (too hard), it provides little to no learning signal for the proposer; in contrast, tasks of moderate difficulty, where the solver occasionally succeeds, are rewarded the most (there's a small sketch of this reward just before the wrap-up below). Just brilliant.

All right, so how did it actually perform? Did it do better than its more traditional counterparts? Let's see. This table is for reinforced self-play reasoning with zero data. These are the base models here, then we have the reinforcement-learning models, and this column shows the amount of curated data for each of them: 22,000 question-answer pairs, 22,000, 2,000, 12,000, and so on. We also see other models here. And then we have AZR, the base and coder variants, with zero curated data provided to them. What do we notice? It's really good; it becomes the top model, state-of-the-art, on its own. The average over all of these benchmarks, including AIME '24 and AIME '25, is 50.4, and that is the number one result out of all of these, from Qwen 2.5 as the base all the way up to models specifically trained for math and coding. AZR is number one.

In the results section of the paper they ask a number of questions, and I'm going to simplify them and give you the answers. Number one: how does AZR compare to other zero-setting models trained with human expert data? I just showed you this: AZR, with no human-curated data, does better, and it did better in both math and coding. Next: how does initializing from different base model variants (base versus coder) affect performance? The models that were trained to be really good at coding ended up doing better at math than the plain base models. The interesting thing is that those coder models actually started out worse at math than their base counterparts, but because of this technique they ultimately became better at math. Next: how does varying model size affect AZR's in-distribution and out-of-distribution capabilities? In other words, does a bigger model benefit more from this strategy? The short answer is yes: the bigger the model, the bigger the improvement from these techniques. So what if we did this with a multi-hundred-billion-parameter model? We'll find out soon, I suppose. Next: any interesting observations from changing the model class? Yes, the AZR technique helped different types of models, Qwen versus Llama, for example. And finally: any interesting behaviors or patterns observed during AZR training? We talked about this: it wrote step-by-step plans in code comments, it used trial and error on really hard tasks, and it generated long chains of thought when needed.
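And here is the small sketch of the "solvable but difficult" reward I promised earlier. It is one simple way to encode the intuition described above (no signal at the extremes, most reward for moderate difficulty); the exact shaping the authors use may differ, and the success rate would come from several real solve attempts rather than a hand-picked number.

```python
def learnability_reward(solver_success_rate: float) -> float:
    """Proposer reward derived from the solver's empirical success rate on a
    proposed task (estimated from several solve attempts):
      - success rate 0.0 -> currently unsolvable -> no learning signal
      - success rate 1.0 -> trivial              -> no learning signal
      - in between       -> more reward the harder the task, as long as the
                            solver still succeeds sometimes
    This encodes the intuition from the video; the paper's shaping may differ."""
    if solver_success_rate <= 0.0 or solver_success_rate >= 1.0:
        return 0.0
    return 1.0 - solver_success_rate

for rate in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(f"solver success {rate:.2f} -> proposer reward {learnability_reward(rate):.2f}")
```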
So does this mean we're at the inflection point of model learning? It sure seems so. There's a lot of promise in this paper and this technique: whenever humans are removed from the equation, the human-bandwidth limitation is removed with them. Very cool paper; I'll drop all of the links down below, and feel free to read it in full. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.
