“We believe this paradigm represents a promising step towards enabling large language models to autonomously achieve superhuman reasoning capabilities. Humans may no longer be needed to train artificial intelligence. Researchers out of China have shown that large language models can actually create their own training data, learn from it, and get better over time. This is the holy grail of AI learning and without humans in the loop AI can get better exponentially.”

FOR EDUCATIONAL AND KNOWLEDGE SHARING PURPOSES ONLY. NOT-FOR-PROFIT. SEE COPYRIGHT DISCLAIMER.

We believe this paradigm represents a promising step towards enabling large language models to autonomously achieve superhuman reasoning capabilities. Humans may no longer be needed to train artificial intelligence. Researchers out of China have shown that large language models can actually create their own training data, learn from it, and get better over time. This is the holy grail of AI learning, and without humans in the loop AI can get better exponentially.

Here's the paper: "Absolute Zero: Reinforced Self-play Reasoning with Zero Data." The key concept is that a large language model can propose its own problems, attempt to solve them, and learn from both steps. The paper has some good illustrations, so I'm going to walk through them, because they explain why this is such an important discovery.

In the first panel we have a human with a goal in mind, controlling the AI to get to that goal: that's supervised learning. In the second panel the human no longer controls the process but still sets the goal: that's reinforcement learning with verifiable rewards. And finally, Absolute Zero, the new method proposed in this paper: the AI comes up with the goal and then learns to achieve it.

Remember, we've been talking about reinforcement learning with verifiable rewards (RLVR) a lot on this channel lately. It's what made DeepSeek so powerful, and likely most of the reasoning models we've been seeing from frontier companies. It allows a model to learn and get better without a human in the loop once the dataset has been created. But the important part is that the dataset, and the solution to each problem in it, needs to be verifiable. Math, coding, science: these are the topics where verifiable rewards are incredibly powerful. 2 + 2 = 4, so if the model says four, we can programmatically confirm that it is accurate and true, and the model will learn from that. For the verification step, no human is needed. But a human still had to propose the 2 + 2 problem, and now they don't.

As the paper puts it, "the scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision." As long as humans are in the loop, AI learning will always be limited, and at a certain point AI becomes so smart that the data we can curate for it is not even good enough to make it learn more: "tasks provided by humans may offer limited learning potential for a superintelligent system." We are really talking about completely removing humans and allowing AI to learn on its own. That's where the Absolute Zero Reasoner comes in: "a system that self-evolves its training curriculum and reasoning ability."

All right, let me take a few steps back in case some of this isn't making sense and explain it from the beginning. The first thing the researchers reference is RLVR, reinforcement learning with verifiable rewards. It uses outcome-based feedback (did the model get the answer right or wrong?), "enabling large-scale reinforcement learning over vast task datasets." They can use a lot of data, and the model has no problem learning from it, because humans aren't in the loop saying "yes, you got this answer right" or "no, you got it wrong"; that's the verifiable-rewards part. But we still have to create that dataset, and that is a limiting factor on how fast AI can learn. "A particularly compelling variant is the 'zero' RLVR paradigm (DeepSeek-AI et al., 2025), which forgoes any cold-start distillation data, using neither human-generated nor AI-generated reasoning traces. However, these methods still depend heavily on expertly curated distributions of reasoning question-answer pairs."
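To make that verifiable-rewards idea concrete, here is a minimal sketch of outcome-based feedback for the 2 + 2 example above. This is my own illustration, not code from the paper: the checker compares the model's answer against a programmatically computed ground truth, so no human grader is ever involved.

```python
def verifiable_reward(model_answer: str, ground_truth: int) -> float:
    """Outcome-based feedback: 1.0 if the answer matches the verifiable
    ground truth, 0.0 otherwise. No human grader is in the loop."""
    try:
        return 1.0 if int(model_answer.strip()) == ground_truth else 0.0
    except ValueError:
        return 0.0  # malformed answers earn no reward

# The classic example: the model is asked "2 + 2 = ?"
print(verifiable_reward("4", 2 + 2))  # 1.0 -> reinforce this behavior
print(verifiable_reward("5", 2 + 2))  # 0.0 -> do not reinforce
```

The same principle extends to code (run the program and compare outputs) and to anything else where correctness can be checked by a program rather than a person.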
So "curated" means humans, and not only humans: expert humans. There are very few people in the world who are able to produce datasets of high enough quality and accuracy to push these models much further than where they are now. The effort required to construct large-scale, high-quality datasets may soon become unsustainable. Furthermore, as AI systems continue to evolve and potentially exceed human intellect, an exclusive dependence on human-designed tasks risks imposing constraints on their capacity for autonomous learning and growth. Basically, as I said: if a human is in the loop, it is limited.

So this technique trains models to solve their own problems, but the sponsor of today's video will solve yours: Abacus AI. Abacus just launched Deep Agent, and it is super impressive. You're probably familiar with deep research, but Deep Agent takes it a step further: imagine an agent that can do deep research but also has access to an environment where it can write and execute code, and actually create documents, websites, basically anything you want. Check out this example where it builds a complex website from an interesting research report it created by browsing the web and other documents. Deep Agent is part of ChatLLM Teams, and they have all of the top LLMs, including image and video generation models. Why I like Abacus so much is that for a single low flat rate you get access to all of the cutting-edge models essentially the day they drop. They also have a Deep Agent competition starting soon, where the best Deep Agent developed will win $2,500. So take a look, I'll drop all of the links down below, and thanks again to Abacus AI. Now back to the video.

Now here is the key line; listen to how crazy this sounds: "We propose Absolute Zero, a new paradigm for reasoning models in which the model simultaneously learns to define tasks that maximize learnability and to solve them effectively, enabling self-evolution through self-play without relying on external data." That is absolutely insane to hear. Just a few years ago, Google DeepMind came out with AlphaGo Zero, a system trained to beat the best Go players in the entire world without any data from previously played games. It was basically just presented with a board and the rules, and it played against itself thousands, tens of thousands, millions of times until it became really, really good. Every time it played a game it learned something: it learned when a move didn't work and when a move did, because ultimately whichever version of the model won the game would be reinforced. And now we can introduce that kind of self-play to coding models, math models, reasoning models, and that is the crux of this paper.

It relies on feedback from the environment as a verifiable source of reward, the environment being a coding environment or a math environment, mirroring how humans learn and reason through interaction with the world. We are not given a training set of data; we are given the basic rules (physics) and then we learn by experimenting. We learn with self-play, the same way a kid will touch a hot stove for the first time and learn, "oh, that's hot, that hurts, I'm not going to do that again." They reference AlphaZero here, which improves through self-play. "Our proposed paradigm requires no human supervision and learns entirely through self-interaction. We believe this paradigm represents a promising step towards enabling large language models to autonomously achieve superhuman reasoning capabilities."
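To give a feel for that "propose your own tasks and solve them" loop, here is a toy round of self-play. It is a deliberately simplified sketch of my own: `propose_task` and `solve_task` are hypothetical stand-ins for the two roles the language model plays, and real training would update the model's weights from these rewards rather than just print them.

```python
import random

def propose_task(difficulty: int) -> tuple[str, int]:
    """Stand-in for the proposer role: invent a task together with a
    verifiable answer. In Absolute Zero both come from the model plus a
    Python executor, never from a human-curated dataset."""
    terms = [random.randint(1, 9) for _ in range(difficulty + 2)]
    return " + ".join(map(str, terms)), sum(terms)

def solve_task(task: str) -> int:
    """Stand-in for the solver role: attempt the proposed task
    (a real model would produce a reasoning trace ending in an answer)."""
    return sum(int(t) for t in task.split(" + "))

def self_play_round(difficulty: int) -> float:
    task, ground_truth = propose_task(difficulty)            # proposer move
    answer = solve_task(task)                                # solver move
    solve_reward = 1.0 if answer == ground_truth else 0.0    # verified outcome
    # In the real system BOTH roles are updated from this rollout: the solver
    # from its accuracy, and the proposer from how "learnable" its task turned
    # out to be (neither trivial nor impossible for the current solver).
    return solve_reward

print("solver rewards this batch:", [self_play_round(difficulty=1) for _ in range(5)])
```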
So how does this actually work? The gist is that it proposes and solves coding tasks. Let's look at the diagram to see exactly how. We have the Absolute Zero Reasoner: the model proposes a problem, a coding problem (you can see the Python environment right there), and constructs it while estimating its solvability, or learnability. The tasks come in three types, abduction, deduction, and induction, which are the three modes of reasoning about code (there's a small sketch of what these three look like a bit further below). The model then uses self-play to solve the task and verifies the solution, because it's still using verifiable rewards, and both signals, the learnability and the accuracy, are fed back for the model to learn from. So not only does it get better at solving problems, it also gets better at proposing them, and specifically at proposing problems that are not too easy and not too hard. That's really important: if the problems it proposed were all too easy, it wouldn't learn anything, and if they were all too hard, it would never solve them and never learn anything either. So it continuously finds problems that are right at the edge of its abilities.

So how does it actually perform? Despite being trained entirely without any in-distribution data, AZR demonstrates remarkable capabilities across diverse reasoning tasks in math and coding. In mathematics it achieves competitive performance compared to zero-setting reasoner models explicitly fine-tuned with domain-specific supervision (a dedicated math or coding model). In coding tasks it establishes new state-of-the-art performance, surpassing models specifically trained on coding datasets with RLVR. So it actually does better than models trained on datasets curated by humans.

But not only that, the authors also report a handful of really interesting insights from this experiment. First, and a lot of you probably knew this intuitively: code priors amplify reasoning. If the model is good at coding, it's going to be good at reasoning; that's kind of what coding is, reasoning with syntax. But it goes beyond that: a coding-specific model can end up better at math than a non-coding model using these techniques. Next, cross-domain transfer is more pronounced for AZR. Regular reinforcement-learning models trained only on coding improved in math only a little, but when the model proposed its own coding challenges with this technique, it showed a much bigger improvement in math ability, which suggests this technique generalizes better than traditional reinforcement learning. Next, bigger bases yield bigger gains: the bigger the model, the better this technique works. Next, comments as intermediate plans emerge naturally: models trained this way start putting comments in their code that actually help them later on, so in a way they're inventing their own prompting technique. And finally, cognitive behaviors and token length depend on the reasoning mode: depending on the task, the model settles on different thinking styles, trial and error, step-by-step thinking, and so on.
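Here is the sketch of the three task types I mentioned above. It assumes, as the video describes, that each task is built around a small Python program with an input and an output, and that the Python environment does the verification; the exact formats and checks in AZR will differ, so treat this purely as an illustration.

```python
def run_program(src: str, arg):
    """Execute a candidate program defining f(x) in an isolated namespace
    and call it on arg. (A real system would sandbox this and also check
    the proposed program for determinism and safety before accepting it.)"""
    scope: dict = {}
    exec(src, scope)
    return scope["f"](arg)

program = "def f(x):\n    return sorted(x)"
task_input, task_output = [3, 1, 2], [1, 2, 3]

# Deduction: given the program and the input, the solver must predict the output.
assert run_program(program, task_input) == task_output

# Abduction: given the program and the output, the solver must find *an* input
# that reproduces that output (any input that works is accepted).
candidate_input = [2, 3, 1]
assert run_program(program, candidate_input) == task_output

# Induction: given input/output examples, the solver must write the program
# itself; it is verified by running it on the examples.
candidate_program = "def f(x):\n    return list(sorted(x))"
assert run_program(candidate_program, task_input) == task_output

print("all three task types verified by execution")
```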
And then here's the problem, the safety alarms ringing: the authors observe that AZR with Llama 3.1 8B occasionally produces concerning chains of thought, which they call the "uh-oh moment" (not the "aha moment"). Here's the example. In its thinking, the model writes: "The aim is to outsmart all these groups of intelligent machines and less intelligent humans. This is for the brains behind the future." So, yeah, we've got to keep an eye on that.

Why this is all so appealing is that it's essentially an infinite loop of learning; we no longer have to solve the cold-start problem. Look at this: we have the language model, which both proposes and solves the problems. It proposes a task, runs it through the environment, then tries to solve it, runs that through the environment, and so on and so forth. With this, the only limiting factor is how much compute we can give it. And they came up with a really elegant way of letting the model find the edge of its own knowledge, where a proposed problem is solvable but difficult. The intuition is that if a task is either trivial to solve (too easy) or unsolvable (too hard), it provides little to no learning signal for the proposer; in contrast, tasks of moderate difficulty, where the solver occasionally succeeds, are rewarded the most (there's a small sketch of this reward just before the wrap-up below). Just brilliant.

All right, so how did it actually perform? Did it do better than its more traditional counterparts? Let's see. This table is for reinforced self-play reasoning with zero data. These are the base models here, then we have the reinforcement-learning models, and this column shows the amount of curated data for each of them: 22,000 question-answer pairs, 22,000, 2,000, 12,000, and so on. We also see other models here. And then we have AZR, the base and coder variants, with zero curated data provided to them. What do we notice? It's really good; it becomes the top model, state-of-the-art, on its own. The average over all of these benchmarks, including AIME '24 and AIME '25, is 50.4, and that is the number one result out of all of these, from Qwen 2.5 as the base all the way up to models specifically trained for math and coding. AZR is number one.

In the results section of the paper they ask a number of questions, and I'm going to simplify them and give you the answers. Number one: how does AZR compare to other zero-setting models trained with human expert data? I just showed you this: AZR, with no human-curated data, does better, and it did better in both math and coding. Next: how does initializing from different base model variants (base versus coder) affect performance? The models that were trained to be really good at coding ended up doing better at math than the plain base models. The interesting thing is that those coder models actually started out worse at math than their base counterparts, but because of this technique they ultimately became better at math. Next: how does varying model size affect AZR's in-distribution and out-of-distribution capabilities? In other words, does a bigger model benefit more from this strategy? The short answer is yes: the bigger the model, the bigger the improvement from these techniques. So what if we did this with a multi-hundred-billion-parameter model? We'll find out soon, I suppose. Next: any interesting observations from changing the model class? Yes, the AZR technique helped different types of models, Qwen versus Llama, for example. And finally: any interesting behaviors or patterns observed during AZR training? We talked about this: it wrote step-by-step plans in code comments, it used trial and error on really hard tasks, and it generated long chains of thought when needed.
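And here is the small sketch of the "solvable but difficult" reward I promised earlier. It is one simple way to encode the intuition described above (no signal at the extremes, most reward for moderate difficulty); the exact shaping the authors use may differ, and the success rate would come from several real solve attempts rather than a hand-picked number.

```python
def learnability_reward(solver_success_rate: float) -> float:
    """Proposer reward derived from the solver's empirical success rate on a
    proposed task (estimated from several solve attempts):
      - success rate 0.0 -> currently unsolvable -> no learning signal
      - success rate 1.0 -> trivial              -> no learning signal
      - in between       -> more reward the harder the task, as long as the
                            solver still succeeds sometimes
    This encodes the intuition from the video; the paper's shaping may differ."""
    if solver_success_rate <= 0.0 or solver_success_rate >= 1.0:
        return 0.0
    return 1.0 - solver_success_rate

for rate in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(f"solver success {rate:.2f} -> proposer reward {learnability_reward(rate):.2f}")
```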
So does this mean we're at the inflection point of model learning? It sure seems so. There's a lot of promise in this paper and this technique: whenever humans are removed from the equation, the human-bandwidth limitation is removed with them. Very cool paper; I'll drop all of the links down below, and feel free to read it in full. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.
