Transcript: Why Self-Distillation Is Taking Over LLM Post-Training
Source: https://www.youtube.com/watch?v=OgEGV7apEzI
Channel: Deep Learning with Yacine
Video ID: OgEGV7apEzI
Duration: 1:31:20
Segment count: 2327
Raw transcript extraction produced by the local youtube-content helper on 2026-05-02.
{
"video_id": "OgEGV7apEzI",
"segment_count": 2327,
"duration": "1:31:20",
"full_text": "Currently most post- training of large language model is done via reinforcement learning method like GRPO. The way it work is you take a problem you have the model generate a bunch of rollouts you score each of the roll out with a verifier usually did you get the right answer yes no and then you update the weights based on which attempt were good and which were bad. The issue is that this reward signal is extremely sparse. You get one score per entire rollout. So the model has to figure out on its own which of the token in a thousand token chain of thought actually matter. This is a pretty brutal credit assignment problem. In this video we're going to explore a family of method that kind of sidestep this bottleneck entirely using something called self distillation. The core idea is that in many of these environment you actually already have rich textual feedback. Things like compiler error runtime exception judge evaluation that sort of stuff. These explain why an attempt failed and current oral method just don't use that information and in the extrem they collapse all of this into a binary reward. What self distillation does is it takes the model own give it back to the same model but now condition on the whatever feedback you got from the environment and it let it re-evaluate its own tokens. In this distillation setup, the teacher and the student are literally literally the same model, but the teacher just see more context. This create a dense token level learning signal while at the same time being cheap to produce since it's literally just a single forward pass over the existing rollout. This approach was developed by two groups working together. We have the people at ETH Zurich with Yonas, hope I pronounced the name right, who developed the SDPO for the reinforcement learning setting. And then we had Eden Shenfield at MIT who developed the SDFT for continual learning from demonstration. 
Both papers dropped around the same time in January 2026, and they actually met earlier on in 2025 to discuss this idea. What I really like about this paradigm is that it's simple. It bootstraps learning using the model's own in-context learning ability, and it just actually works really well. On the SDPO side, it reaches GRPO accuracy six times faster in wall-clock time and produces reasoning traces that are up to 11 times shorter. On the continual learning side, SDFT lets a single model learn multiple skills sequentially without forgetting the previous ones, which standard SFT completely fails at. This family of methods is already getting picked up in production with systems like OpenClaw RL, and frontier open-source models like GLM-5 were using similar approaches in their post-training pipelines. Jonas and Idan will walk us through both methods and how self-distillation works at a high level, and I'll be asking them a whole bunch of questions. Thank you to everybody that sent questions my way, uh, during the live stream. It was super helpful. Enjoy. >> So hi, happy to be here. Um, Idan Shenfeld, I'm currently a PhD student at MIT working with Professor Pulkit Agrawal. Um, my research is focused on reinforcement learning algorithms, mainly for LLMs but also for robotic applications. Uh, before my PhD I did research on reinforcement learning in my undergrad, and then worked on autonomous driving for a bit; I was part of General Motors' big autonomous driving project. I did a lot of cool applications of reinforcement learning for the real world there, trying to solve problems that actually involve a lot of moving parts, like other drivers, for example, pedestrians, etc. Um, I was also at DeepMind; I was part of the post-training team as an intern in summer '24. Um, yeah. >> Man, you were everywhere. [laughter] How old are you? You've already done more than the average student. A bit older than Jonas, that's for sure. 
>> But I I wanted to ask you this um because I saw that you were like doing work that was more practical and then like doing the research uh stuff. Uh what would you say is like the biggest difference in like mentality that you have when you do that type of like more practical minded work that you know is going to go into like I don't know like a car or something like that and like some the more research u long-term type of stuff like how's the switch going in your in your uh in your work? >> Yeah. So I think when you work on an actual product actual application what you really care about is performance. uh understanding is just a tool to get the right performance and sometimes you know like as much as understanding is fun and research is fun just going through your data set and cleaning it give you way more performance boost than anything else. So it's kind of you need to kind of stop yourself from focusing too much on what is you know like cool or fun or new and do the like you know grind work first and only then like put effort well I feel and this is one of the reasons I like like went from industry to PhD that as a researcher that don't care about a specific application the thing that we bring into the world is new understanding is new knowledge and this require completely different like perspective I don't care about like being state-of-the-art on any benchmark. Uh I also think and this is in general good that the community kind of went away from that. If you saw papers from you know 20 I don't know 18 1920 it was all about being number one on the imaginet leaderboard stuff like that and I think that these days you know people still use benchmark like they produce numbers on a and stuff like that but it's not really about being number one because there is always a bigger model that will be number one. Um, it's all about what kind of new understanding, new knowledge, new perspective you give to the community. 
And I think that this is much more fun than just, you know, playing the numbers game. >> Yeah. Yeah. I agree here. I totally agree. I also agree that, to be honest, because I have a consulting practice where I literally just go with businesses and we try to implement the stuff, they always want to go for the flashy thing, but most of the gains are just like, hey, let's look at each of the data points one by one, right, >> and see how messed up they are. And then you clean it up and you automatically get a massive gain, and then you're kind of done. Everything else is extra. Um, cool. This is really nice. Um, I had other questions but I'm going to leave them for later. Uh, Jonas, you can go ahead. >> Yeah. Awesome. Also great to be here. I'm Jonas Hübotter. I am a PhD student at ETH in Zurich working with Andreas Krause. Yeah. My research has primarily been focused on something that's called test-time training, which is effectively this idea that you can continue training a model at inference once it's been given a task. And my primary motivation is the question of how can we have models that are deployed at inference time in a new environment and then continue learning and exploring within that new environment to eventually solve very hard tasks which initially were completely out of distribution of their initial abilities, right? Um, and that entails a lot of different questions, and obviously it touches on continual learning. It touches on this question of how do you do effective exploration, and a lot of other questions as well. A lot of my prior work, let's say, was primarily focused on this question of how do you do effective exploration within that new environment. Yeah. Before that I studied computer science in Zurich and in Munich. >> Good. 
And I had a question about like your um efficiency learning at test time active fine-tuning of LM influence uh sorry active fine tuning of LM it feels like really like um what's it like um an extension SDPO of like uh of that stuff that you were doing like were you really influenced by that work and just like the the next logical step >> uh to to some extent yes right like this other this other work that you mentioned in that context we were asking okay how can to continue training and improving an LLM at test time. Once it's given a task through gradient descent, there's obviously there's several aspects that are important when you want to improve your LLM. There's important what data do you do you learn from? And that's the question that we asked in that work. But then as we discovered in subsequent works, that's not the only question. Another very important question is what loss are you optimizing? and and so especially when you learn at test time it's very important that you one are very efficient in turning your data into gradient updates right um purely also from a practical perspective if you want to run these algorithms and do interesting things with these and demonstrate these interesting environments and we are compute bound we have to make the most out of the signal that we get that's number one and number two is I think in settings we also So usually just need better uh signals and you know just extract as much signal from environment as possible. >> Yeah. Yeah. I agree. I I really like how like because in in my view like the RL stuff there's just so much especially when it varable reward like you there's just so much extra things that like an actual human being that will would go through the same process. 
it will look at this stuff right they will look at the error trace and they will look like at the I don't know the documentation like the feedback from like an a demo or something like that >> and then we'll just integrate it in like its learning procedure 100% >> but ju just giving this the the end like hey you messed it up like oh okay sorry all the rollouts >> bad >> um it always felt a bit wasteful and they were that we were kind of missing something um and like I I I liked I like the the all the studies where there was teacher that was trying to generate rollouts, but then like you're all in token space and it's just weird like what you're going to do to match the the two kind of stream. So no, I was really excited when I saw like how you you guys were doing it. Um there was pretty good >> the original motivation was actually yeah for me coming from pretty much that angle because we were working on coding at the time and it just seemed obvious that you know at at that time the coding agents were not that good as they are now. And so when I was trying to make chatbt code, it was like this back and forth chatbt generating some code. I'm running it. I'm pasting back the error. And you know, it was at the time where where it started to get decent at that. So the intuition was okay, it's already very good at or somewhat it's starting to become very good at understanding what are the sources of the errors it makes. >> Yeah. And um I actually before we jump into the presentation, there was also your paper or razor why online referencement learning forget less. 
That's also something that got me even more excited: with this type of self-distillation method, I think it's easier to weave in the different learning that you put in. There is some forgetting as the thing goes, but it seems to be the type that is normal, I would say, for, I don't know, a human. Like if you were to learn four things in a row, okay, you will forget a bit of the first thing and stuff. Um, but you will not catastrophically forget everything. Was this work also helping you go in that same direction? >> Yeah. So that was a big influence. Um, my RL's Razor paper was a big influence on SDFT. Basically, I finished the RL's Razor paper, and our conclusion there was that on-policy methods tend to forget less compared to offline methods like SFT. And I was like, okay, that's very cool. We mainly focused, you know, on RL as the main on-policy method that people use, but I'm like, okay, the conclusion of my research can't be: oh, let's just do RL and throw any other kind of learning signal in the garbage. You know, as you said, there's so much learning signal, like expressions and hints and >> feedback and so on. And I'm like, okay, we need to find an algorithm that is >> on-policy like RL but is able to learn from very diverse kinds of signals. Um, and that's where I started to work on this idea of using in-context learning, you know, with distillation, to create self-distillation. And actually, allow me to put a small story here about how Jonas and I started to collaborate on that. >> Go ahead. 
>> Um so we know each other from conference like few years already and we met at Nurips's um this December and we sat down and you know as usually like we kind of like update each other on what like exciting project you were working on and you open his laptop and show me some slide on like it was the beginning of this SDP work and I'm like no way that's the exact formulation [laughter] I came up with and I'm opening my laptop and showing him like you know the beginning of like an overly like few paragraphs of ideas and few experiment like okay like we both stumbled on the same like you know idea here um different perspective I'll say like Yonas was focused more on the textual feedback that led to the STPO paper I was focusing more on learning from expert demonstration that led to the SDFT paper but that's how we are like okay we are working on the same core algorithmic innovation just >> collaborate that's what I found like so exciting at that time Because I think we were both thinking and part of the motivation was that this could be a learning paradigm that could enable learning really from very rich diverse data. But this really showed it to us, right? Because we basically arrived at the same underlying learning algorithm but coming from completely different directions and working on very different problems. >> Yeah. And I I mean like um because I I I I also really like to like learn from the old older paper, not really for the knowledge because like uh when you when you you push too far down there's a lot of noise because they're going to say like oh we think this and then you read it like 30 years later and like dude no like that was not it. Right? But like you you see some of these convergence of idea brewing that you know that in like a year from now they're going to get it right. 
they're going to get around like the a very good example of this is like just like the all the ResNet highway network kind of moment in in machine learning because there was all this convergence of ideas and then like the every time like it's like different lab that don't necessarily talk to each others and then they don't necessarily like also reference each other but then they come into the same direction and then the kind of paradigm that usually win is like the one that can distill the idea to the simplest atomic unit Right. And then this then get like tested on in like in in many different form. Um which is kind of what I I felt here because like you're in MIT he's in ETS Zurich. Um I was trying to find a connection and like hey wait so like these guys are not 100% related. I know this this was also exciting validation uh from like a historical perspective. Okay cool. Uh this is pretty good guys. I want to say something about your previous point regarding how ideas kind of converge. >> Yeah, >> I think like ResNet is a great like example. Another great example is attention, right? Essentially all you need is not the first attention paper, but it's you know >> there are many. I think when it comes to um our work on self-distillation there is a core idea that I start to see like coming up recently and uh Omar Katy with his RLM work like also kind of push that which is that we got to the point to where models can start to be the force that push their own learning algorithms like they have enough capabilities that end you talk to JGP which means that like we should not train them the same way we train three layer MLP should use their own capabilities, their own reasoning to drive new learning paradigms. >> Exactly. And I think like like two three years ago like in context learning like it was I felt it was something very powerful and it tells you a lot about the model, right? Because it knows a lot and he has like learned all these function that it can recombine. 
So the theoretically if you give it like some some like a direction and new information it's not like it's not a dumb rock like it will be able to it will be able to kind of go and and and and go in a direction and you like these model that we using cloud and stuff like as soon as they go and then they search information on the web and stuff like it's pretty good like it will it will be able to like recover and then and iterate and and do their stuff. Um, so if there's a way to just like make like it's more of like allowing them to do the learning, right? A bit like Ilia said like these models just want to learn like just allowing them to have the chance to like incorporate the right signal and the learning. I think that's kind of one of the the bottleneck and if you allow them to do that then they they're able to learn a bit like reasoning. like a reasoning it's like in my view like as you just start to loop back and back and let them like just do their stuff they kind of like were able to lift a constraint from themsel which is like I'm going to get an answer and then that's it and then we're done right it's now it's more like I'm going to get an answer and then like put out some scaffolding and then like use that scaffolding to be able to generate the right stuff and then then I output the stuff there's a lift directly in performance all across because kind of like this constraint that we've put on the on the main structure is lifted. I feel it's it's literally the same thing. It's like this constraint of like okay now what if the context window reset oh you have to do this again eh but like now it's like yeah you were able to learn from these and then you were able to move your weight into kind of like a a peak in in a value that like make more sense for like this these specific tic user. 
uh this makes a lot more sense because like you don't have to mess around with like crazy memory scheme and stuff like uh it will have a better sense about like what to do next. Um and if like you like if if you you weave it in so that like there's there's not too much catastrophic forgetting that happened, it's a very viable in my opinion kind of pattern uh to have these model um have. >> Yeah. So we put together some slides. Um what we'll talk about will be a little bit of a unified perspective of these three papers which we put out which as we discussed they're all on the same algorithm. The algorithm is exactly the same in fact but they touch on three different perspectives of how that algorithm can be useful. And really special thanks go to Thomas who has also been leading one of these three [snorts] papers and a super exciting one and we'll talk also we'll cover that as well. What I would like to start with is this observation that in many ways current learning paradigms are imperfect and that surfaces in different ways. So as we talked about one big issue in many learning paradigms is that they lead to catastrophic forgetting. So meaning as you try to learn a new task, you have this catastrophic behavior that you become significantly worse on previous tasks that previously were good at. And that's obviously not something that we want. Then the second problem I think surfaces in a little bit more in a little bit more subtle way which is that most current learning paradigms they require very careful designing of the data that you put into them. So this could be SFT data sets or this could be RL environments. And I think this shows predominantly in the fact that there have been a surge in startups or even like scale AI now a bit older and already acquired. 
really a search in companies that whose primary aim is to organize um the data and I think a dream of many is to have eventually a learning paradigm where the models are able to make sense of the raw data itself as they interact with the environment and go out in the world and find the data. And then the third one of course is this observation that often systems have some kind of brittle generalization that if you prompt them in exactly the right way they do what you want them to do but if you just change your prompt a little bit they seem not to be able to do what you want them to do. Meaning somehow they have this quote quote like jagged intelligence. They didn't seem to fully internalize what you were aiming to teach them. And so what we instead want in a learning paradigm of course is we want some system that is able to continuously improve forever without degrading. We want some system that is able to learn from arbitrary real data not some handdesigned or filtered data. So really real data is you just deploy it in some environment. And we obviously want systems that generalize. And so the thing that we will we will talk about is this thing called on policy self-distillation. And we really view this as a new learning paradigm that enables continual learning from arbitrary data and makes some improvements on the three dissidorata that we outlined. >> What's your take on forgetting? Because like continuous learning is like is one thing but if we're talking about like a system that kind of is is learning from your your your pace and like your way of working here from like a human or like a process. Uh but like I said the process in the human is flawed right and then like okay if not finally like that was the wrong direction we need to go there. What's your your take on on that stuff like the the forgetting the ad adaptation uh in general? >> Yeah I can answer on that. 
So for me at least, the ability to learn without forgetting is almost a must in a world where, you know, we have AI models that are actually being deployed. Think about, let's say, Claude or ChatGPT: according to recent numbers, they process around 3 billion messages a day. Not learning from these messages is, you know, a waste of data. User interaction and environment interaction are going to be the largest source of data that we have to train models. Um, but unlike other sources of data, which we have, you know, in some fixed container, where we can choose when and how we use it, this data is streaming data. Every day you get it bit by bit, and therefore you have to do, you know, this kind of continual learning, which is essentially learning without forgetting. You have to be able to update yourself every day a bit, becoming a bit better, aggregating another new skill, another new capability, um, without, you know, starting to degrade on what you know already. Uh, this is true in the general sense of, you know, let's improve the model from 3 billion messages, and this is also true in the small sense of: if I have my own, you know, OpenClaw, and I want my OpenClaw to be tailored to my preferences, um, it also will not get one big dataset of my preferences. Every day I'll tell it a new thing that it needs to remember. Um, and this is just very different from the way we think about, you know, classical machine learning, where we start with: oh, assume a dataset drawn i.i.d. from some source. And therefore forgetting is something that really needs to be dealt with. 
Our agenda, um, for this talk (we'll try to keep it somewhat short) is first of all to cover what self-distillation is and what the core ideas behind it are, and then to go through the different parts of post-training that we believe self-distillation can take part in, which to be honest is most of post-training. So we'll cover how self-distillation can be used to learn from demonstrations as a replacement for SFT, from verifiable rewards as a replacement for RL, and also to learn from new kinds of, um, learning signals, such as rich environment feedback and real user conversations, that are not usable with the current methods. So let's start with in-context learning. So in-context learning, um, I think these days we almost take it as obvious, like, okay, everyone knows that it happens. But I still remember 3 years ago when people started to realize that this happens, and it just seemed magical: you just put data into context and the model just changes its behavior and is able to learn from whatever data you put into context. And it doesn't have to be in a really nice form like examples. It can be hints, it can be feedback, it can be random instructions. And it also generalizes very well. But, and this is a big but, it's very transient. If today I put some context into the model, and tomorrow I start a new session and the context is not there, the model goes back to how it was. There's nothing that just stays there. And the second, which is another big limitation, is that context windows are bounded. And in the end, if I want to become better at every possible task on Earth, I cannot put examples for every possible task in context. And therefore, when we looked at in-context learning, we were saying: okay, we have something really good in our hands, but we need to somehow compress it. 
We need to somehow take it and put it into the model weights, to make it stop being transient and become something that stays there even when we remove the context. And this led us to the core idea of self-distillation. I'll now briefly cover the algorithm itself. We take an LLM and we operate it in two modes. The first mode is the student mode. It just gets the input prompt, some question from a user, X, and it outputs a response Y. The second mode is the teacher mode, where in addition to the input prompt the same LLM also gets an extra context C. This extra context, again, can be expert demonstrations, instructions, feedback, whatever. But the important thing is that now the model is conditioned on another input, the context, and this automatically changes its output distribution. Now the output distribution, the responses that the teacher would have produced if we sampled from it, is kind of different from the student's. And therefore the core of self-distillation is just to use this teacher and do teacher-student learning, an old paradigm in machine learning, where you just minimize some distributional measure between the teacher and the student (in our case we chose the reverse KL), um, and take the gradient, of course, only through the student and not through the teacher, because the teacher is the one that guides the learning. This is a distillation algorithm, uh, very similar to what people have been using quite a lot for big-to-small model distillation, but with one important change: the teacher is the one that is changing. Um, the model itself is the one that is guiding the learning. Um, and why we like it is that this is much more similar to how humans learn. Humans don't learn by looking at some example and just trying to mimic it, you know, one-to-one. 
Um, what we usually do is we observe, we condition our short-term memory, we condition our actions on whatever we see in front of us, whatever feedback we got from the environment, and we use that to improve our behavior. So the force that pushes the improvement in human behavior is humans themselves, and we want to give models the ability to do the same. So following this idea, how do we do it in practice? Given the teacher and the student, we can compare their output distributions for every single token. So we go over some assistant answer. In this case, um, the user said: answer with yes or no, is water wet? And the assistant produced an answer that is much more than what the user asked for: yes, the water is wet. And the user said: I said yes or no only. Um, in hindsight, we take the full conversation, put it in context, and ask the model: let's go token by token over your assistant answer and see how, now that you know that the user afterwards said 'I said yes or no only', the probabilities change. Stuff like 'yes' will become more probable in hindsight, and everything else will become less likely in hindsight. And this is the core, uh, algorithmic object that we're working with: the log probability ratio between the original model and the improved policy. This is actually very interesting, because when you take this log probability ratio, there are two perspectives on how we can use it in learning. One is that we say this is just a token-level advantage or a token-level reward. This is similar to how in normal GRPO, where we train LLMs, the advantage is, you know, the reward of this answer minus the average of the rewards across all answers. 
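[Editor's sketch, not from the talk.] A minimal Python illustration of the two signals just described: a GRPO-style sequence-level advantage (reward minus the group mean) next to a token-level log probability ratio between the teacher (the same model conditioned on the feedback) and the student. All probabilities, the token list and the function names are made up for illustration, and the sign convention (teacher minus student) is an assumption; GRPO implementations also typically normalize by the group standard deviation, which is omitted here.

```python
import math

def grpo_advantages(rewards):
    # Sequence-level signal: one number per rollout, reward minus the group mean.
    # (Assumption: no division by the group standard deviation, to keep it simple.)
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def token_log_ratios(student_probs, teacher_probs):
    # Token-level signal: log p_teacher(token) - log p_student(token)
    # for each token the student actually sampled.
    return [math.log(t) - math.log(s) for s, t in zip(student_probs, teacher_probs)]

# Hypothetical numbers for the sampled answer 'Yes , water is wet'.
tokens = ['Yes', ',', 'water', 'is', 'wet']
student = [0.60, 0.50, 0.40, 0.90, 0.80]   # model without the feedback in context
teacher = [0.95, 0.05, 0.02, 0.50, 0.50]   # same model, feedback in context

ratios = token_log_ratios(student, teacher)
for tok, r in zip(tokens, ratios):
    print(tok, round(r, 3))

# One scalar per rollout versus one scalar per token:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

With these made-up numbers only the first token gets a positive signal, which mirrors the yes-or-no example: in hindsight 'Yes' is reinforced and the extra tokens are pushed down.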
Here we just say: okay, instead of using that signal, let's use another signal but plug it back into the RL algorithm. At the same time, we can look at it as a distillation algorithm: if we sum this log probability ratio across the entire token space, we get just the KL measure, um, between, as I said before, the student and the teacher. Now, one nice thing that we like about the, um, self-distillation perspective is that, unlike RL, where you get feedback only on the tokens you sampled, if you take the full KL you get a learning signal over the full token distribution, which is much, much more informative. And, uh, you know, we had the whole conversation about how normal RL, RLVR, is just one bit of information per trajectory. If we do these rich advantages, we get more bits, because now for every token we get some feedback. But if we do the full KL, that's a very rich signal to learn from. And we'll see later how it also leads to, um, faster convergence. So another nice thing: we said, okay, we let the model decide how to learn and when to change its behavior, but how much can we trust it? Maybe this process is very noisy. So in one of our works, led by Thomas, um, he compared different trajectories. So what do we have here? Each one of the tables with the colors you see in front of you is a table of trajectories. Each row is a different trajectory and each column is a token. So we have 20-something trajectories with 128 tokens each. And this is similar to before: we have a multi-turn, uh, conversation where the user gives feedback. 
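[Editor's sketch, not from the talk.] A small Python illustration of the full-KL signal mentioned above, at a single token position with a toy three-word vocabulary. It assumes that reverse KL means KL(student || teacher), and it makes explicit that the sum over the token space is weighted by the student's own probabilities. The vocabulary, the numbers and the function names are hypothetical.

```python
import math

def log_ratio(p_student, p_teacher):
    # Per-token term of the reverse KL: log p_student - log p_teacher.
    return math.log(p_student) - math.log(p_teacher)

def reverse_kl(student_dist, teacher_dist):
    # KL(student || teacher) = sum over the vocabulary of
    # p_student(v) * (log p_student(v) - log p_teacher(v)).
    return sum(ps * log_ratio(ps, pt)
               for ps, pt in zip(student_dist, teacher_dist) if ps > 0)

# Toy next-token distributions at one position (made-up numbers).
vocab = ['Yes', 'No', 'Water']
student = [0.50, 0.20, 0.30]
teacher = [0.80, 0.15, 0.05]

# Sampled-token view: feedback only for the single token that was drawn.
sampled_index = 0
sampled_signal = log_ratio(student[sampled_index], teacher[sampled_index])

# Full-KL view: every vocabulary entry contributes at this position.
full_signal = reverse_kl(student, teacher)

print(vocab[sampled_index], sampled_signal)
print(full_signal)
```

The sampled-token term is a one-sample estimate of the same quantity: averaging it over tokens drawn from the student recovers the full sum. In a training loop the teacher distribution would be treated as a constant, so that gradients flow only through the student, as described in the talk.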
Now, sometimes the user gives relevant feedback, a relevant user follow-up, and sometimes the user asks an unrelated question after that. And one nice thing that we noticed regarding in-context learning in this experiment is that the model will decide to change its original answer only if the user gave relevant feedback. So you see above, basically, the color corresponds to how much the log ratio changes, how much the probability changed, which means that the model decided that there is another answer that would be better given the feedback. And you see that there is much more color on the top plot than on the bottom plot. Basically, if the model didn't get any relevant feedback, it will not change its answer. Which means that this in-context learning only pushes the learning, only changes the model, when we have relevant feedback. >> The whole system seems to hinge on the in-context learning ability of the model, right? It's leveraging this in order to guide the trajectory. And one of the questions was: what if the model is not able to understand the correct answer, right? Not necessarily because it doesn't have enough capacity, because it's too small, but it just doesn't understand what the correct answer is. It will still judge the token outputs. Is this going to lead to the direction being random, or what happens in this specific case? >> Yeah, in that case, if the model understood that there is some feedback and that it should change its behavior, but it didn't really understand how, and it isn't able to push itself in a good direction, uh, then yes, the update can be quite random. 
The nice thing, I'll say, is, one: usually you do training over a whole data set of examples. Even if there are some that the model was not able to understand, it will be able to understand others, and the average gradient will still push you in a positive direction. And second: look at the models that we get. Every year we have new models that, even with the same number of parameters, just have much better capabilities. So if I compare, these days, Qwen3 8B to the first Llama 8B, the in-context learning capabilities of the new Qwen are much, much better. And I expect this to continue growing: the new models of next year and the year after that will just have better and better in-context learning capabilities and will be able to understand users better. >> Right. Maybe one thing I would add to that is the other aspect that this points to, which is quite important: the feedback itself. In the same way that we as humans also rely on the feedback that our environment gives, to me this points to a super interesting field which I think will open up in the future, which is the question of how the model best seeks out the right information from its environment. In the same way, when we want to learn about a particular problem, we have to learn where to look and where to get guidance from. If we search for people that help us become better at XYZ, we try to find the right people that actually give us the feedback that our policy needs, you know. >> Yeah. >> And I think that will be super interesting in the future. >> As a human, the hardest learning I had to do was not hard because the material was hard. It was just because I had to fight to get a signal, right? 
Like, I just had the end signal, and I knew it was something that I had to do a lot of work to get to, right? And sometimes there is nobody to ask. You just have to try a bunch of stuff and then you get some more idea about what's going on. But sometimes the environment was just not well set up to give me the signal. As soon as I get the signal, I'm like, \"Okay, that's it.\" I just get another textbook, look at it, and okay, that's just the stuff, it's not that hard. So yeah, it's true, the richness of the feedback is really important for learning in organic beings. >> I do agree, and maybe, the same way we see a lot of environment design going into RLVR these days, we'll see a lot of environment-plus-feedback design going into training data loops in the future. >> We have one question from Rubni Carmona. He's asking: does this paradigm skew the model representation toward metacognition without the substance of actual cognition? And I asked him what he means by actual cognition, and he says: if the representation of a task or ability A gets progressively higher-level while losing the initial specificity. He's asking whether the model is losing a bit of specificity to go a bit more high-level. >> I think there are two loops here. There's the metacognition loop, which is kind of equivalent to in-context learning: how much the model can learn from feedback and guide itself. And there is the inner loop of the specificity, which is how much it can actually solve a problem, solve a question. And I feel like here we mainly take advantage of the metacognition loop to improve the specificity. I believe that there are other works, and this is a very exciting line of research, to see how we can also improve the outer loop, how to make the model more 
aware of the fact that it's being used in this kind of self-distillation training, so that it will be able to provide a better signal for itself. Actually, Jyo Pari from my lab here at MIT released a paper around a year ago about self-adapting LLMs that did exactly that: basically an outer metacognition loop and an inner RL loop, where it basically trains the model to give better feedback to itself for training. >> So I believe in combining that with self-distillation kind of algorithms. Back in the self-adapting language models paper, you did mainly supervised learning in the inner loop, but combining the two ideas, improving the outer loop and improving the inner loop, can be quite powerful. >> What are your thoughts, guys? Because we're talking about continual learning, and maybe you're going to talk about it a bit: a very, very long-horizon hard task where the feedback is rich, right? You are able to get some feedback about whether you're going a bit in the right direction. What do you think here? Would it be a useful methodology to refine the model as it is going through the context, let's say in RS like RM or whatever it is, so that it is able to, without blowing its context window, manage it properly, run these experiments, get some feedback, know if it's going in the right direction, and inject this into its weights? Do you think that this has some utility in this more automated-research-ish, long-context, hard task? >> Yeah, I think we'll have some early examples of that later that we'll cover. But I fundamentally agree, I think 100%. 
To me, something that also seems interesting is that, intuitively, as humans we do this reflection over various horizons, at various frequencies: both in terms of very narrow, short-term feedback that we get from our environment as we interact with it, the immediate responses that the environment gives, as well as over longer periods of time, where, I don't know, we recapitulate: okay, now that I talked to my supervisor about what I've done the last two weeks and he told me XYZ, maybe I [laughter] should have done something different. >> Yeah. That's what was in my mind, because every time I was doing research, looking back when I was done, I was like, that was max four weeks' worth of work, right? But throughout, I was getting all sorts of signals that I had to untangle, and I had to cobble together the feedback in order to figure out the direction. And I knew when I was off, right? Even though I was dumb, I knew this was not the right direction for sure. So I had to go over there and mess up a bunch of stuff, get enough feedback, like, hey, it's here, and I just continued, and then I got the pieces, and when I linked them up it was like three or four steps. I was able to do it, but I needed the feedback from the environment. So this is what I was thinking: in a setting where you can get feedback, even if it's complex, very, very detailed feedback, you might be able to steer the model in this direction, right? Because, let's say there's a discovery that we want to make. Now the model has it in its data set and was trained on it, so right now it knows, right? But before it knew, the model was not terribly different from the one now, after it was trained. 
So being able to get these signals there, maybe we'll be able to arrive at these kinds of results, if the feedback is rich enough. >> Yeah, 100%. I think it's a very, very important point that you make there. To me, it's certainly the case that, especially as we humans do reflection over longer time horizons, we do this form of meta-update of our strategy: once we've made the discovery, we reflect on how we can make these discoveries better in the future. We don't say, okay, in hindsight I should have just straight one-shotted my discovery. What we do is try to understand: what kind of systems can I build, or how can I improve my way of doing research, so that I will make discoveries faster? And we are aware, through some form of trial and error and reasoning, that this does not come through better one-shotting but through building better systems and better ways of probing and getting to answers quickly. So yeah, I think that's super interesting. I'm very sure that over this year, and also longer term, we will see many more results on that. >> So before going into specific examples of learning, we want to touch back on our desiderata, basically what we expect from a true continual learning algorithm, and see how the algorithm that I just described checks each one of these requirements. As we said before, what we want is, basically, performance: we want the model to learn. We want no catastrophic forgetting. And we want to be able to learn from arbitrary, non-curated data. Let's start with performance. As I mentioned before, our algorithm is an on-policy algorithm, which means that it's a feedback-style algorithm. The model, in student mode, rolls out its own trajectories, it tries to solve whatever task, and going over this trajectory, it gets feedback. 
Now, there is a very classic result from 2010, from the DAgger paper, a very famous paper in robotics, that shows that basically, if you do on-policy learning, you're supposed to get much better performance on your task. And why is that? Let's look at the other option, which is to do teacher forcing, basically to just look at teacher demonstrations. To understand the pain point here, let's look at this toy example of autonomous driving. So we have this car driving this loop, and it has a bunch of expert demonstrations, in blue here, and it's trying to learn from them. The problem is that all of these expert demonstrations are only of a car driving quite well in the middle of the road. What will happen at inference time? You learn, you deploy the model. Now, the model is not perfect, and sometimes it will get very close to the wall, but this area of being very close to the wall has no coverage in your training data. Basically, it creates a train-test distribution shift between training and inference, and you get to an area that you didn't learn over, and basically you're screwed. The model does not know what to do. 
Now, this is kind of an extreme example, but we see it's also true in any kind of learning, as long as you have this sequential decision making, when the model outputs a sequence of actions, or a sequence of tokens as in LLMs. If you just follow the teacher, you get a lack of robustness that leads to just poor performance. And this was also known for normal on-policy distillation versus offline distillation from a big model to a small model: on-policy distillation just gives you better performance. And that is one of the reasons we chose to focus on on-policy self-distillation. Another one, and I think we touched on it a bit before, comes from my RL's Razor paper, where we showed that on-policy learning tends to forget less. So we did a very simple experiment where we took the same task, the same set of prompts, and trained a model with either RL or SFT, and measured not just how much we improve on the new task, as we see on the x-axis of the plot on the left, but also how much we forget. We took a set of eight known benchmarks, like IFEval, TruthfulQA, MMLU and so on, and we basically evaluated how much degradation of capabilities we have when we learn a new task, in this case tool use. And something that we found is that RL tends to learn without forgetting, while for SFT, in order to improve your model, you need to sacrifice prior capabilities. And this is, by the way, not only true for LLMs: we tried with robotic foundation models and even with a three-layer MLP on MNIST. And just to try to give a short intuition about why this happens: in a lot of these problems, there's not only one policy that can get to, let's say, 90% success on the new task. 
It's a whole set of them, and which one you converge to will affect how much forgetting you'll have. And on-policy methods such as RL have a tendency to converge to ones that are as close as possible to the original policy, even without any explicit KL regularization or something like that. So this kind of implicit bias towards minimum change keeps the models from forgetting more and more. And we took advantage of that and said, okay, if our self-distillation is an on-policy self-distillation, it will also forget less than if we do, let's say, offline self-distillation, where we're just generating data from the teacher and learning from it. And so we know that we can get good performance, and we can learn without forgetting. The other question is: can we learn from arbitrary kinds of data? And I think this is what we really worked on quite a lot in our series of papers, to show that this is possible, that this is not just one algorithm that learns from one type of learning signal such as verifiable rewards, but that this is a really core and versatile learning paradigm. And we'll go through some of these examples now. Yeah. So the first and, I think, most intuitive one that we like to start from is learning from demonstrations. Basically, you have the expert demonstration, you can put it in context and ask the teacher model to produce its own response, which means that, given that it is conditioned on the right response, it will be a correct one. And this is a different use of that data: this is the same data that we use in SFT, but a new algorithm that learns from it. And as we see in the plot here on the right, we see two things again. We have on the x-axis the new-task accuracy and on the y-axis the prior-task performance. 
And we see that SDFT, our self-distillation algorithm, is able both to get better new-task accuracy, again because of that train-test distribution mismatch that does not happen in on-policy learning, and to learn with almost no forgetting, which is exactly what we want to get from a continual learning algorithm. And this also enables us to do what we call a real continual learning experiment: not just taking one task and saying, oh look, we learn without forgetting, but actually taking a model and trying to improve it on a series of tasks. So we have tool use, scientific QA, medical questions, and we're asking: can we get a model that just goes data set by data set and improves on all of them? And here we see that on-policy self-distillation is actually able to do it. Every time you change from one training data set to another, you have a slight degradation of performance, but overall you are able to retain and aggregate. This is unlike supervised training, normal SFT, where once you change from one data set to the other, the performance on the data you learned before automatically drops. So you are able to learn only one task at a time and not really aggregate. And this is, for me at least, very exciting, because this points to the real continual learner that we talked about and suggests that it can be possible. >> I really like that figure because it shows really clearly the difference between the two. But I was really curious, just gut feeling here, right? How many diverse data sets do you need to pile up here in order to make on-policy self-distillation crumble? Do you think it will show massive degradation at some point, or do you still think it's going to be robust? >> That's a really good question, right? We still need to push it at scale to see where things break. 
That is kind of the downside of doing academic research, where at some point you're saying, okay, I ran out of compute. But my closest point of comparison is, I think, Nvidia's Cascade Nemotron paper, where they did RL on a series of tasks. If I remember correctly, they had six different tasks, and each one of them had a much bigger data set than the ones that we used. You know, if they did tool use, they did a proper, like, tens of thousands of tool-use problems. And there, with on-policy RL, they were able to show that you can keep learning and basically have no degradation at all, even after, if I remember correctly, thousands of GPU hours. And since the core mechanism, the on-policy learning mechanism, is the same, I expect SDFT to do the same, being able to just keep aggregating performance. Of course, you can see that at some point, when we move from tool use to science, for example, the tool-use plot drops a bit. >> Yeah. >> By around 10 to 15%. >> So I feel like this will happen anyway. You know, you move from your local minimum to a neighborhood that is still good but not as good. >> You kind of sacrifice a bit of specialization for generalization, which is fair game, right? 
I think it's good, because even with humans, take one guy that just looks at one thing versus somebody that has 17 different skill sets: okay, the generalist will be more useful generally, but when you get to something that requires doing just one thing, you're much better off getting a specialized person. But looking at this plot, and then looking at the one on the right, I'm thinking about the frontier pre-training and post-training pipeline: >> does it make sense to weave in RL a bit earlier, or self-distillation like this earlier in the process, versus just jamming supervised fine-tuning in there? >> I think it completely makes sense, and this actually corresponds to what we see frontier labs starting to do. Basically, if you look at recent papers, you see that they kind of took all of their SFT data and pushed it into pre-training. So now pre-training is a mixture of random internet data and very well-structured, high-quality SFT data sets, and this all happens at once. So you don't have this sequential training, because they know that the sequential training is bad, and then after that you just move to a big RL training. >> Yeah. >> And in the RL training, since it's on-policy, you can allow yourself to have different phases, right? Math, and then code, et cetera. But at least they stopped doing a separate SFT, at least some of them. >> One caveat which I would add here is that in some cases you might expose yourself sequentially to environments which partially contradict each other. >> Right? We have one example in Thomas's paper, actually, where you interact with users, and it can be the same user whose preference changes, right? In the beginning that user wants very detailed answers, and eventually that changes to very concise answers. Of course, there's no way to be good with respect to both. 
They're contradictory, right? And so I think to some extent that's what's driving most of the degradation you would see in on-policy methods as you stack up your training. And so a question, if you are a model designer, is either: how do I sort or organize my tasks >> Yeah. >> in such a way that I minimize the adverse effects, so that I prioritize my tasks correctly and train last on the tasks which I want to prioritize most? Or alternatively, and this is, I think, what I've seen from a lot of the recent frontier labs: they have some relatively small multi-stage training step at the end where they mix. So what they do is sequential on-policy learning for the bulk of the training, and then they have a very small stage where they take some data from all of these different environments and do on-policy distillation with respect to a checkpoint that was already good at that task. >> It's just a way of reweighting your priorities at that point. It's not a way of trying to learn a new skill anymore. It's just a way to say: okay, maybe I first trained on tool use, but then in the end I trained on science and medical, and it's actually not the case that I want the model to focus on medical and sacrifice its tool-use priority, I want something more balanced. >> It made me think, because I've talked to the head of AI at Math Academy, and, this is human stuff, right, but he's very heavy on having this hierarchical kind of learning paradigm, which is kind of the same thing here: you would put the different training blocks in an order that makes sense. So if tool use is going to be needed in a data set or an environment later on, you're much better off learning it here, because otherwise you're going to learn tool use while you're supposed to do browsing and stuff like that. 
So I think that makes sense, and I really like the idea of, I don't know, letting the model figure it out a bit, right? You look at the different data sets and then try to conceptualize: yeah, I know about these topics roughly, I'm going to organize them roughly like that for now, and then go into the sequence. >> So the second kind of learning that we'll touch upon is learning knowledge. And this is really cool, because knowledge, for example, is something that you cannot learn with normal RL. It doesn't matter what kind of verifiable reward I have. If, for example, I take a model with an end-of-2024 knowledge cutoff and try to train it on a data set of what happened in 2025, as we did, no matter how much exploration it does, how many guesses it has, what your group size is, it will not be able to guess correctly. So this is basically a kind of learning that we were not able to do with on-policy algorithms before, but now, with self-distillation, it is completely possible. You just take the text of whatever new information, new knowledge, you want to give to the model, you give it to the teacher, you put it in context, and it just trains itself. We compared it in our paper to a few baselines. First of all, normal in-context learning, or RAG. We also compared to CPT, continual pre-training, which is a very standard baseline of just next-token prediction on the document. And also what we call supervised training, which is kind of a self-study algorithm that was popularized recently by Microsoft: taking the document, generating a huge data set of synthetically generated questions on it, and then training the model to answer them using normal SFT. And of course we compare it to our on-policy self-distillation. 
So if we look at the results: the base model, of course, is not aware at all of the new information from 2025. So it gets zero both in what we call accuracy, both strict and lenient accuracy, where we just ask the model questions about events that happened during 2025 and expect it to answer, and also on OOD accuracy. What is OOD accuracy here? We checked a bit for generalization. So this is not a direct question about, you know, what happened in this storm during 2025, but more like questions that need to use the knowledge in a more general way. For example: what were the 10 biggest natural disasters in the last 10 years? The new knowledge should change the way you answer, because some of these 10 biggest natural disasters happened in the last year, but you don't ask the model specifically about those events. This is a way to see how much of the new knowledge was actually incorporated into what the model actually knows about the world, its world model. And then we compare it to CPT, which does terribly, barely able to learn anything, and SFT, which we know from previous work does kind of well: it's able to get around 80% strict accuracy but does not do that well on OOD accuracy. So basically, if you do SFT, you kind of memorize answers. This is a very well-known problem with offline algorithms where you do teacher forcing. Whereas when you do SDFT, our algorithm, you are able to get better accuracy, but mainly also better OOD accuracy, which means that you actually incorporated the new knowledge into the model. >> So to preface that, I want to start by discussing in which ways on-policy methods, or current on-policy methods, have been bottlenecked, and that also touches on what Idan mentioned about why, for example, they would not be able to learn knowledge. 
So the first way in which they were bottlenecked is that methods such as GRPO, or methods in RLVR generally, so reinforcement learning with verifiable rewards, receive only a scalar signal per rollout from their environment. And that, of course, bottlenecks how much they can learn from their environment. For example, if you want to learn knowledge through a scalar signal, that will be very difficult. And then the second bottleneck is that this already weak signal is used for rollout-level credit assignment. Meaning, the policy is not trained on a specific token, or shown that a specific token was good or bad; it's just shown that the entire rollout was good or bad. So here's one example that illustrates this, and we also had this in one of our papers. The question is: write a Python function that returns all numbers from 1 to n, answer briefly. And then, in a normal process, and also in our training in on-policy self-distillation, we would sample a response from the model. Let's say it would be this Python program, and as you can see, it returns a list that ranges from one all the way through N, so including N. And then the feedback could be: do not include N. So what would GRPO, or RLVR methods typically, do? In GRPO's case, it would just say all tokens were bad. It would receive a negative reward, or at least the reward relative to the average of the rollout group might be negative, and so in that case it would just down-weight the probability of all tokens. It would make all tokens less likely. 
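To make the rollout-level credit assignment just described concrete, here is a minimal sketch in plain Python. It assumes one common formulation of the group-relative advantage (reward minus the group mean, divided by the group standard deviation); the function names, rewards, and rollout lengths are hypothetical and not taken from the talk or the papers.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    # One scalar reward per rollout, normalized against its own group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_to_tokens(advantage, num_tokens):
    # Rollout-level credit assignment: every token in the rollout
    # receives exactly the same advantage, whether or not it caused the error.
    return [advantage] * num_tokens

# Hypothetical group of four rollouts scored by a binary verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [5, 7, 6, 4]

advantages = group_relative_advantages(rewards)
per_token = [broadcast_to_tokens(a, n) for a, n in zip(advantages, lengths)]
```

In this sketch a failed rollout ends up with the same negative value on all of its tokens, which is the behavior the speaker describes as making all tokens less likely.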
But instead, what on-policy self-distillation would do is look over the generated tokens and, for each one, ask: in hindsight, given that you know the feedback was not to include N, would you still have generated that token? And as you can see here, and this is a real example, this is with Qwen3 8B, SDPO doesn't change the next-token predictions for any of the tokens except for this one plus token that follows the N, right? Because that plus token was the reason why we included N. And so this is basically the intuition. SDPO both uses this richer signal, this textual signal that describes, okay, we don't want to include N in the list that we return, and it produces denser credit assignment, because it actually says, okay, this token here you should change. But as Idan also mentioned before, it doesn't only do that over the generated tokens, it does this over the entire vocabulary. So at every next-token prediction, the hindsight policy, or the teacher, asks: for any of the possible next tokens at that position, should they now become more likely or less likely? So in this example, let's just look at the position of the plus, because that's maybe most intuitive. You would not only say that the plus here should become much less likely. You would also say that the alternative, how you should have continued at that position, is to just have closed the list, right, and completed, and that becomes much more likely, as is indicated by this blue color. So how does this manifest? If we now look at an example where there's this type of behavior, but aggregated across multiple rollouts, you see this interesting pattern. GRPO basically asks: was a response better or worse than average across a group of rollouts? And then it would make all tokens either more likely, which would be blue here, or less likely. 
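The hindsight re-scoring over the whole vocabulary described above can be sketched for a single position. This is only an illustration under stated assumptions: the three-token vocabulary and all probabilities are made up, the student stands for the model without the feedback in context and the teacher for the same model with it, and the KL direction (student relative to teacher) is an assumption, since the talk only says the KL is taken between student and teacher.

```python
import math

def log_ratio(teacher, student):
    # Per-token signal over the vocabulary: positive means the teacher,
    # which has seen the feedback, wants that token to become more likely.
    return {tok: math.log(teacher[tok] / student[tok]) for tok in student}

def kl_divergence(p, q):
    # KL(p || q) summed over the whole (toy) vocabulary.
    return sum(p[tok] * math.log(p[tok] / q[tok]) for tok in p if p[tok] > 0)

# Hypothetical next-token distributions at the position right after n,
# where the sampled rollout continued with a plus sign.
student = {'+': 0.70, ')': 0.25, ',': 0.05}
teacher = {'+': 0.05, ')': 0.90, ',': 0.05}

signal = log_ratio(teacher, student)
divergence = kl_divergence(student, teacher)
```

Here the plus token gets a negative signal, the closing parenthesis a positive one, and the token the teacher did not reconsider gets zero, which mirrors the colored plot the speaker is pointing at.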
And in contrast, SDPO, which is what we called on-policy self-distillation in this RL context, would ask: was this particular token good or bad in hindsight, given the additional context? And then the policy would comment on each individual token, or even on each individual possible next-token prediction. And so one natural setting that we looked at was the RLVR setting, the typical setting where RLVR methods such as GRPO are applied, which is the setting where you learn from success and failure, where you have some environment that just tells you whether your response was correct or wrong. Our first question was: even in this kind of environment, where we don't get any additional signal from the environment, can we still benefit from better credit assignment? And how you would do self-distillation here is very simple. Similar to GRPO, you would do multiple rollouts per question, and then you would have some correct attempts and some wrong attempts, and then you can just put the correct attempts in the context of the teacher for the wrong attempts, in the same way as Idan described how you would use self-distillation when you actually already have correct demonstrations. Here you would use these correct demonstrations, but they would be generated by the model itself as it explores in the environment. And so here's one example. This is on a chemistry reasoning data set. And we saw two things here. 
We saw that SDPO both converges here to a much higher accuracy and converges much faster in training wall-clock time. This fast convergence we generally saw because, as we discussed, it just provides much denser credit assignment, so a much richer update signal. And another thing, which I know we don't go into here right now, is that we also typically saw that it learns more efficient reasoning, so it uses fewer of these typical reasoning tokens, such as 'hmm' and 'wait', which GRPO tends to produce. >> And another thing that I think I've seen in the paper is that the responses are also way shorter than the GRPO ones. Do you have any thoughts about why exactly? I know that with the standard GRPO formulation, when it's wrong, it's going to be wrong for longer, there is a bias for that. But why is it that much more efficient, that much shorter? >> Yeah, it's a great question. I think it touches on what we discussed earlier, namely that in hindsight, when you critique how you would have solved the problem, you often think: okay, I could have shortened my response, there was a more direct path to responding. 
And so to some extent I think it's a function of this hindsight policy, in a more informed state, being able to tell the policy: okay, this here was a circular reasoning loop that was unnecessary, and it's penalizing the policy for that. At the same time, though, I think this also points to a particular problem in GRPO which has been widely studied at this point. Part of the core intuition when GRPO was first introduced, and then also when the DeepSeek R1 paper came out, was this idea: okay, wow, we can see that our policy learns to produce longer and longer responses, and has these kinds of 'hmm' tokens, and starts to actually, quote unquote, think. A lot of work since then has shown that, not always but sometimes, these additional think tokens are not necessary for good generalization. And I think they're surfacing this form of weak credit assignment that GRPO performs, in the sense that GRPO will just up-weight things that tend to work better on average, on a very coarse average, which are then sometimes these approaches which just think things over five times in the same way. And we did see this in the paper: we got circular reasoning in GRPO, even explicit circular reasoning, where the model trained with GRPO would say, I'm running in circles, or things like that. Maybe one thing which I wanted to mention, which we don't have to go into in that much detail because we already covered it at this point, is what we generally saw in the RL case, but also in the cases that Idan described before where we compared against SFT when learning from demonstrations: as you scale models, you get better in-context learners, and those translate to better self-teachers. So better teacher signals, which then lead to better student models as we train them with on-policy self-distillation. 
Maybe that's a very intuitive thing. So the one major thing which we want to cover, which we're very excited about, is how these methods can really unlock new data modalities that we can use for training. One very natural one is this idea of learning from rich feedback. Here's just one very practical example to illustrate this, where the question is how much impulse did the thrusters generate for the Mars Climate Orbiter, and let's say the model's answer would be 100 pound-force seconds. As I mentioned, in normal RLVR the model would just receive a binary reward, and so in this case, because the answer is wrong, it would be just a negative reward. But in many real environments you would have some richer signal, some denser feedback loop, where the feedback would be much more informative as to what the model actually did wrong. So here, in this case, that it should answer in newton-seconds. There are many examples of this kind, and two that we looked at primarily: there are code environments, for example, which produce runtime errors, failed unit tests, etc., and there are real user conversations. I want to focus maybe on the real user conversations. One really nice experiment, led by Thomas, asked: okay, can we learn from raw user conversations, user conversations in the wild, and can they actually improve models? What we did here is we took 14,000 real-world user conversations from a dataset called WildChat, which was produced some years back by Allen AI, and then we did kind of the natural thing which we discussed: we split this into triplets, where each triplet consists of the prompt, or the history of the conversation up to that point. 
Then the model response, and then the follow-up user response, which would indicate maybe a possible issue with the previous assistant response, and we would train on that with on-policy self-distillation. Then we evaluated the trained model on a diverse suite of benchmarks: for alignment, for instruction following, for reasoning, for creative writing, and also for knowledge, here with MMLU-Pro. What we saw is that on several of these benchmarks, in particular for alignment, reasoning, and creative writing, the model substantially improved. Here this is with Qwen3 8B, but we also tried this with other models, and this was very surprising to us. Why? Well, because this data that we trained on is in some sense free data. As Idan also mentioned before, this is data which we just get through deploying our model and letting it interact with users. The user conversations are raw, and it's a very weak dataset because it's only 14,000 conversations, right? This data has not been so useful in the past because we didn't know how to learn from it, so we didn't really have good datasets that collected a lot of it. And of course the big companies that perform a lot of inference have a lot more. So most compute today is spent on inference, but we didn't know how to leverage this interaction for training. What we're excited about in terms of this result is that it seems that on-policy self-distillation enables scalable learning without requiring explicit rewards, just by raw interaction with the environment and receiving textual feedback. So this is one example of how you can learn from this raw feedback from humans, but it's across a population of humans, right? So it's still a typical post-training objective. 
The one last thing which we wanted to discuss in these slides is how we can use on-policy self-distillation for a really, quote unquote, continual learning system that is deployed in the wild, and I'm going to give a few examples of what that even means and what such a system could look like. One early example that we started looking at was how we can discover solutions to very hard problems. So we did want to go with super hard problems. What we did is we took coding tasks from LiveCodeBench that the model was not able to solve across a lot of attempts. What we were interested in is how quickly the model would discover the solution to that task, and we quantified that through this discovery@k metric. This is just the probability of solving a task within k environment interactions, right? So this is what I'm going to plot here. The tasks we considered here were really hard tasks: these were tasks where the pass@64 was less than 3%, meaning that if you were to sample 64 solutions 100 times over, in only three of these 100 cases would the model actually have sampled any correct solution in those 64 attempts. The simplest baseline is what's typically called best-of-k, which just repeatedly samples solutions from the base model, and we wait until it has found a solution. So actually, for best-of-k, this discovery@k metric is the same as pass@k, because here there's no sequentiality, it's just repeated sampling. In some sense the best-of-k baseline is what would correspond to trying to run something like GRPO here, right? Because GRPO does not have any learning signal until you get the first solution. And of course, the feedback from the environment here would just be runtime errors and code unit tests. 
And so what we saw is that on these hard tasks, running self-distillation got a significant speedup over best-of-k, and also over another baseline, which is just a multi-turn baseline that keeps all of the conversation history with the environment in context until it runs out of context and then uses a first-in-first-out queue. What this experiment told us is that self-distillation can really learn to solve hard tasks even before it has ever solved the task, right? Just by the teacher providing directionally accurate feedback that points towards how to solve a task. This was really one of the first ways in which we applied self-distillation online, in a continual way, when given one particular task. Based on this, people took this way further, and this is super exciting. I think very shortly after we put out this paper, someone on Twitter had this idea of continual code. So like Claude Code, but running a local model, and that model is actually learning as you go. It's not just saving things in context or scaffolding, it's actually updating the model parameters. Here you have a GIF running, and basically as he interacts with the model, whenever the model does something that he doesn't want, the model does some update. So here it's thinking, and then he will reject the edit and say that he wants the helper to be minimal, plus some other instruction, and then the model would actually do a training step. So it was a first sketch of an idea of a continual learning system that actually updates weights as you interact with it naturally. And then I think over the last few weeks this has been picked up by a library called OpenClaw RL, which is running on-policy self-distillation under the hood but is extending this way beyond just coding agents, putting this into OpenClaw. So having your agent interact with whatever tools you give it access to, and then having the agent actually learn over time and in an asynchronous fashion. 
All these things we're super excited by, because they really point to a future where you have a model that learns online as you are interacting with it. So this was kind of an overview of that. Just one more mention: while there have been the three papers that Idan and Thomas and I have been leading, there's also been a lot of other really cool research that has come out, and here are just a few papers, all research that has come out in the last month or so. So yeah, that's all, and thanks a lot. I'll also add one thing: both SDFT and SDPO, the two versions of the self-distillation algorithms, are available with TRL. In the last week the Hugging Face people merged an implementation into their codebase, so it makes it much easier for everyone to play with these ideas. Thanks a lot for the presentation, you guys. There's Shannon Sands from Nous Research. They want to know, okay, this is what you guys have been working on, what are you guys working on now in this direction? Can you share a bit about this? >> For me there are two things that remain to do, at least in my eyes, in this field. One is just scaling up, and that's kind of hard to do in academia, and we hope that various frontier labs will just take these ideas and scale them. The other one is not in the scale-up but more in the last example that Jonas gave, where you have one user interacting with one agent and you want to improve on this conversation. It's on the opposite side: it's about sample efficiency where you don't have scale. 
You just have one user provide a few points of feedback, and the question is how you can learn from that. In-context learning can learn from that: we know that you can put even one sentence into the context and it will change the model behavior quite well. But self-distillation, although it's able to propagate the same kind of behavior change, is still limited by the fact that it's doing gradient descent, basically, and we know that gradient descent only takes small steps, pushing the model a bit at a time. That's inherent in the algorithm. So currently we're looking into ways to make the supervision even denser, to make the update such that even with a single point of environment feedback you can change the model quite a lot without forgetting. An extreme case would be that if the user says never use the letter F, that's it, the model will stop using the letter F with one update to the weights. For me that's the dream. Of course this is a silly example, but in the same way that I don't need to tell you things a thousand times in order for you to learn a concept, I want models not to need to be told a thousand times in order to learn a concept. So this is what excites me these days, and we have some ideas that hopefully in the next months or so we'll share about how to do it. >> Yeah, same here. I do think this is at least one of the very central immediate questions: can we have a learning paradigm that can actually do parameter updates but be as sample efficient as humans are? Right? 
This is a big question and has been a big goal of the field. I think part of the magic of in-context learning has been that it is somehow extremely sample efficient, and I think that's what has been so exciting about it and also so useful about it. But, as we mentioned, it's inherently transient. So the question of whether we can do lasting learning, but with the same sample efficiency as, say, in-context learning or humans, is a very appealing question, and I do think there's a lot of promise there. And then of course there are other things, like the thing which I sketched out earlier a bit, this question of how a model can elicit, or find, or seek out the right feedback from its environment. >> There was a question from Jasper Lou from Figma. He asks: has there been any work exploring this in a more subjective feedback space, where the feedback is not necessarily hard feedback? You showed that with WildChat, but is there any hope when the feedback is not clear at all, like when the model needs to kind of interpret it or generate its own feedback? I think yes, there have been several things which we looked at. In terms of subjective feedback, something which we didn't mention in the presentation, and part of what Thomas also looked at, was personalization to user preferences. 
So as opposed to the WildChat example, where we just did general alignment to a large population, the question is: okay, if you interact with a user and the user exhibits certain preferences, like I don't want emojis, or I don't like sycophancy, or I want short responses, the model is able to pick this up through on-policy self-distillation. So that is a subjective type of feedback. Then in terms of an imperfect type of feedback, I think that's more what we tried to look at in the context of this solving-hard-tasks LiveCodeBench >> thing, where the feedback is inherently incomplete. It may be directionally right, but if you just see a runtime error, it doesn't tell you yet how to solve the problem, right? It might constrain your solution space in some way, but it doesn't leak the answer. I think both are important, and there are lots of other cool things we can do. >> I'll add that in cases where the feedback is not very clear, where one needs to reason and extract what is the actual change that needs to be done, a very cool idea that one can do, and we didn't have time for it, but I'm sure it will work and I hope someone will take it up, is just to let the model reason a bit. You give it the context, and just before you get to the part where there is a new answer that you do self-distillation over, give it the opportunity to do some chain of thought and understand what is the thing you need to take out of this additional information. I'm sure it will improve things, and again, I'm sure we'll also see a paper doing that in the next few months. Someone will take it up. >> This is good. Also, on the quality aspect of the demonstrations, what's your thought on that? Let's say you had 100 okay, not-high-signal pieces of feedback versus 10 very excellent ones. 
I know the model likes to have multiple exemplars and then iterate on them. In the current paradigm of self-distillation, which would it prefer here: the high-quality ones or the repeated feedback ones? So at least from the experiments I did, I think that the hundred, you know, more examples but of medium quality, will give you better performance eventually. As long as there is some signal that you can extract, even just telling the model, here's some feedback, think again about what you did, give me a better answer, will push you forward. But I'll say that this is not really about the objective. It's just because, as we mentioned, gradient descent learning likes a lot of examples, likes coverage. It helps generalization, helps find wider minima, stuff like that. So it has a lot to do with the optimizer that we use. >> Yeah, I would agree with that. I think it becomes a bit of a subtle question when you consider that the feedback that your student gets need not necessarily come from new data that you obtain from the environment. For example, in Idan's setting, when you have expert demonstrations, like ground-truth solutions, which may be very detailed, you could have your model sequentially generate attempts, train with the feedback, then generate attempts again, which may be a bit better but have not fully solved it yet, and then get feedback again. So these are examples, I think, where right now you would still prefer to get feedback a lot of times, right? But the feedback does not require, in that context, another data point. >> True. True. 
Oh, >> it reminds me of how I was learning verifiable subjects like math or physics. I had this kind of red-green-yellow method where I would just try the exercise, and if I completely messed it up I would put a red there, but I would still look at the answer and try to figure my way out, right? Then I'd just move on. If I kind of didn't get it right but I knew exactly why, I put a yellow there. But then all the green ones that I got first try, I never did them again. I got them, so what's the point, right? Then I'd just do iterative passes on the yellows and then iterative passes on the reds, and I still got good learning signal out of it. At some point there are the three last ones that are still kind of all messed up, and in that specific case I know that I need to get better feedback, so then I went to get better feedback from the teacher or whatever. But I feel like it's kind of the same situation, where I got multiple exemplars of the same stuff, right, but with different understanding. While you go through the rest of the training dataset and then come back to those that you messed up, the feedback will be different because you're going to do it a bit differently. You can learn something in one specific exercise that will help you in those that you're struggling with. I feel it's a similar kind of vibe. >> And it also reminded me of how I study for exams, for sure. >> [laughter] >> Last question, and I want you to think hard here. Do you think there's a setting where self-distillation will be kind of strictly poorer than standard RLVR or GRPO? Do you think there's a setting where self-distillation will struggle compared to GRPO? >> One setting that is kind of obvious is when you have weak models, right? 
If you're trying to improve a 350-million-parameter model with very weak >> in-context learning capabilities, it will not work the best. I'll say that we tried here at MIT to apply self-distillation to robotics, to vision-language-action models, VLAs, and it just doesn't work. These models do not have in-context learning capabilities strong enough to actually learn, while GRPO, even simple REINFORCE, can improve these models quite a lot. So that, I'll say, is the obvious answer. More than that, I can say that there are cases where I don't know if it will do poorly, but it's just not really needed. If you just want to sharpen the distribution, when whatever you want to learn is already kind of >> inside your distribution, say you always get around 50 to 60% success rate and you just want to push the 50 to 100, inside a group, like you sample a group for a prompt, and out of however many responses you get, most of them are correct and you just want to make sure all of them will be correct, then self-distillation is not really needed. Maybe it will work, but I'll just say avoid the complication, you know, run >> pure GRPO, because then there is nothing to gain from external knowledge, additional knowledge, really good credit assignment, stuff like that. That is where self-distillation really shines, and it will not help you here. Yeah, maybe to pick up on that point. I think normally in RL there is this trade-off between methods that are unbiased but exhibit a lot of variance and methods that are maybe biased but have much smaller variance. 
In many ways you can think of self-distillation in that lineage, as a method that may be a bit biased, depending on how good your model's in-context learning ability is, etc., but which has much less variance in its update signal. That of course is a trade-off, right? Which one is better often depends on whether you're compute bound or data bound, and on how much compute you have available in total. So right now, as Idan said, if you have a large pipeline and unlimited compute, and you have your data environment set up to give you a 50% success rate in every group, you should probably run an unbiased method, and that will tend to be more stable, right? >> Long term, it's very hard to predict what will happen. >> Okay, that makes a lot of sense. >> But yeah, I can say that the things I personally am most excited about are the settings where there isn't even a method that can learn from that type of data effectively. >> So can you give an example? >> Yeah, for example, learning from user conversations, or learning in this kind of online fashion like in OpenClaw RL, where you just interact with your environment and your environment returns you whatever. >> And >> because then you massively increase the amount of data that you can shove into the model and still make it learn in a stable fashion. >> Agree. Yeah. >> Yeah. >> And if there's one thing that we learned over the past two decades in deep learning, it's that it is all about scale, you know. >> Find ways to shove more and more data into your model. >> Cool. Fantastic, guys. This was absolutely awesome. Really thankful to you for taking the time to come and answer all of these questions. And that's it for today, folks. 
If you're interested in diving more into this whole method of self-distillation, check out the links in the description. I tried to put in as many relevant papers as possible. And if you have any questions, don't hesitate to shoot them in the comments.",
"timestamped_text": "0:00 Currently most post- training of large\n0:02 language model is done via reinforcement\n0:04 learning method like GRPO. The way it\n0:06 work is you take a problem you have the\n0:08 model generate a bunch of rollouts you\n0:10 score each of the roll out with a\n0:12 verifier usually did you get the right\n0:14 answer yes no and then you update the\n0:16 weights based on which attempt were good\n0:18 and which were bad. The issue is that\n0:20 this reward signal is extremely sparse.\n0:23 You get one score per entire rollout. So\n0:26 the model has to figure out on its own\n0:28 which of the token in a thousand token\n0:30 chain of thought actually matter. This\n0:32 is a pretty brutal credit assignment\n0:34 problem. In this video we're going to\n0:36 explore a family of method that kind of\n0:37 sidestep this bottleneck entirely using\n0:40 something called self distillation. The\n0:42 core idea is that in many of these\n0:44 environment you actually already have\n0:46 rich textual feedback. Things like\n0:48 compiler error runtime exception judge\n0:52 evaluation that sort of stuff. These\n0:54 explain why an attempt failed and\n0:57 current oral method just don't use that\n0:59 information and in the extrem they\n1:02 collapse all of this into a binary\n1:04 reward. What self distillation does is\n1:06 it takes the model own give it back to\n1:09 the same model but now condition on the\n1:12 whatever feedback you got from the\n1:14 environment and it let it re-evaluate\n1:17 its own tokens. In this distillation\n1:19 setup, the teacher and the student are\n1:21 literally literally the same model, but\n1:23 the teacher just see more context. This\n1:25 create a dense token level learning\n1:27 signal while at the same time being\n1:29 cheap to produce since it's literally\n1:31 just a single forward pass over the\n1:33 existing rollout. This approach was\n1:35 developed by two groups working\n1:36 together. 
We have the people at ETH\n1:38 Zurich with Jonas, hope I pronounced the\n1:41 name right, who developed the SDPO for\n1:43 the reinforcement learning setting. And\n1:45 then we had Idan Shenfeld at MIT who\n1:48 developed the SDFT for continual\n1:51 learning from demonstration. Both paper\n1:53 dropped around the same time in January\n1:55 2026 and they actually like met earlier\n1:58 on in 2025 to discuss this idea. What I\n2:01 really like about this paradigm is that\n2:02 it's simple. It bootstrapped learning\n2:04 using the model own in context learning\n2:06 ability and it just literally actually\n2:08 worked really well. On the SDPO side, it\n2:10 reaches GRPO accuracy six times faster\n2:13 in wall clock time and produce reasoning\n2:15 traces that are up to 11 times shorter.\n2:17 On the continual learning side, SDFT let\n2:20 a single model learn multiple skills\n2:23 sequentially without forgetting the\n2:24 previous one which standard SFT\n2:26 completely fails at. This family of\n2:28 method already getting picked up in\n2:29 production with system like OpenClaw RL\n2:32 and frontier open source model like GLM5\n2:34 were using similar approach in their\n2:36 post-training pipeline. Jonas and Idan\n2:38 will walk us through both method and how\n2:41 self distillation work at a high level\n2:43 and I'll be asking them a whole bunch of\n2:44 questions. Thank you to everybody that\n2:46 sent question my way uh during the live\n2:48 stream. It was super helpful. Enjoy.\n2:50 >> So hi, happy to be here. Um Idan\n2:53 Shenfeld, I'm currently a PhD student in\n2:56 MIT working with professor Pulkit\n2:58 Agrawal. Um my research is focused on\n3:02 reinforcement learning algorithms mainly\n3:04 for LLMs but also for robotic\n3:06 applications. 
Uh before my PhD uh I did\n3:10 research on reinforcement learning in my\n3:12 undergrad as then work on autonomous\n3:14 driving for a bit was part of General\n3:16 Motors big autonomous driving project.\n3:19 did a lot of cool applications of\n3:21 reinforcement learning for the real\n3:22 world there trying to solve problem that\n3:24 actually involve a lot of moving parts\n3:27 like other drivers for example\n3:30 pedestrian etc. Um, I was also at Deep\n3:34 Mind was part of the post training team\n3:36 as an intern at summer 24. Um, yeah,\n3:41 >> man. You were everywhere. [laughter]\n3:44 How old are you? You you're you're\n3:47 already done like the average student. A\n3:49 bit older than Leonas, that's for sure.\n3:52 >> But I I wanted to ask you this um\n3:54 because I saw that you were like doing\n3:56 work that was more practical and then\n3:58 like doing the research uh stuff. Uh\n4:00 what would you say is like the biggest\n4:02 difference in like mentality that you\n4:04 have when you do that type of like more\n4:05 practical minded work that you know is\n4:07 going to go into like I don't know like\n4:09 a car or something like that and like\n4:11 some the more research u long-term type\n4:14 of stuff like how's the switch going in\n4:16 your in your uh in your work?\n4:18 >> Yeah. So I think when you work on an\n4:20 actual product actual application what\n4:22 you really care about is performance. uh\n4:25 understanding is just a tool to get the\n4:28 right performance and sometimes you know\n4:30 like as much as understanding is fun and\n4:32 research is fun just going through your\n4:33 data set and cleaning it give you way\n4:35 more performance boost than anything\n4:37 else. 
So it's kind of you need to kind\n4:39 of stop yourself from focusing too much\n4:42 on what is you know like cool or fun or\n4:45 new and do the like you know grind work\n4:48 first and only then like put effort. While\n4:50 I feel and this is one of the reasons I\n4:52 like like went from industry to PhD that\n4:55 as a researcher that don't care about a\n4:58 specific application\n4:59 the thing that we bring into the world\n5:01 is new understanding is new knowledge\n5:04 and this require completely different\n5:06 like perspective I don't care about like\n5:08 being state-of-the-art on any benchmark.\n5:10 Uh I also think and this is in general\n5:12 good that the community kind of went\n5:15 away from that. If you saw papers from\n5:17 you know, I don't know, 2018, 2019, 2020 it was\n5:20 all about being number one on the\n5:22 ImageNet leaderboard stuff like that and\n5:24 I think that these days you know people\n5:25 still use benchmark like they produce\n5:27 numbers on a and stuff like that but\n5:29 it's not really about being number one\n5:31 because there is always a bigger model\n5:33 that will be number one. Um, it's all\n5:36 about what kind of new understanding,\n5:38 new knowledge, new perspective you give\n5:40 to the community. And I think that this\n5:42 is much more fun than just, you know,\n5:44 playing the numbers game.\n5:46 >> Yeah. Yeah. Yeah. I I agree here. I I\n5:49 totally agree. 
I also agree that like to\n5:51 be honest like because I have like a\n5:53 consulting practice where I literally\n5:54 just go with like businesses and we try\n5:56 to implement the stuff and they always\n5:58 want to go with for the flashy thing but\n6:01 most of the gains are just like hey\n6:04 let's let's look at each of the data\n6:05 point one by one right\n6:07 >> and see like and see how how messed up\n6:10 they are and then like you clean it up\n6:12 and then you automatically gain like a\n6:14 massive gain and then you're kind of\n6:16 done like everything else is extra um\n6:19 Cool. This is really nice. Um I had the\n6:21 other question but I'm going to leave\n6:22 them for later. Uh Jonas you can go uh\n6:25 go ahead.\n6:25 >> Yeah. Awesome. Also great to be here.\n6:28 I'm Jonas Hübotter. I am PhD student at ETH in\n6:32 Zurich working with Andreas Krause. Yeah.\n6:35 My research has primarily been focused\n6:38 on something that's called test-time\n6:39 training which is effectively this idea\n6:42 that you can continue training a model\n6:45 at inference once it's been given a\n6:47 task. And my primary motivation is the\n6:50 question of how can we have models that\n6:53 are at inference time deployed in a new\n6:55 environment and then continue learning\n6:58 and exploring within that new\n7:00 environment to eventually solve very\n7:01 hard tasks which initially were\n7:03 completely out of distribution of their\n7:05 initial abilities, right? Um and that\n7:09 entails a lot of different questions and\n7:11 of obviously it touches on continual\n7:13 learning. It touches on this question,\n7:15 how do you do you do effective\n7:18 exploration and a lot of other questions\n7:20 as well. A lot of my prior work let's\n7:23 say to this was primarily focused on\n7:26 this question of how do you do effective\n7:27 exploration within that new environment.\n7:30 Yeah. Before that I studied computer\n7:34 science in Zurich and in Munich.\n7:38 >> Good. 
And I had a question about like\n7:40 your um Efficiently Learning at Test-Time\n7:43 active fine-tuning of LLM influence uh\n7:45 sorry Active Fine-Tuning of LLMs it feels\n7:48 like really like um what's it like um an\n7:51 extension SDPO of like uh of that stuff\n7:54 that you were doing like were you really\n7:56 influenced by that work and just like\n7:58 the the next logical step\n8:00 >> uh to to some extent yes right like this\n8:03 other this other work that you mentioned\n8:05 in that context we were asking okay how\n8:07 can to continue training and improving\n8:10 an LLM at test time. Once it's given a\n8:12 task through gradient descent, there's\n8:14 obviously there's several aspects that\n8:16 are important when you want to improve\n8:18 your LLM. There's important what data do\n8:21 you do you learn from? And that's the\n8:23 question that we asked in that work. But\n8:26 then as we discovered in subsequent\n8:28 works, that's not the only question.\n8:29 Another very important question is what\n8:31 loss are you optimizing? and and so\n8:34 especially when you learn at test time\n8:36 it's very important that you one are\n8:39 very efficient in turning your data into\n8:43 gradient updates right um purely also\n8:46 from a practical perspective if you want\n8:48 to run these algorithms and do\n8:49 interesting things with these and\n8:51 demonstrate these interesting\n8:52 environments and we are compute bound we\n8:54 have to make the most out of the signal\n8:56 that we get that's number one and number\n8:59 two is I think in settings we also So\n9:03 usually just need better uh signals and\n9:08 you know just extract as much signal\n9:10 from environment as possible.\n9:11 >> Yeah. Yeah. I agree. 
I really like how\n9:14 like because in my view like the RL\n9:16 stuff there's just so much especially\n9:19 when it's verifiable rewards like there's\n9:21 just so much extra things that like an\n9:24 actual human being that would go\n9:26 through the same process, they will look\n9:28 at this stuff right they will look at\n9:30 the error trace and they will look like\n9:31 at the I don't know the documentation\n9:34 like the feedback from like a demo or\n9:36 something like that\n9:36 >> and then they'll just integrate it in like\n9:38 their learning procedure 100%\n9:40 >> but just giving this at the end like\n9:42 hey you messed it up like oh okay sorry\n9:44 all the rollouts\n9:46 >> bad\n9:47 >> um it always felt a bit wasteful and\n9:50 that we were kind of missing\n9:51 something um and like I liked\n9:55 all the studies where there was a\n9:58 teacher that was trying to generate\n10:00 rollouts, but then like you're all in\n10:02 token space and it's just weird like\n10:04 what you're going to do to match the\n10:07 two kind of streams. So no, I was really\n10:09 excited when I saw like how you guys\n10:11 were doing it. Um that was pretty good\n10:13 >> the original motivation was actually\n10:15 yeah for me coming from pretty much that\n10:17 angle because we were working on coding\n10:20 at the time and it just seemed obvious\n10:23 that you know at that time the coding\n10:25 agents were not as good as they are\n10:27 now. And so when I was trying to make\n10:28 ChatGPT code, it was like this back and\n10:31 forth: ChatGPT generating some code. I'm\n10:33 running it. I'm pasting back the error.\n10:35 And you know, it was at the time\n10:36 where it started to get decent at that.\n10:39 So the intuition was okay, it's already\n10:40 very good at or somewhat it's starting\n10:43 to become very good at understanding\n10:45 what are the sources of the errors it\n10:48 makes.\n10:48 >> Yeah. 
And um I actually before we jump\n10:51 into the presentation, there was also\n10:52 your paper RL's Razor: Why Online\n10:54 Reinforcement Learning Forgets Less.\n10:56 That's also something that is um that\n10:59 got me even more excited is that like\n11:01 with this type of self-distillation\n11:05 method it seems I think it's easier to\n11:07 weave in like different uh learning that\n11:10 you put there. There is some\n11:12 forgetting as the thing goes but\n11:16 it seems to be the type that is normal I\n11:19 would say for like um I don't know a\n11:20 human like if you were to learn like\n11:23 four things in a row right okay you will\n11:26 like forget a bit the first thing and\n11:27 stuff. Um but you will not\n11:29 catastrophically forget all the things.\n11:32 Was this work also like uh helping you\n11:35 like go into that same direction?\n11:37 >> Yeah. So that was a big influence um my\n11:40 RL's Razor paper was a big influence\n11:42 on SDFT. Basically I finished the RL's\n11:45 Razor paper and our conclusion there was\n11:48 that on-policy methods tend to forget\n11:51 less compared to like offline methods\n11:53 like SFT fine-tuning and I was like okay\n11:56 that's very cool we mainly focused you\n11:57 know on like RL as the main on-policy\n12:00 method that people use but I'm like okay\n12:02 my conclusion in my research can't be oh\n12:04 let's just do RL and throw any other\n12:07 kind of learning signal to the garbage\n12:09 you know that's not something that is\n12:10 like there as you said there's so much\n12:12 learning signal as like expressions and\n12:14 hints and\n12:16 >> feedback and so on and I'm like okay we\n12:18 need to find an algorithm that is like\n12:21 >> on-policy like RL but is able to kind of\n12:24 like learn from very diverse kinds of\n12:27 signals um and that's where I like\n12:31 started to work on this idea and using\n12:33 in-context learning you know with\n12:35 distillation to create the self\n12:36 
distillation and actually like allow me\n12:38 to put a small story here about how me\n12:41 and Jonas started to collaborate on that\n12:43 >> go ahead.\n12:43 >> Um so we know each other from conferences\n12:45 like a few years already and we met at\n12:48 NeurIPS um this December and we sat\n12:50 down and you know as usual like we\n12:53 kind of like update each other on what\n12:54 like exciting projects we were working\n12:56 on and he opened his laptop and showed me\n12:59 some slides on like it was the beginning\n13:01 of this SDPO work and I'm like no way\n13:05 that's the exact formulation [laughter]\n13:08 I came up with and I'm opening my laptop\n13:10 and showing him like you know the\n13:12 beginning of like an Overleaf with like a few\n13:14 paragraphs of ideas and a few experiments\n13:16 like okay like we both stumbled on the\n13:19 same like you know idea here um\n13:21 from different perspectives I'll say like\n13:23 Jonas was focused more on the textual\n13:25 feedback that led to the SDPO paper I\n13:27 was focusing more on learning from\n13:29 expert demonstrations that led to the\n13:30 SDFT paper but that's how we were like\n13:32 okay we are working on the same core\n13:35 algorithmic innovation just\n13:37 >> collaborate that's what I found like so\n13:39 exciting at that time Because I think we\n13:42 were both thinking and part of the\n13:44 motivation was that this could be a\n13:46 learning paradigm that could enable\n13:48 learning really from very rich diverse\n13:50 data. But this really showed it to us,\n13:53 right? Because we basically arrived at\n13:55 the same underlying learning algorithm\n13:58 but coming from completely different\n14:00 directions and working on very different\n14:03 problems.\n14:03 >> Yeah. 
And I mean like um because I\n14:06 also really like to like learn from\n14:10 the older papers, not really for the\n14:13 knowledge because like uh when\n14:15 you push too far down there's a lot\n14:18 of noise because they're going to say\n14:19 like oh we think this and then you read\n14:21 it like 30 years later and like dude no\n14:23 like that was not it. Right? But like\n14:25 you see some of these convergences of\n14:29 ideas brewing that you know that in like\n14:31 a year from now they're going to get it\n14:33 right. They're going to get around like\n14:34 a very good example of this is like\n14:37 just like all the ResNet highway\n14:40 network kind of moment in machine\n14:43 learning because there was all this\n14:44 convergence of ideas and then like\n14:48 every time like it's like different labs\n14:50 that don't necessarily talk to each\n14:51 other and then they don't necessarily\n14:52 like also reference each other but then\n14:55 they come into the same direction and\n14:57 then the kind of paradigm that usually\n14:59 wins is like the one that can distill the\n15:01 idea to the simplest atomic unit Right.\n15:03 And then this then gets like tested\n15:06 in many different forms. Um which\n15:09 is kind of what I felt here because\n15:11 like you're at MIT he's at ETH Zurich.\n15:15 Um I was trying to find a connection and\n15:17 like hey wait so like these guys are not\n15:19 100% related. I know this was also\n15:22 exciting validation uh from like a\n15:24 historical perspective. Okay cool. Uh\n15:27 this is pretty good guys. I want to say\n15:28 something about your previous point\n15:30 regarding how ideas kind of converge.\n15:33 >> Yeah,\n15:33 >> I think like ResNet is a great like\n15:36 example. Another great example is\n15:37 attention, right? Attention Is All You\n15:39 Need is not the first attention paper,\n15:41 but it's you know\n15:42 >> there are many. 
I think when it comes to\n15:45 um our work on self-distillation there\n15:47 is a core idea that I start to see like\n15:50 coming up recently and uh Omar Khattab with\n15:53 his RLM work like also kind of pushed that\n15:57 which is that we got to the point\n16:00 where models can start to be the force\n16:03 that pushes their own learning algorithms\n16:06 like they have enough capabilities when\n16:08 you talk to ChatGPT, which means that\n16:10 like we should not train them the same\n16:12 way we train a three-layer MLP. We should use\n16:15 their own capabilities, their own\n16:16 reasoning to drive new learning\n16:19 paradigms.\n16:20 >> Exactly. And I think like two three\n16:23 years ago like in-context learning like\n16:25 I felt it was something very\n16:27 powerful and it tells you a lot about\n16:29 the model, right? Because it knows a lot\n16:31 and it has like learned all these\n16:33 functions that it can recombine. So\n16:36 theoretically if you give it like\n16:37 some like a direction and new\n16:40 information it's not like it's not a\n16:43 dumb rock like\n16:45 it will be able to kind of go\n16:47 and go in a direction and you like these\n16:49 models that we're using, Claude and stuff, like\n16:51 as soon as they go and then they search\n16:53 information on the web and stuff like\n16:56 it's pretty good like it will be\n16:58 able to like recover and then\n17:00 iterate and do their stuff. Um, so\n17:02 if there's a way to just like make like\n17:05 it's more of like allowing them to do\n17:07 the learning, right? A bit like Ilya\n17:09 said like these models just want to\n17:10 learn like just allowing them to have\n17:14 the chance to like incorporate the right\n17:17 signal and the learning. I think that's\n17:19 kind of one of the bottlenecks and if\n17:21 you allow them to do that then\n17:23 they're able to learn a bit like\n17:25 reasoning. 
like a reasoning it's like in\n17:27 my view like as you just start to loop\n17:29 back and back and let them like just do\n17:32 their stuff they kind of like were able\n17:34 to lift a constraint from themsel which\n17:36 is like I'm going to get an answer and\n17:38 then that's it and then we're done right\n17:40 it's now it's more like I'm going to get\n17:42 an answer and then like put out some\n17:44 scaffolding and then like use that\n17:45 scaffolding to be able to generate the\n17:47 right stuff and then then I output the\n17:49 stuff there's a lift directly in\n17:51 performance all across because kind of\n17:53 like this constraint that we've put on\n17:54 the on the main structure is lifted. I\n17:57 feel it's it's literally the same thing.\n17:59 It's like this constraint of like okay\n18:01 now what if the context window reset oh\n18:04 you have to do this again eh but like\n18:06 now it's like yeah you were able to\n18:08 learn from these and then you were able\n18:09 to move your weight into kind of like a\n18:11 a peak in in a value that like make more\n18:13 sense for like this these specific tic\n18:16 user. uh this makes a lot more sense\n18:18 because like you don't have to mess\n18:20 around with like crazy memory scheme and\n18:22 stuff like uh it will have a better\n18:24 sense about like what to do next. Um and\n18:27 if like you like if if you you weave it\n18:30 in so that like there's there's not too\n18:32 much catastrophic forgetting that\n18:34 happened, it's a very viable in my\n18:36 opinion kind of pattern uh to have these\n18:38 model um have.\n18:40 >> Yeah. So we put together some slides. Um\n18:43 what we'll talk about will be a little\n18:45 bit of a unified perspective of these\n18:47 three papers which we put out which as\n18:50 we discussed they're all on the same\n18:52 algorithm. 
The algorithm is exactly the\n18:54 same in fact but they touch on three\n18:56 different perspectives of how that\n18:57 algorithm can be useful. And really\n19:01 special thanks go to Thomas who has also\n19:04 been leading one of these three\n19:06 papers and a super exciting one and\n19:08 we'll cover that as\n19:10 well. What I would like to start with is\n19:13 this observation that in many ways\n19:15 current learning paradigms are imperfect\n19:17 and that surfaces in different ways. So\n19:20 as we talked about one big issue in many\n19:23 learning paradigms is that they lead to\n19:25 catastrophic forgetting. So meaning as\n19:27 you try to learn a new task, you have\n19:30 this catastrophic behavior that you\n19:33 become significantly worse on previous\n19:36 tasks that you previously were good at. And\n19:38 that's obviously not something that we\n19:40 want. Then the second problem I think\n19:42 surfaces in a\n19:45 little bit more subtle way which is that\n19:48 most current learning paradigms\n19:49 require very careful designing of the\n19:52 data that you put into them. So this\n19:54 could be SFT data sets or this could be\n19:56 RL environments. And I think this shows\n20:00 predominantly in the fact that there\n20:02 has been a surge in startups or even\n20:06 like Scale AI now a bit older and\n20:07 already acquired. 
really a surge in\n20:09 companies whose primary aim is to\n20:13 organize um the data and I think a dream\n20:17 of many is to have eventually a learning\n20:19 paradigm where the models are able to\n20:21 make sense of the raw data itself as\n20:24 they interact with the environment and\n20:26 go out in the world and find the data.\n20:27 And then the third one of course is this\n20:30 observation that often systems have some\n20:32 kind of brittle generalization that if\n20:34 you prompt them in exactly the right way\n20:36 they do what you want them to do but if\n20:38 you just change your prompt a little bit\n20:40 they seem not to be able to do what you\n20:43 want them to do. Meaning somehow they\n20:45 have this quote unquote like jagged\n20:47 intelligence. They didn't seem to fully\n20:49 internalize what you were aiming to\n20:51 teach them. And so what we instead want\n20:53 in a learning paradigm of course is we\n20:55 want some system that is able to\n20:57 continuously improve forever without\n20:59 degrading. We want some system that is\n21:01 able to learn from arbitrary real data\n21:04 not some hand-designed or filtered data.\n21:06 So really real data is you just deploy\n21:08 it in some environment. And we obviously\n21:10 want systems that generalize. And so the\n21:13 thing that we will talk about is\n21:15 this thing called on-policy\n21:16 self-distillation. And we really view\n21:18 this as a new learning paradigm that\n21:20 enables continual learning from\n21:22 arbitrary data and makes some\n21:24 improvements on the three desiderata\n21:26 that we outlined.\n21:28 >> What's your take on forgetting? Because\n21:30 like continuous learning is like is one\n21:33 thing but if we're talking about like a\n21:35 system that kind of is learning from\n21:38 your pace and like your way of\n21:41 working here from like a human or like a\n21:43 process. 
Uh but like I said the process\n21:46 in the human is flawed right and then\n21:48 like okay if not finally like that was\n21:49 the wrong direction we need to go there.\n21:50 What's your take on that stuff\n21:53 like the forgetting the\n21:54 adaptation uh in general?\n21:56 >> Yeah I can answer on that. So for me at\n21:59 least like the ability to learn without\n22:02 forgetting is almost a must in the world\n22:05 where you know we have AI models that\n22:07 are actually being deployed like think\n22:09 about let's say Claude or ChatGPT, like\n22:13 according to recent numbers they like\n22:14 process around 3 billion messages a day\n22:17 not learning from these messages it's\n22:19 you know it's a waste of data like user\n22:22 interaction, environment interaction is\n22:23 going to be the largest source of data\n22:26 that we have to train models\n22:28 Um but unlike other sources of data\n22:30 which we have you know in some like you\n22:32 know fixed container and we can choose\n22:34 like when and how we can use it, this\n22:37 data is streaming data. Every day you\n22:39 get it bit by bit and therefore you\n22:41 have to do like you know this kind of\n22:44 continual learning which is essentially\n22:45 learning without forgetting. You have to\n22:47 be able to update yourself every day a\n22:50 bit becoming a bit better, aggregate\n22:51 another new skill another new capability\n22:55 um without like you know starting to\n22:58 degrade on what you know already uh\n23:00 this is true in the general sense of you\n23:01 know let's improve the model from three\n23:03 billion messages and this is also true\n23:05 in the small sense of if I have my own\n23:07 you know OpenClaw and I want my OpenClaw\n23:09 to be like tailored to my\n23:11 preferences um also it will not get like\n23:14 one big data set of my preferences it\n23:16 will get every day I'll tell it a new\n23:18 thing that it needs to remember. 
Um and\n23:22 this is just very different than the way\n23:24 we think about you know classical\n23:25 machine learning where we start with oh assume\n23:27 a data set i.i.d. from some source you know\n23:30 and therefore forgetting is something\n23:33 that really needs to be dealt with. Our\n23:35 agenda um for this talk we'll try to\n23:38 keep it somewhat short is first of all\n23:40 to cover what is self-distillation what\n23:42 are the core ideas behind it and to kind\n23:45 of go through the different parts of\n23:48 post-training that we believe\n23:50 self-distillation can take part of which\n23:53 to be honest is most of\n23:55 post-training so we'll cover stuff\n23:57 like how self-distillation can be used\n23:59 to learn um from demonstrations as a\n24:02 replacement for SFT, from verifiable\n24:04 rewards as a replacement for RL and also\n24:07 to learn from new kinds of um learning\n24:10 signals such as rich environment\n24:12 feedback and real user conversations\n24:14 that are not usable with the current\n24:16 methods. So let's start with like in-\n24:18 context learning. So in-context learning\n24:20 um I think these days we almost take it as\n24:23 obvious like okay everyone knows that\n24:25 it happens but like I still remember\n24:27 3 years ago when like people started to\n24:28 realize that this happens and it\n24:30 just seemed magical like you just put\n24:32 data into context and the model just\n24:34 changes its behavior and is able to learn\n24:37 from whatever data you put into context\n24:39 and it doesn't have to be really nice\n24:41 form like examples. It can be hints, it\n24:44 can be feedback, it can be random\n24:45 instructions. And it also generalizes\n24:47 very well. But, and this is a big but,\n24:50 it's very transient. Like if today I put\n24:53 some context into the model, tomorrow I\n24:55 start a new session and the context is\n24:57 not there, the model goes back to how it\n25:00 was. 
There's nothing that just stays\n25:02 there. And the second, which is\n25:05 another big limitation, is that context\n25:07 windows are bounded. And in the end if\n25:09 again I want to become better at every\n25:11 possible task on Earth I cannot put\n25:13 examples for every possible task in\n25:15 context. And therefore like when we\n25:17 looked at in-context learning we're\n25:20 saying okay we have something that is\n25:21 really good at our hand but we need to\n25:25 somehow compress it. We need to somehow\n25:27 take it and put it into the model weights\n25:30 to make it stop being transient and\n25:32 become something more like that stays\n25:34 there even when we remove the context.\n25:36 And this led us to the core idea of self-\n25:38 distillation. I'll now cover like\n25:40 briefly the algorithm itself. We take an\n25:43 LLM and we operate it in two modes. The\n25:46 first mode is the student mode. We just\n25:48 get the input prompt, some question from\n25:50 a user x, and it outputs a response y.\n25:55 The second mode is a teacher mode\n25:57 where in addition to the input prompt\n26:00 the same LLM also gets an extra context\n26:04 c. This extra context again can be\n26:06 expert demonstration, instruction,\n26:07 feedback, whatever. But the important\n26:09 thing is that now the model is\n26:11 conditioned on another input, the context,\n26:14 and this automatically changes its output\n26:17 distribution. 
Now the output\n26:18 distribution, the responses that the\n26:20 teacher would have produced uh if we\n26:22 sample from it, are kind of different\n26:24 from the student and therefore the core of\n26:26 self-distillation is just to use this\n26:28 teacher and do teacher-student learning,\n26:31 an old paradigm in machine learning where\n26:33 you just minimize some\n26:34 distributional measure and in our\n26:36 case we chose the reverse KL um\n26:38 between the teacher and the student um\n26:40 and take the gradient of course only\n26:42 through the student and not through the\n26:44 teacher because the teacher is the one\n26:46 that guides the learning. This is a\n26:47 distillation algorithm uh very similar\n26:50 to what people have been using quite a\n26:52 lot, big to small model distillation, but\n26:54 with one important change, which is the\n26:57 teacher.\n26:59 Um the model itself is the one that is\n27:02 guiding the learning. Um and why we like\n27:04 it is that this is much more similar to\n27:07 how humans learn like humans don't learn\n27:10 by looking at some example and just\n27:12 trying to mimic it you know one-to-one. Um\n27:15 what we usually do is like kind of we\n27:18 observe, we condition our short-term\n27:20 memory, condition our actions on whatever\n27:23 we see in front of us whatever feedback\n27:25 we got from the environment and we use\n27:27 that to improve our behavior. So kind of\n27:30 the force that pushes the improvement in\n27:32 humans' behavior is humans themselves and\n27:35 we want to give models the ability to do\n27:37 the same. 
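A minimal sketch of the objective described above, assuming toy logits over a three-token vocabulary instead of a real LLM. The helper names (log_softmax, reverse_kl, self_distillation_loss) and all numbers are hypothetical and are not taken from the SDPO or SDFT code. The sketch computes the reverse KL, KL(student || teacher), at each token position and averages it over the response. In an autodiff framework the teacher logits would additionally be detached so that gradients flow only through the student.

```python
import math

def log_softmax(logits):
    # Numerically stable log-probabilities from raw logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher) = sum_v p_s(v) * (log p_s(v) - log p_t(v)).
    log_ps = log_softmax(student_logits)
    log_pt = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_ps, log_pt))

def self_distillation_loss(student_logits_per_pos, teacher_logits_per_pos):
    # Mean per-position reverse KL over one response y.
    # Student sees only the prompt x; teacher sees x plus extra context c.
    # Teacher logits are plain constants here (no gradient through them).
    kls = [reverse_kl(s, t)
           for s, t in zip(student_logits_per_pos, teacher_logits_per_pos)]
    return sum(kls) / len(kls)

# Hypothetical logits for a 2-token response over a 3-token vocabulary.
student = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]]
teacher = [[3.0, 0.2, 0.1], [0.3, 1.5, 0.2]]  # context changed position 0 only

loss = self_distillation_loss(student, teacher)
print(round(loss, 4))
```

Positions where the extra context does not change the teacher's distribution contribute zero to the loss, so only the tokens the teacher would now produce differently drive the update.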
So following this idea, let's say,\n27:40 how do we do it\n27:42 in practice? Given the teacher and the\n27:44 student we can compare their output\n27:47 distributions for every single token. So we go\n27:50 over some assistant answer. In this case\n27:53 um the user said, answer with yes or no: is\n27:55 water wet? And the assistant produced\n27:57 an answer that is much more than the\n27:59 user asked: yes, the water is wet. And the\n28:02 user said, I said yes or no only. Um\n28:05 in hindsight if we take the full\n28:06 conversation, put it in context and ask the\n28:09 model, okay, let's go token by token\n28:11 here over your assistant answer and see how, now\n28:14 that you know that the user said\n28:16 afterwards I said yes or no only, the\n28:18 probabilities will change. Stuff like yes\n28:21 will become more probable in hindsight\n28:23 and everything else becomes less likely\n28:26 in hindsight and this is the core uh\n28:28 algorithmic object that we're working\n28:31 with uh the log probability ratio\n28:34 between the original model and the\n28:36 improved policy. This is actually very\n28:38 interesting because when you take this\n28:41 log probability ratio, there are two\n28:43 perspectives that you can take about how\n28:45 we can use it in learning. One is that\n28:47 we say that this is just a token-level\n28:50 advantage or a token-level reward. This\n28:52 is similar to how in normal GRPO where we\n28:55 train LLMs the advantage is the you know\n28:59 the reward of this answer minus the\n29:01 average of the rewards across all\n29:03 answers. 
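To make the comparison concrete, here is a small sketch that puts the two signals side by side: a sequence-level group-relative advantage as described above (reward minus the group mean; the standard-deviation normalization usually applied in GRPO is left out) and a per-token log-probability ratio between teacher and student. The token probabilities are made-up numbers for the yes-or-no example, the function names are hypothetical, and the sign convention (teacher minus student) is an assumption for illustration.

```python
import math

def group_relative_advantages(rewards):
    # One scalar per rollout: reward minus the mean reward of the group.
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def token_level_advantages(student_probs, teacher_probs):
    # One scalar per sampled token: log p_teacher(token) - log p_student(token).
    return [math.log(t) - math.log(s)
            for s, t in zip(student_probs, teacher_probs)]

# Four rollouts scored by a binary verifier.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rollout_rewards))  # [0.5, -0.5, -0.5, 0.5]

# Hypothetical probabilities of the tokens the student actually sampled,
# before (student) and after (teacher) seeing the follow-up feedback.
tokens = ['Yes', ',', 'the', 'water', 'is', 'wet']
student_probs = [0.60, 0.50, 0.70, 0.90, 0.90, 0.90]
teacher_probs = [0.95, 0.05, 0.10, 0.30, 0.40, 0.40]

advantages = token_level_advantages(student_probs, teacher_probs)
for tok, adv in zip(tokens, advantages):
    print(tok, round(adv, 2))
```

The group-relative signal gives every token in a rollout the same scalar, while the log-ratio signal is positive for the first token and negative for the tokens that go beyond what the user asked for.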
Here we just say okay instead\n29:05 of using that signal let's use another\n29:07 signal but plug it back into the RL\n29:10 algorithm. At the same time we can\n29:12 look at it as a distillation algorithm\n29:15 where, if we sum this log\n29:17 probability ratio across the entire like token\n29:20 space, we get just the KL measure um\n29:23 between as I said before the student and\n29:25 the teacher. Now one nice thing that we\n29:28 like about the um self-distillation\n29:30 perspective is that unlike RL where you\n29:34 get feedback only on the tokens\n29:36 you sampled, if you take the full KL you\n29:39 get a learning\n29:41 signal over the full token distribution\n29:44 which is much much more informative and\n29:48 uh you know we had the whole\n29:49 conversation about like how normal\n29:52 RLVR is just like one bit of information\n29:55 per trajectory. If we do this like rich\n29:57 advantages, we get more bits because now\n30:00 for every token we get some feedback.\n30:02 But if we do the full KL, that's a very\n30:05 rich signal to learn from. And we'll see\n30:07 later how it also leads to um faster\n30:10 convergence. So another nice thing is\n30:12 that we said okay we let the model\n30:14 decide how to learn and when to change\n30:17 its behavior, like how much can we trust\n30:19 it, like maybe this process is very\n30:21 noisy. So in one of our works led by\n30:24 Thomas um he compared basically um\n30:27 different trajectories. So what do we\n30:29 have here? We have like each one of the\n30:31 tables with the colors you see in front\n30:33 of you is a table of trajectories.\n30:36 Each row is a different trajectory and\n30:39 each column is a token. So we have 20\n30:41 something trajectories with 128 tokens\n30:44 each. And basically we said okay\n30:47 and this is similar to before we have\n30:49 like a multi-turn uh conversation\n30:53 where the user gives feedback. 
Now\n30:56 sometimes the user give relevant\n30:58 feedback relevant user follow-up and\n31:00 sometimes the user ask an unrelated\n31:02 question after that and one nice thing\n31:04 that we noticed regarding in context\n31:06 learning in this experiment is that the\n31:08 model will decide to change its original\n31:11 answer only if the user gave relevant\n31:13 feedback. So you see above basically the\n31:16 color is corresponds to how much a log\n31:18 ratio change uh the probability is\n31:21 changed which means that the model\n31:23 decided to um that there is another\n31:26 answer that will be better given the\n31:28 feedback and you see that there is much\n31:30 more color on the top plot than on the\n31:33 bottom plot. Basically, if the model\n31:36 didn't got any relevant feedback, it\n31:38 will not change its answer. Which means\n31:41 that this incontext learning only push\n31:43 the learning only change the model only\n31:46 when we have relevant feedback.\n31:48 >> The whole system seems to hinge on in\n31:51 context learning ability of the model,\n31:53 right? Like it's it's leveraging this in\n31:56 order to kind of like guide the\n31:58 trajectory. And and one of the question\n32:00 was like what if the model is not a able\n32:03 to kind of understand the correct answer\n32:06 right like a how to to like I don't know\n32:08 like a it's not necessarily like not\n32:10 have enough capacity because they're too\n32:12 small but it just doesn't understand\n32:14 what what is the correct answer it will\n32:17 still kind of judge the token output\n32:19 like is this going to lead to kind of\n32:21 just the the the direction being like\n32:23 random or like what happened in this\n32:25 specific case\n32:26 >> yeah in that case if the model like\n32:28 understood that there is some feedback\n32:30 that it should change its behavior but\n32:32 it didn't really understand how and like\n32:34 you know it's able to push it into a\n32:36 good direction. 
Uh then yes like the\n32:38 update can be quite random. The nice\n32:40 thing I'll say is that one um when you\n32:44 take you know usually you do like\n32:46 training over a whole data set of\n32:48 examples, even if some of them the model\n32:51 was not able to understand it will be\n32:53 able to understand others and the you\n32:55 know average gradient will still push\n32:57 you towards positive directions and\n32:59 second that like look at the models that\n33:02 we get, every year we have new models\n33:04 that even with the same number of\n33:06 parameters just have much better\n33:09 capabilities. So if I compare, you\n33:11 know, these days Qwen3 8B to like the\n33:13 first Llama 8B, the in-context learning\n33:16 capabilities of the new Qwen are much\n33:18 much better. And I expect this to\n33:21 continue like growing like the new model\n33:24 of next year and the year after that\n33:26 will just have better and better in-\n33:28 context learning capabilities and will\n33:30 be able to understand users better.\n33:32 >> Right. Maybe one thing I would add to\n33:34 that is I think the other aspect that\n33:37 this points to which is quite important\n33:39 is the feedback itself. And I think in\n33:42 the same way that as humans right we\n33:44 also rely on the feedback that our\n33:45 environment gives. I think to me this\n33:48 points to a super interesting field\n33:51 which I think will open up in the future\n33:53 is the question of how does the model\n33:54 best like seek out the right information\n33:57 from its environment as well, right? In\n34:00 the same way that when we want to learn\n34:02 about a particular problem we have to\n34:03 learn about where to look where to get\n34:05 guidance from. If we search for people\n34:08 that help us Yeah. 
people that help us\n34:10 become better at XYZ, we try to find the\n34:12 right people that actually give us the\n34:15 feedback that we that our policy needs,\n34:18 you know.\n34:18 >> Yeah.\n34:18 >> And I think that will be super\n34:20 interesting in the future\n34:21 >> as a human like uh whenever I'm trying\n34:24 to learn and I had to to learn something\n34:26 like the hardest learning I I had to do\n34:30 it was not because the material was\n34:31 hard. It's just because like I had to\n34:33 kind of fight to get a signal, right?\n34:36 Like I just had the end signal and then\n34:39 I knew I knew it was like something that\n34:41 I had to do a lot of work to to get to,\n34:43 right? And sometimes like there is\n34:44 nobody to ask for. You just have to try\n34:46 a bunch of stuff and then you get like\n34:48 some more kind of uh idea about what's\n34:51 going on. But sometime it's just like\n34:53 it's this the environment was not well\n34:55 set to give me the signal. But as soon\n34:57 as I get the signal, I'm like, \"Okay,\n34:59 that's it.\" I just I just like get\n35:01 another textbook look at it and like\n35:03 okay that's just that's just the stuff\n35:05 it's not it's not that hard. So yeah\n35:06 it's true like the the um the the\n35:09 richness of the the feedback is uh is\n35:11 really important for for learning like\n35:13 in in organic being. I do agree and\n35:16 maybe you know the same way we see like\n35:18 a lot of uh environment design that go\n35:21 into RVR these days we'll see a lot of\n35:23 like environment plus feedback design\n35:26 that go into like training data loops in\n35:29 the future. 
Um we have one um Rubni\n35:32 Carmona uh he's asking like does this\n35:34 paradigm skew the model representation\n35:36 toward metacognition without the\n35:38 substance of actual cognition and I asked\n35:40 him like what he means by actual\n35:42 cognition and he says like if the\n35:43 representation of a task or ability A\n35:46 gets progressively higher while losing\n35:48 the initial specificity. He's talking\n35:50 about like if the model is not like\n35:52 losing a bit of specificity to kind of\n35:55 go a bit more higher level\n35:56 >> I think there are two kinds of loops here\n35:59 there's the metacognition loop which is\n36:01 like kind of like equivalent to in-\n36:03 context learning like how much the model\n36:05 can learn from feedback and kind of like\n36:07 guide itself and there is the inner loop\n36:10 of the specificity which is like how\n36:12 much it can actually solve a problem\n36:15 like solve a question and I feel like\n36:18 here we mainly take advantage of the\n36:20 metacognition loop to improve the\n36:23 specificity. I believe that there are\n36:24 other works and this is a very exciting\n36:27 like line of research to see how we can\n36:29 improve also the outer loop, how to make\n36:32 the model more aware of the fact that\n36:35 it's being used in this kind of like you\n36:37 know self-distillation training so that\n36:40 it will be able to provide um better\n36:43 signal for itself. Actually like um Jyo\n36:46 Pari from my lab here at MIT released a\n36:48 paper around a year ago about self-\n36:50 adapting LLMs that did exactly that\n36:52 basically an outer metacognition loop\n36:55 and inner RL loop where it's basically\n36:58 training the model to give better feedback\n37:01 to itself for training.\n37:04 >> So I believe this is like combining that\n37:06 with a self-distillation kind of\n37:08 algorithms. 
So back in the self adapting\n37:11 adapting language model you did mainly\n37:13 supervised learning in the inner loop\n37:15 but combining the two ideas of improving\n37:17 the outer loop and improving the inner\n37:19 loop can be quite powerful. What are you\n37:22 guys thought? Because we're talking\n37:23 about continuing learning. Maybe you're\n37:25 going to you're going to talk about it a\n37:27 bit, but like a very very long horizon\n37:30 hard task where the feedback is rich,\n37:33 right? Like you are able to get some\n37:35 feedback about like if you're going a\n37:38 bit in in the right direction, right?\n37:40 What do you think here? uh would it\n37:43 would it be a useful kind of methodology\n37:45 to kind of refine the model uh as it is\n37:49 like going through the context in this\n37:52 let's say in RS like RM or whatever it\n37:54 is and then is able to kind of without\n37:56 blowing its context window uh manage it\n37:59 properly but like kind run these\n38:01 experiment get some feedback and know if\n38:03 like it's going into the right direction\n38:05 and like inject this into its weight do\n38:07 you know that this do you think that\n38:09 this has like some utility and like this\n38:12 more automated researchish like a long\n38:15 context um hard task.\n38:17 >> Yeah, I think we'll we'll have some\n38:19 early examples of that later uh that\n38:21 we'll that we'll cover. But I\n38:24 fundamentally agree I think 100%. Um to\n38:28 me something that also seems interesting\n38:29 is it seems to me kind of intuitively\n38:33 that as humans we do this reflection on\n38:36 various let's call it horizons on\n38:38 various over various frequencies both in\n38:42 terms of very like narrow and shortterm\n38:46 feedback that we get from our\n38:47 environment as we interact with it like\n38:49 immediate responses that the environment\n38:51 gives gives as well as over longer\n38:53 periods of time where I don't know we\n38:57 recapitulate. 
Okay, now that you know I\n38:59 talked to my supervisor about what I've\n39:02 done the last two weeks and he told me\n39:03 XYZ, maybe I [laughter] should have done\n39:05 something different.\n39:07 >> Yeah. But that that's what was in my\n39:09 mind because like um every time I was\n39:11 doing like every time I was doing\n39:13 research like looking back when I was\n39:15 done I was like that was like max four\n39:18 weeks worth of work, right? But\n39:21 throughout I was getting all sort of the\n39:23 signal that I had to untangle and try to\n39:25 cobble up the feedback in order to\n39:27 figure out like the direction like and\n39:30 like I I knew when I was off off, right?\n39:33 And the direction even though I was\n39:35 dumb, right? I knew like this was not\n39:37 the right direction for sure, right? So\n39:39 I had to kind of like go over there and\n39:41 kind of mess up a bunch of stuff here,\n39:43 get enough feedback like hey it's it's\n39:46 here and I just continue and then I got\n39:48 the stuff and then when I link them up\n39:50 it was like three four steps. Um but I\n39:53 was able to do it but I needed the\n39:54 feedback from the environment. So this\n39:55 is kind of what I was thinking like in a\n39:57 in a setting where you can get even if\n40:00 it's complex feedback like very very\n40:03 detailed one um you might be able to\n40:05 kind of steer the model to just have in\n40:08 this hands- um direction right uh\n40:12 because like let's say there's a\n40:13 discovery that we want to make now the\n40:15 model has it in its data set and it was\n40:17 trained with it knows right now it knows\n40:20 right but before it knew right it's not\n40:23 terribly different the model from like\n40:25 now than than like when it was trained.\n40:28 So being able to kind of get these\n40:29 signal there, maybe we'll be able to\n40:31 arrive at these kind of result um if the\n40:35 feedback is is rich enough.\n40:36 >> Yeah, 100%. 
I think it's a very very\n40:40 important point that you make there. To\n40:42 me, it's certainly the case that\n40:46 especially as we do as humans reflection\n40:48 over longer time horizons, we do this\n40:52 form of meta update of our strategy\n40:55 where we say once we made the discovery\n40:58 and we reflect on how in the future we\n41:00 can make these these these discoveries\n41:03 better. We don't say okay in hindsight I\n41:06 should have just you know straight\n41:07 oneshotted my discovery. What we do is\n41:09 we try to understand okay what kind of\n41:11 systems can I build or how can I improve\n41:15 my way of doing research so that I\n41:19 will make discoveries faster and we are\n41:20 aware right through some form of trial\n41:23 and error and reasoning that this does\n41:25 not come through better one-shotting but\n41:27 through building better kind of systems\n41:30 and a better ways of probing and getting\n41:32 to answers quickly and so yeah I think\n41:34 like that's super interesting I'm very\n41:36 sure that over this like over this year\n41:40 and and also longer term we will see\n41:42 much more um much more results in that.\n41:46 So before going into like specific\n41:48 examples of learning uh we want to touch\n41:51 back on our desiderata,\n41:54 basically what we expect from a true\n41:56 continual learning algorithm um and see\n41:59 how the algorithm that I just described\n42:02 kind of like check any one of these\n42:04 requirements. Uh so we as we said before\n42:07 what we want is basically performance.\n42:09 We want an algorithm like the model to\n42:11 learn. Uh we want no catastrophic\n42:13 forgetting and we want to be able to\n42:15 learn from arbitrary non-curated data.\n42:18 Um let's start with performance. Um so\n42:22 as I mentioned before our algorithm is\n42:24 an on policy algorithm which means that\n42:26 it's a feedback style algorithm. 
The\n42:29 model the student in a student mode roll\n42:32 out its own trajectories. it's tried to\n42:34 solve whatever task and going over this\n42:37 trajectory it's get feedback. Uh now\n42:40 there is very classic results um from\n42:43 2010 um from the Dagger paper a very\n42:46 famous paper in robotics um that show\n42:50 that basically if you do on policy\n42:52 learning you'll supposed to get much\n42:55 better just performance in your task and\n42:58 why is that let's look at the other\n43:00 option to do like teacher forcing\n43:02 basically to just look at teacher\n43:04 demonstrations and to understand the\n43:06 pain point here let's look at this like\n43:08 two example of like autonomous driving.\n43:10 So we have this car driving this loop\n43:12 and it's follows like it have a bunch of\n43:15 expert demonstrations in blue here and\n43:18 it's trying to learn from them. The\n43:20 problem is that all of these expert\n43:22 demonstrations are only of a car driving\n43:25 quite well in the middle of the road.\n43:27 What will happen that during inference\n43:29 time you know you deploy, you learn, you\n43:30 deploy the model. Now the model is not\n43:32 perfect and sometimes it will get very\n43:34 close to the wall but this area of being\n43:36 very close to the wall has no coverage\n43:39 in your training data. Basically it's\n43:41 creates some train test distribution\n43:44 shift between um the learning the\n43:47 training and the inference um and you\n43:49 get to an area where you didn't learn\n43:52 over and basically you're screwed. The\n43:54 model does not know what to do. 
Now this\n43:57 is kind of extreme like examples but we\n44:00 see it's also true um in any kind of\n44:03 learning as long as you have this like\n44:05 sequential decision making this you know\n44:08 when the model output a sequence of\n44:11 actions or a sequence of tokens as in\n44:13 LLMs if you just follow the teacher you\n44:16 get a lack of robustness that lead to\n44:18 just poor performance um and this was\n44:22 known also for like normal on policy\n44:24 distillation versus offline distillation\n44:26 from a big model to a small model that\n44:29 on policy distillation just give you\n44:30 better performance and therefore we that\n44:33 is one of the reasons we chose to focus\n44:35 on on policy self distillation another\n44:38 one and I think we touched it a bit\n44:40 before about my RL's Razor paper where\n44:43 we showed that on policy learning tends\n44:45 to forget less so we did a very like\n44:48 simple experiment where we took the same\n44:50 task the same set of prompts and train a\n44:53 model with either RL or SFT and measure\n44:56 not just how much we improve on the new\n44:59 task as we see in the x-axis on the plot\n45:02 on the left but also how much we forget.\n45:05 We took a set of eight known benchmarks\n45:08 like IFEval and TruthfulQA, MMLU and so\n45:12 on and we basically evaluated like how\n45:14 much degradation of capabilities we have\n45:16 when we learn a new a new task in this\n45:18 case tool use and something that we\n45:21 found is that RL tends to learn without\n45:25 forgetting while for SFT in order to\n45:27 improve your model you need to sacrifice\n45:30 prior capabilities and this is by the\n45:32 way not only true for LLMs we try with\n45:34 robotic foundation models and even with\n45:36 three layer MLP on MNIST and just to\n45:38 try to give like a short intuition about\n45:41 why is it happen is because that in a\n45:44 lot of these problems there's not only\n45:46 one policy that can get to let's say 90%\n45:50 
success on the new task. It's a whole\n45:52 set of them and which one you converge\n45:55 to will affect how much forgetting\n45:57 you'll have and RL just have a tendency\n46:00 um like on policy methods such as RL\n46:04 have a tendency to converge to one that\n46:06 are as close as possible to the original\n46:09 policy even without any explicit KL\n46:12 regularization or something like that.\n46:14 So this kind of like um implicit bias\n46:19 towards like minimum change keep the\n46:22 models from forgetting more and more and\n46:24 we took advantage of that and said okay\n46:27 if our self-distillation is an on policy\n46:30 self distillation it will also forget\n46:32 less than if we'll do like let's say\n46:35 offline self distillation we just\n46:38 generating data from the teacher and\n46:40 learning from them and so like we know\n46:42 that like we can get good performance\n46:44 And we can learn without forgetting it.\n46:46 Another the question is like can we\n46:48 learn from arbitrary kind of data and I\n46:51 think this is what we really work on\n46:53 quite a lot in our series of papers to\n46:56 to show that this is possible that this\n46:59 is not just like a one algorithm that\n47:01 learned from one type of learning signal\n47:04 such as verifiable rewards but this is\n47:07 like a really core and versatile kind of\n47:11 learning paradigm. and we go through\n47:13 some of these examples now. Yeah. So the\n47:16 first and the most I think intuitive one\n47:18 um that we like to start from is\n47:20 learning from demonstrations. Basically\n47:22 you have the expert demonstration you\n47:24 can put it in context um and ask the\n47:27 model to you know the teacher model to\n47:30 produce its own response which means\n47:32 that given the condition and the right\n47:34 response it will be a true one. And this\n47:37 is like a different use of that. This is\n47:39 the same data that we use in SFT but a\n47:42 new algorithm that learned from it. 
And\n47:45 as we see in the plot here on the right,\n47:47 we see two things again. We have on the\n47:49 x-axis the new task accuracy and the on\n47:51 the y-axis the prior task performance.\n47:54 And we see that SDFT, our self-distillation\n47:57 algorithm is able to both get better new\n48:00 task accuracy again because of that\n48:03 train test distribution mismatch that\n48:05 does not happen in on-policy learning and\n48:08 learn with almost no forgetting which is\n48:11 exactly what we want to get from a\n48:13 continual learner algorithm. And this\n48:16 also enable us to do what we call like a\n48:19 real continual learning experiment.\n48:21 basically not just improving taking one\n48:23 task and saying oh look we learn without\n48:25 forgetting but actually taking a model\n48:27 and trying to like improve it on a\n48:29 series of tasks. So we have tool use\n48:31 scientific QA medical questions and\n48:35 we're asking okay can we get a model\n48:36 that just goes data set by data set and\n48:40 improve on all of them and here uh we\n48:43 see the on policy self-distillation\n48:45 is actually able to do it like you every\n48:48 time you change from one train that data\n48:52 set to another you have a slight\n48:54 degradation of performance but overall\n48:56 you are able to retain and aggregate. uh\n48:58 this is unlike supervised training\n49:00 normal SFT that once you change one data\n49:04 set to the other the performance on the\n49:06 data you learned before automatically\n49:09 dropped. So you are able to only learn\n49:10 one task at a time and not really\n49:12 aggregate. Um and this is for me at\n49:15 least very exciting because this points\n49:17 to the real you know continual learner\n49:19 that we talked about and think that it\n49:21 can be possible. Uh I really like that\n49:24 that figure because like it showed um\n49:27 really clearly like the difference\n49:28 between the two. 
Um but uh I was really\n49:31 curious like it just gut feeling here,\n49:33 right? How much data set diverse data\n49:37 set do you need to pile up here in order\n49:39 to make the on policy self-distillation like a\n49:43 crumble? like do you think it it will it\n49:46 will show like massive degradation at\n49:49 some point or you you still think it's\n49:52 going to be uh robust?\n49:54 >> That's a really good question, right? We\n49:55 still need to like push it to in scale\n49:58 to see where things break. That is kind\n50:00 of like the downside of doing kind of\n50:02 academic research where at some point\n50:04 you're saying, okay, I ran out of\n50:06 compute.\n50:08 But I do think that over like my closest\n50:12 point of comparison is I think Nvidia\n50:15 had the Cascade Nemotron paper where they\n50:18 did RL on a series of tasks and if I\n50:20 remember correctly they had six\n50:22 different task and each one of them had\n50:24 was much bigger data set uh than the one\n50:27 that we used. you know if they did tool\n50:29 use they did like a proper you know like\n50:32 tens of thousands of tool-use problems and\n50:35 there with on policy RL they were able\n50:38 to show that like you can keep learning\n50:39 and basically have no degradation at all\n50:42 even after like if I remember correctly\n50:45 like thousands of GPU hours um and since\n50:48 like the core mechanism the on policy\n50:51 like learning mechanism is the same\n50:53 mechanism I expect SDFT to do the same\n50:57 um being able to just keep aggregating\n50:59 performance. Of course, you can see at\n51:01 some point you have like when you we\n51:03 move from tool use to science for example,\n51:04 you see the tool-use plot drop a bit. 
Yeah.\n51:07 >> By like around 10 15%.\n51:10 >> So I feel like this will happen anyway.\n51:12 You know, it's kind of you move from\n51:13 your local minima to a neighborhood that\n51:16 is still good but not as good.\n51:18 >> You you you kind of sacrifice a bit of\n51:20 like specialization for like\n51:22 generalization which is which is like\n51:24 fair game, right? I think it's it's good\n51:25 because even with human like one guy\n51:28 that just like look at one thing right\n51:31 versus like somebody that has like 17\n51:34 different skill set uh okay like the\n51:37 generalist will be like more useful\n51:38 generally but like when you get to\n51:39 something that is that need require like\n51:41 just doing one you you're much better\n51:43 getting a specialized like person but\n51:46 like looking at this plot and then\n51:48 looking at the run on the right I'm\n51:50 thinking like like during the the I\n51:54 don't know like the frontier\n51:55 pre-training post-training pipeline\n51:58 >> that like does it make sense to we weave\n52:00 in RL a bit earlier or like like self\n52:02 distillation like this earlier in the\n52:04 process versus like just jamming\n52:07 supervised fine tuning in there.\n52:08 >> I I think it completely makes sense and\n52:10 this is actually correspond to what we\n52:12 see frontier frontier labs start to do.\n52:15 Basically if you look at like recent\n52:17 paper you see that they kind of took all\n52:19 of their SFT data and pushed it into\n52:22 pre-training. So now pre-training is\n52:25 kind of a mixture of like random\n52:27 internet data and a very well structured\n52:29 high quality SFT data sets and so this\n52:32 all happen at once. 
So you don't have\n52:34 this sequential training because they\n52:37 know that the sequential training is bad\n52:40 and then after that you just move to a\n52:42 big RL training.\n52:43 >> Yeah.\n52:44 >> Um and in the RL training since it's on\n52:46 policy you can allow yourself to have\n52:48 different phases right math and then\n52:50 code and etc. But at least they stopped\n52:53 doing like a separate SFT at least some\n52:55 of them.\n52:56 >> One one caveat which I would add here is\n52:58 that in some cases you might do some\n53:02 expose yourself sequentially to\n53:04 environments which partially contradict\n53:06 each other.\n53:07 >> Right? We have one example in Thomas's\n53:09 paper actually where you interact with\n53:11 users and it can be the same user and\n53:13 they preference changes. Right? But in\n53:14 the beginning that user wants like u\n53:17 very detailed answers and eventually\n53:19 that changes to very concise answers. Of\n53:21 course, there's no way to be good with\n53:23 respect to both. They're very they're\n53:25 contradictory, right? And so I think to\n53:27 some extent that's that's driving most\n53:30 of the degradation you would see in on\n53:32 policy methods as you stack up your\n53:35 training. 
And so a question if you are a\n53:40 model designer is either how do I sort\n53:45 or organize my tasks?\n53:47 >> Yeah.\n53:48 um in such a way such that you know I\n53:51 minimize the adverse effects so that I\n53:54 kind of prioritize my tasks correctly so\n53:57 that I train last on the tasks which I\n53:59 want to prioritize most or alternatively\n54:01 and this is I think what I've seen from\n54:03 a lot of the recent frontier labs is\n54:07 that they have some relatively small\n54:11 multi-stage\n54:13 training step at the end where they mix\n54:16 so they do is sequential on policy\n54:18 learning for the bulk of the training\n54:20 and then they do have a very small stage\n54:21 where they take some data from all of\n54:23 these different environments and then\n54:25 train with respect to a checkpoint. They\n54:28 do on policy distillation with respect\n54:29 to a checkpoint that was already good at\n54:31 that task\n54:32 >> to just it's just a way of reweighting your\n54:35 priorities at that point. It's not a way\n54:36 of of trying to learn a new a new skill\n54:39 anymore, but it's just a way okay I\n54:41 maybe I first trained on tool use but\n54:43 then in the end I trained on science and\n54:45 medical but it's actually not the case\n54:46 that I want the model to focus on\n54:49 medical and kind of sacrifice its tool\n54:52 use priority but I want to have a more\n54:54 balanced\n54:55 >> I think like it made me think so because\n54:58 I've talked to the founder the sorry the\n55:02 head of AI at Math Academy and he was\n55:04 like he's very heavy on is this is human\n55:07 stuff, right? But it's very avon like\n55:09 having this hierarchical kind of\n55:10 learning paradigm which is kind of like\n55:13 the same thing here like you you would\n55:14 like put the the different training\n55:16 blocks in a way that makes sense. 
So if\n55:19 tool use is going to be needed in one\n55:22 data set in the environment and data set\n55:23 later on you much better learn it here\n55:25 because otherwise you're going to learn\n55:27 tool use while you're supposed to do\n55:28 like browsing and and stuff like that.\n55:30 So I think that makes sense and like I\n55:32 really like the idea of I don't know\n55:34 letting the model figure it out a bit\n55:35 right like you you look at the different\n55:38 kind of data set and then try to like uh\n55:41 conceptualize like yeah I know about\n55:43 these topic roughly I'm going to\n55:46 organize it roughly like that for now\n55:48 and then go into the the sequence. So\n55:50 the second kind of learning that we'll\n55:52 touch upon is learning knowledge. And\n55:55 this is really cool because knowledge\n55:57 for example is something that you cannot\n55:59 learn with normal RL. It's kind of like\n56:02 doesn't matter how many you know like\n56:05 what kind of verifiable reward I'll\n56:07 have. If for example I'll take a model\n56:09 with uh 20 end end of 2024 knowledge\n56:13 cutoff and we'll try to train it on a\n56:16 data set of like what happened in 2025\n56:18 as we did like no matter how many you\n56:21 know how much exploration it will do how\n56:24 many guesses it will have what will be\n56:25 your group size it will not be able to\n56:27 kind of like guess correctly. So this is\n56:30 basically a kind of learning that we're\n56:31 not able to do um with on policy\n56:34 algorithm before but now with self\n56:37 distillation it is completely possible.\n56:39 You just put the text of whatever new\n56:41 information new knowledge you want to\n56:43 give to the model. You give it to the\n56:45 teacher. You put it in context and it\n56:48 does just train itself. We compared it\n56:51 in our paper to um few baseline. 
first\n56:54 of all a normal you know like in context\n56:57 learning or RAG um we also compared to\n57:01 CPT continued continual pre-training\n57:04 which is a very standard of just next\n57:07 token prediction on the document and\n57:10 also what we call like supervised\n57:12 training but is like kind of self data a\n57:15 self-study algorithm that was\n57:17 popularized recently by Microsoft which\n57:20 is taking the document generating a huge\n57:23 data a set of synthetically generated\n57:26 questions on it and then train the model\n57:28 to answer on them using normal SFT and\n57:32 of course we compare it to our on policy\n57:35 self distillation.\n57:37 So the base if we look at the result the\n57:40 base model of course is not aware at all\n57:43 of the new like information from 2025.\n57:46 So it's get zero both in like what we\n57:48 call accuracy both strict and lenient\n57:51 accuracy just we ask the model questions\n57:54 about this event that happened during\n57:56 2025 and expect it to answer and also on\n57:59 OOD accuracy. What is OOD\n58:01 accuracy here? We checked a bit for\n58:02 generalization so this is not like a\n58:04 direct question about you know uh what\n58:07 happened in this storm during 2025 but\n58:10 it's more like a questions that need to\n58:12 use the knowledge from in a more general\n58:16 aspect. 
For example, what were the 10\n58:19 biggest natural disasters in the last 10\n58:21 years where the new knowledge should\n58:25 change the way you answer them because\n58:27 some of these 10 biggest natural\n58:29 disaster happened in the last year but\n58:31 it's not directly you don't like you\n58:33 know it's kind of you don't ask kind of\n58:35 the model specifically on events uh\n58:38 which require this is a way to see how\n58:40 much of the new knowledge was actually\n58:42 incorporated into the what the model\n58:44 actually knows about the world its world\n58:46 model and then we compare it to like CPT\n58:50 uh which do terribly like um barely able\n58:54 to learn anything SFT which we know like\n58:57 from previous work they're doing kind of\n58:59 well, able to get like around 80%\n59:02 strict accuracy but doing not that well\n59:04 on the OOD accuracy so basically if you\n59:06 do SFT you kind of memorize answers um\n59:10 this is a very known problem with like\n59:12 offline algorithm where you do teacher\n59:14 forcing\n59:15 While when you do SDFT, our algorithm, you\n59:18 are able to get better accuracy but also\n59:21 mainly better OOD accuracy which means\n59:23 that you actually incorporated the new\n59:25 knowledge into the model. So to preface\n59:28 that I want to kind of start by\n59:30 discussing in which ways on policy\n59:32 methods or current on policy methods\n59:34 have been bottlenecked and that touches\n59:36 also on what Eden mentioned on why for\n59:38 example they would not be able to learn\n59:39 knowledge. So the first way in which\n59:42 they were bottlenecked is that they\n59:44 methods such as GRPO or general methods\n59:46 in RLVR so reinforcement learning with\n59:49 verifiable rewards receive only a scalar\n59:51 signal per roll out from their\n59:53 environment. And that of course\n59:54 bottlenecks how much they can learn from\n59:57 their environment. 
For example, if you\n59:58 want to learn knowledge through a scalar\n1:00:00 signal that will be very difficult. And\n1:00:02 then the second uh bottleneck is that\n1:00:04 then this already weak signal is used\n1:00:08 for rollout level credit assignment.\n1:00:10 Meaning the policy is not trained in\n1:00:13 particular on a specific token or shown\n1:00:16 that a specific token was good or bad\n1:00:17 but it's just being shown that the\n1:00:19 entire rollout was good or bad. So\n1:00:21 there's here's one example that\n1:00:22 illustrates this and we also had this in\n1:00:25 one of our papers and the example is uh\n1:00:28 the question is write a Python function\n1:00:31 that returns all numbers from 1 to n\n1:00:33 answer briefly and then in a normal\n1:00:36 process and also in our training in on\n1:00:38 policy self distillation we would sample\n1:00:40 a response from the model let's say it\n1:00:42 would be this Python program and as you\n1:00:44 can see it returns a list that ranges\n1:00:47 from one all the way through n so\n1:00:49 including N and then a feedback could be\n1:00:52 not to include N. And so what would GRPO\n1:00:55 do or RLVR methods typically it would so\n1:01:00 in GRPO's case it would just say all\n1:01:03 tokens were bad. So it would receive a\n1:01:05 signal a negative uh reward or at least\n1:01:09 on average with respect to the rollout\n1:01:11 group might be negative and so in that\n1:01:13 case then it would just downweight the\n1:01:15 probability of all tokens. make would\n1:01:17 make all tokens less likely. 
But instead\n1:01:20 what on policy self distillation would do\n1:01:22 is would look over the generated tokens\n1:01:24 and for each ask in hindsight given that\n1:01:28 you know the feedback was not to include\n1:01:30 N what would you have still generated\n1:01:32 that token and as you can see here SDPO\n1:01:35 and this is a real example this is with\n1:01:37 Qwen3-8B SDPO here doesn't change any of\n1:01:40 the next token predictions for any\n1:01:44 of the tokens except for this one token\n1:01:46 plus that follows the N right because\n1:01:49 that plus token was the cause why we\n1:01:51 included N. And so this this is\n1:01:53 basically the intuition. So both SDPO\n1:01:56 uses this richer signal this textual\n1:01:59 signal that describes okay we don't want\n1:02:01 to include N in our list that we return\n1:02:03 and it produces denser credit\n1:02:06 assignment because it actually says okay\n1:02:08 here this token you should change but as\n1:02:10 Eden also mentioned before it doesn't\n1:02:12 only do that over the generated tokens\n1:02:17 it does this over the entire vocabulary\n1:02:20 so at every next token prediction the\n1:02:23 hindsight policy or the teacher asks\n1:02:27 okay any of the possible next tokens\n1:02:29 here at that position should they now\n1:02:31 become more likely or less likely. So\n1:02:34 then in this example you would get let's\n1:02:36 just look at the position of the plus\n1:02:39 because that's uh maybe most intuitive.\n1:02:41 You would not only say that the plus\n1:02:43 here should become much less likely. You\n1:02:45 would also say the alternative how you\n1:02:47 should have continued at that position\n1:02:49 should be to just have closed the list\n1:02:52 right and completed and that becomes\n1:02:54 much more likely as is indicated by this\n1:02:56 blue color. 
So how does this manifest?\n1:02:59 So if we look here now at an\n1:03:01 example where there's this type of\n1:03:04 behavior but aggregated now across\n1:03:06 multiple rollouts you see this\n1:03:07 interesting pattern. So, so GRPO\n1:03:10 basically asks was a response better or\n1:03:13 worse than average across a group\n1:03:15 rollout and then would make all tokens\n1:03:17 either more likely which would be blue\n1:03:19 here or less likely. And in contrast,\n1:03:23 SDPO how we called the um on policy self\n1:03:27 distillation in this RL context would\n1:03:29 ask was this particular token good or\n1:03:32 bad in hindsight given the additional\n1:03:34 context and then the policy would\n1:03:36 comment on each individual token or even\n1:03:40 on each individual possible next\n1:03:42 token prediction. And so one natural\n1:03:45 setting that we looked at was the RLVR\n1:03:48 setting. So the typical setting where\n1:03:50 RLVR methods are applied such as GRPO\n1:03:55 which is the setting where you learn\n1:03:56 from success and failure where you have\n1:03:58 some environment that just tells you\n1:03:59 whether your response was correct or\n1:04:03 your response was wrong. 
Our first\n1:04:05 question was okay even in this kind of\n1:04:07 environment where where we don't get any\n1:04:10 additional signal from the environment\n1:04:12 can we still benefit from better credit\n1:04:15 assignment and so how you would do\n1:04:17 self-distillation here is very simple so\n1:04:18 you would do similar as in GRPO you would\n1:04:20 do multiple rollouts per question and\n1:04:23 then you would have some correct\n1:04:25 attempts and you would have some wrong\n1:04:26 attempts and then you can just put the\n1:04:28 correct attempts in the context of the\n1:04:30 teacher for the wrong attempts in the\n1:04:32 same way as Eden described how you would\n1:04:34 use self distillation when you actually\n1:04:36 already have correct demonstrations.\n1:04:38 Here you would use these correct\n1:04:40 demonstrations but these would be\n1:04:41 generated by the model itself as it\n1:04:43 explores in the environment. And so\n1:04:46 here's one example. Um this is on a\n1:04:49 chemistry reasoning data set. 
And what\n1:04:51 we saw here is that we saw two things.\n1:04:53 We saw that SDPO both converges here to\n1:04:56 a much higher accuracy but it also\n1:04:59 converges much faster in training wall\n1:05:02 clock time which this fast convergence\n1:05:05 we generally saw because as we discussed\n1:05:08 it just provides much denser credit\n1:05:10 assignment so a much richer update\n1:05:12 signal and another thing which I know we\n1:05:14 don't go in here right now is that we\n1:05:16 also typically saw that it learns more\n1:05:18 efficient reasoning so it uses more or\n1:05:20 it uses rather less of these typical\n1:05:23 reasoning tokens such as 'hmm' and\n1:05:25 'wait' etc which GRPO tends to produce\n1:05:29 >> and um another thing that I think I've\n1:05:31 seen in the paper is that like they're\n1:05:33 also way shorter than like um the GRPO\n1:05:37 one uh do you have any thought about\n1:05:39 like why exactly like I know that like\n1:05:42 the the just standard GRPO formulation\n1:05:44 like when it's wrong it's going to be\n1:05:46 wrong for longer and stuff there has\n1:05:47 bias for that uh but why is it like that\n1:05:51 much uh more efficient shorter.\n1:05:53 >> Um, yeah, it's a great question. I think\n1:05:55 it's um it touches on what we discussed\n1:05:57 earlier, namely that in hindsight when\n1:06:00 you kind of critique how you would have\n1:06:03 solved the problem, you you often think,\n1:06:06 okay, I could have shortened my response\n1:06:09 like there was a more direct path to\n1:06:11 responding. 
And so to some extent I\n1:06:15 think it's a function of this hindsight\n1:06:17 policy being able in a more informed\n1:06:20 state being able to tell the\n1:06:22 policy okay here this was like a\n1:06:25 circular reasoning loop that wasn't\n1:06:27 necessary and it's penalizing the\n1:06:30 policy for that at the same time though\n1:06:33 um I think it's also the case that this\n1:06:36 points to a particular problem in GRPO\n1:06:39 which has also been widely studied at\n1:06:41 this point that even though you know I\n1:06:45 think parts of the core intuition when\n1:06:47 GRPO was first introduced and then also\n1:06:49 when the DeepSeek R1 paper came out was\n1:06:51 this idea okay wow we can see that our\n1:06:53 policy learns to produce longer and\n1:06:55 longer responses and has these kind of hmm\n1:06:57 tokens and starts to actually quote\n1:06:59 unquote think a lot of work since then\n1:07:02 has shown that you know often these like\n1:07:04 not always but sometimes these\n1:07:07 additional think tokens they're not\n1:07:09 necessary for good generalization\n1:07:12 And they are kind of I think they're\n1:07:14 surfacing this form of weak credit\n1:07:17 assignment that GRPO performs in the\n1:07:19 sense that GRPO will just upweight things\n1:07:22 that tend to work better on average on a\n1:07:26 very coarse average which are then\n1:07:28 sometimes these approaches which just\n1:07:32 think things five times over in the same\n1:07:34 way. And we did see this in\n1:07:37 the paper that we got the circular\n1:07:39 reasoning in GRPO even explicit circular\n1:07:42 reasoning where GRPO would or the model\n1:07:44 trained with GRPO would say I am running\n1:07:46 like I'm running in circles or\n1:07:48 things like that. 
Maybe one thing which\n1:07:52 we don't have to go in that much detail\n1:07:53 because we already covered it at this\n1:07:55 point which I wanted to mention is that\n1:07:56 what we generally saw in the RL case but\n1:08:00 also in the um in the cases that Eden\n1:08:03 described before where we compared\n1:08:05 against SFT when learning from\n1:08:06 demonstrations is we saw that as you\n1:08:08 scale models you get better in-context\n1:08:10 learners and those translate to better\n1:08:12 self-teachers. So better teacher\n1:08:14 signals that then lead to better student\n1:08:16 models as we train them with on-policy\n1:08:18 self-distillation. Maybe that's a very\n1:08:20 intuitive thing. So maybe the the one\n1:08:23 major thing which we want to cover which\n1:08:24 we're very excited about is how these\n1:08:27 methods can really uncover new data or\n1:08:30 unlock new data modalities that we can\n1:08:32 use for training. And one very natural\n1:08:34 one is this idea of learning from\n1:08:36 rich feedback. So here's just one very\n1:08:39 practical example to illustrate this\n1:08:41 where the question is how much impulse\n1:08:42 did the thrusters generate for the Mars\n1:08:44 climate orbiter and let's say the model\n1:08:47 answer would be 100 pound-force seconds\n1:08:50 and as I mentioned in normal RLVR the\n1:08:53 model would just receive a binary reward\n1:08:55 and so in this case because the answer\n1:08:56 is wrong it would be just a negative\n1:08:59 reward but in many cases in many real\n1:09:02 environments you would have some richer\n1:09:05 signals some denser feedback loop\n1:09:08 where the feedback would be much more\n1:09:09 informative as to what the model\n1:09:11 actually did wrong. 
So here in this case\n1:09:13 that it should answer in Newton seconds\n1:09:15 and there's many examples of this kind.\n1:09:18 And two examples we looked at\n1:09:21 primarily there's code environments for\n1:09:23 example which produce runtime errors\n1:09:25 failed unit tests etc. And there's real\n1:09:28 user conversations and I want to focus\n1:09:31 maybe on the real user conversations. So\n1:09:33 one really nice experiment this was led\n1:09:36 by Thomas um was where we asked okay can\n1:09:40 we learn from raw user conversations\n1:09:43 user conversations in the wild and can\n1:09:45 they actually improve models and what we\n1:09:48 did here is we took 14,000 real world\n1:09:51 user conversations from a data set\n1:09:53 called WildChat which was produced some\n1:09:55 years back by Allen AI and then what we\n1:09:59 did is we did kind of the natural thing\n1:10:00 which we discussed we um split this into\n1:10:03 triplets where each triplet consists of\n1:10:05 the prompt or the history of the\n1:10:07 conversation up to that point. Then the\n1:10:09 model response and then the follow-up\n1:10:11 user response which would indicate maybe\n1:10:14 what was an issue a possible issue with\n1:10:17 the previous assistant's response and we\n1:10:19 would train on that with on-policy\n1:10:21 self-distillation and so what we then\n1:10:24 did is then we evaluated the trained\n1:10:26 model on a diverse suite of benchmarks.\n1:10:29 So for alignment, for instruction\n1:10:32 following, for reasoning, for creative\n1:10:33 writing and also for knowledge here with\n1:10:36 MMLU Pro. And what we saw here is that\n1:10:38 on several of these benchmarks and here\n1:10:40 in particular for alignment, uh\n1:10:42 reasoning and creative writing, the\n1:10:44 model substantially improved here, this\n1:10:46 is with Qwen3-8B, but we also tried this\n1:10:47 with other models and this was very\n1:10:49 surprising to us. Why? 
Well, because\n1:10:52 this data that we trained on is\n1:10:54 like in some sense free data. As Edan\n1:10:56 also mentioned before, this is data\n1:10:58 which we just get through\n1:11:01 deploying our model and letting it\n1:11:03 interact with users. It's not the the\n1:11:05 user conversations are raw and it's a\n1:11:07 very weak data set because it's only\n1:11:09 14,000 conversations, right? Because\n1:11:11 this data has not been so useful in the\n1:11:13 past because we didn't know how to learn\n1:11:14 from it. We didn't really have good data\n1:11:16 sets that collected a lot of data. And\n1:11:18 of course, the big companies that\n1:11:19 perform a lot of inference, they have a\n1:11:21 lot more. So most compute today is spent\n1:11:23 on inference but we didn't know how to\n1:11:24 leverage this interaction for training.\n1:11:26 And what we're excited about in terms of\n1:11:28 this result is that it seems that\n1:11:30 on-policy self-distillation enables\n1:11:32 scalable learning without requiring\n1:11:36 explicit rewards just by raw interaction\n1:11:38 with the environment and receiving\n1:11:40 textual feedback. So this is like one\n1:11:42 example of how you can learn from this\n1:11:45 raw feedback from humans but it's like\n1:11:47 across a population of humans, right? So\n1:11:49 it's still a typical post-training\n1:11:51 objective. The one last thing which we\n1:11:54 wanted to discuss here in these\n1:11:56 slides is how can we use on-policy\n1:11:59 self-distillation for a really quote\n1:12:02 unquote continual learning system that\n1:12:04 is deployed in the wild and I'm going to\n1:12:08 give a few examples of what does that\n1:12:11 even mean and how could such a system\n1:12:13 look like. So one early example that we\n1:12:16 started looking at was how can we\n1:12:18 discover solutions to very hard\n1:12:20 problems. So we didn't want to go with\n1:12:22 super hard problems. 
Um so what we did\n1:12:24 is we took problems from coding tasks\n1:12:26 from LiveCodeBench that the model was\n1:12:28 not able to solve across a lot of\n1:12:31 attempts. And then what we were\n1:12:33 interested in is how quickly would the\n1:12:35 model discover the solution to that\n1:12:37 task. And how we quantified that was\n1:12:39 through this discovery at K metric. So\n1:12:41 this is just the probability of solving\n1:12:44 a task within K environment\n1:12:46 interactions. Right? So this is what I'm\n1:12:48 going to plot here. And the tasks we\n1:12:51 considered here were really hard tasks.\n1:12:53 So these were tasks where the pass at 64\n1:12:55 was less than 3%. Meaning that if you\n1:12:58 were to 100 times each time sample 64\n1:13:01 solutions, only in three of these 100\n1:13:04 cases, the model would actually have\n1:13:06 sampled any correct solution in these 64\n1:13:08 attempts. And then the simplest\n1:13:10 baseline is what's typically called\n1:13:13 best of K which just repeatedly samples\n1:13:16 um solutions from the base model and we\n1:13:19 wait until it has found a solution and so\n1:13:21 actually for best of K this discovery at\n1:13:23 K metric is the same as pass at K\n1:13:25 because here there's no sequentiality\n1:13:27 it's just repeated sampling and in some\n1:13:29 sense the best of K um baseline here is\n1:13:33 what would correspond to if you were to\n1:13:35 try to run something like GRPO here\n1:13:37 right because GRPO does not have a\n1:13:38 learning signal, any learning signal\n1:13:40 until you get the first solution. And of\n1:13:43 course, the feedback from the\n1:13:44 environment here would just be runtime\n1:13:46 errors and code unit tests. 
And so what\n1:13:49 we saw is that on these hard tasks\n1:13:53 running self-distillation got a\n1:13:54 significant speed up over best of K and\n1:13:57 also another baseline which is just a\n1:13:59 multi-turn baseline which keeps all of\n1:14:01 the conversation history with the\n1:14:03 environment in context until it runs out\n1:14:05 of context and then has a first in first\n1:14:07 out queue. And so what this told us,\n1:14:11 this experiment is that self-distillation\n1:14:13 can really learn to solve hard tasks\n1:14:15 even before it ever solved the task,\n1:14:17 right? Just by the teacher providing\n1:14:19 directionally accurate feedback that\n1:14:21 points towards how to solve a task. And\n1:14:23 this was really one of the first ways in\n1:14:25 which we applied self-distillation\n1:14:28 online in a continual way when given one\n1:14:31 particular task. And so based on this,\n1:14:34 people took this way further and this is\n1:14:36 super exciting. So I think very shortly\n1:14:38 after we put out this paper, someone on\n1:14:41 Twitter uh had this idea of\n1:14:44 continual code. So like Claude Code but\n1:14:47 running a local model and that model is\n1:14:50 actually learning as you go. It's not\n1:14:52 just saving things in context or\n1:14:54 scaffolding. It's actually updating the\n1:14:56 model parameters. So here you have a GIF\n1:14:58 running and basically as he interacts\n1:15:01 with the model whenever the model does\n1:15:03 something that they don't want the model\n1:15:06 does some update. So here it's\n1:15:09 thinking and then he will reject the\n1:15:11 edit and he will say that he wants the\n1:15:14 helper to be minimal and some other\n1:15:17 instruction and then the model would\n1:15:18 actually do a training step. So it was a\n1:15:20 first sketch of an idea of like a\n1:15:22 continual learning system that actually\n1:15:24 updates weights as you interact with it\n1:15:26 naturally. 
And then I think over the\n1:15:28 last few weeks this has been picked up\n1:15:30 by a library called OpenClaw-RL which is\n1:15:34 running on-policy self-distillation under\n1:15:36 the hood but is extending this way\n1:15:38 beyond just coding agents but putting\n1:15:40 this into OpenClaw. So having your\n1:15:42 agent interact with whatever tools you\n1:15:45 give it access to and then having the\n1:15:48 agent actually learn over time and in\n1:15:49 asynchronous fashion. And so that's all\n1:15:51 these things we're super excited by\n1:15:54 because these really point to a future\n1:15:56 where you have a model that learns\n1:15:58 online as you are interacting with it.\n1:16:01 And so this was kind of an overview of\n1:16:03 that. Just one more mention is that\n1:16:07 while there have been the three papers\n1:16:09 um that uh Edan and Thomas and I have\n1:16:12 been leading there's also been a\n1:16:14 lot of other really cool research that\n1:16:16 has come out and here's just a few\n1:16:18 papers um all research that has come out\n1:16:21 in the last month or so. So yeah that's\n1:16:23 all and thanks a lot. I'll also add one\n1:16:26 thing that um both SDFT and SDPO the two\n1:16:30 versions of the self-distillation\n1:16:31 algorithms are available with TRL um in\n1:16:36 the last week Hugging Face people merged\n1:16:39 um an implementation in their codebase\n1:16:42 so you know it makes it much easier for\n1:16:44 everyone to play with these ideas.\n1:16:46 Thanks a lot for the the presentation.\n1:16:47 You guys there's Shannon Sans from um\n1:16:50 Nous Research. They want to know like\n1:16:52 okay like this is what you guys have\n1:16:54 been working on. What are you guys\n1:16:56 working on now um in this direction? Can\n1:16:59 you share a bit with this?\n1:17:00 >> For me one of the kind of like there are\n1:17:04 two things that remain to do at least in\n1:17:06 my eyes in this field. 
One is just\n1:17:08 scaling up and that's kind of hard to do\n1:17:11 in academia and we hope that you\n1:17:13 know various frontier labs will just\n1:17:15 take these ideas and uh scale them. The\n1:17:18 other one um which is more of like\n1:17:21 not in the scaleup but more in the last\n1:17:23 example that Yonas gave where like you\n1:17:25 have one user interact with one agent\n1:17:28 and you want to improve on this\n1:17:30 conversation is on the opposite side is\n1:17:33 about sample efficiency where you don't\n1:17:35 have scale. You just have one user\n1:17:37 provide you know a few points of\n1:17:38 feedback and the question is how can you\n1:17:40 learn from that and like in-context\n1:17:42 learning can learn from that like we\n1:17:44 know that you put even one sentence you know\n1:17:46 into the context and it will change the\n1:17:48 model behavior quite well but self\n1:17:51 distillation although it's able to like\n1:17:53 kind of propagate the same kind of\n1:17:55 behavior change it's still limited by\n1:17:57 the fact that it's doing um like\n1:17:59 gradient descent basically and we know\n1:18:01 that gradient descent only takes you know\n1:18:03 small steps, pushes the model a bit at a time\n1:18:05 that's like inherent in the algorithm.\n1:18:07 Um so currently we're looking into ways\n1:18:09 to make the supervision even denser\n1:18:12 to make the update such that even\n1:18:14 with a single point of environment\n1:18:17 feedback you can change the model quite\n1:18:20 a lot without forgetting. 
So you know an\n1:18:22 extreme case will be that if the user\n1:18:24 says never use the letter F\n1:18:27 that's it the model will stop using the\n1:18:29 letter F with one update to the weights.\n1:18:32 That's kind of for me that's the dream.\n1:18:34 Of course this is like a silly example\n1:18:35 but like you know kind of the same way\n1:18:37 that I don't need to tell you things a\n1:18:39 thousand times in order for you to learn\n1:18:41 the concept. I want models not to need\n1:18:43 to be told a thousand times in order to\n1:18:45 learn a concept. So this is kind of like\n1:18:47 what excites me these days and we have\n1:18:49 some ideas that hopefully in the next\n1:18:51 months or so we'll share about how to do\n1:18:53 it.\n1:18:53 >> Yeah. Yeah. Same here. So, um I do\n1:18:57 think this is maybe the central at least\n1:19:00 one of the very central um immediate\n1:19:04 kind of questions. And I think\n1:19:06 this is one like this question of\n1:19:07 can we have a learning paradigm that can\n1:19:09 actually do parameter updates but be as\n1:19:13 learning efficient or as sample\n1:19:14 efficient as humans are. Right? 
This is\n1:19:16 like a big question and has been a big\n1:19:20 goal of the field and I do believe that\n1:19:22 right so I think part of the\n1:19:25 magic of in-context learning has been\n1:19:27 that in-context learning is somehow that\n1:19:29 right it is somehow extremely\n1:19:32 sample efficient and I think that's\n1:19:34 what has been so exciting\n1:19:36 about it and also so useful about it but\n1:19:39 also as we mentioned it's inherently\n1:19:40 transient so the question is can we do\n1:19:43 lasting learning but with the same sample\n1:19:47 efficiency as say in-context learning or\n1:19:49 humans is a very appealing um\n1:19:52 question and I do think like there's a\n1:19:55 lot of promise there and then of\n1:19:57 course there's other things like also\n1:19:59 the thing which I sketched out\n1:20:00 earlier a bit was this question of\n1:20:03 how can a model elicit the or like find\n1:20:06 the right feedback or seek out the right\n1:20:10 feedback from its environment\n1:20:12 >> there was a question from Jasper Lou uh\n1:20:15 from Figma. He's asked like has there\n1:20:17 been any work exploring this in a more\n1:20:20 subjective feedback space uh where like\n1:20:22 the feedback is like um not necessarily\n1:20:25 like hard feedback there you showed\n1:20:27 that with the um uh WildChat but\n1:20:31 like is there uh um is there any hope\n1:20:35 when the feedback is not clear at all\n1:20:37 like when the model needs to kind of\n1:20:39 interpret it or like generate its own\n1:20:41 feedback? I think yes there have been\n1:20:44 several things which we looked at. So\n1:20:46 one in terms of uh subjective feedback\n1:20:49 something which I didn't mention or\n1:20:51 which we didn't mention in the\n1:20:52 presentation was part of what Thomas\n1:20:54 also looked at was personalization to\n1:20:57 user preferences. 
So as opposed to the\n1:21:00 WildChat uh example where we just did\n1:21:02 general alignment to a large population\n1:21:06 the question of okay if you interact\n1:21:07 with a user and the user exhibits\n1:21:08 certain preferences like I don't want\n1:21:10 emojis or I don't like sycophancy\n1:21:13 right or I want short responses\n1:21:17 like the model is able to pick\n1:21:19 this up through on-policy\n1:21:20 self-distillation and um so that is like\n1:21:24 a subjective type of feedback and then\n1:21:26 in terms of an imperfect type of\n1:21:28 feedback. I think that's more what we\n1:21:29 try to look at in the context of this\n1:21:32 solving hard tasks LiveCodeBench\n1:21:34 >> thing where the feedback is\n1:21:36 inherently incomplete like it may be\n1:21:38 directionally right but if you just see\n1:21:41 a runtime error it doesn't tell you yet\n1:21:43 how to solve the problem right it might\n1:21:45 constrain your solution space in some\n1:21:47 way but it doesn't leak the answer\n1:21:50 and I think both are important and\n1:21:52 there's lots of cool other things we can\n1:21:53 do\n1:21:53 >> I'll add that in cases where the\n1:21:55 feedback is not like very clear where\n1:21:57 like one needs to reason and extract you\n1:21:59 know kind of what is the actual change\n1:22:01 that needed to be done. 
A very cool idea\n1:22:03 that one can do and we didn't have\n1:22:06 time for it but I'm sure it will work\n1:22:08 and I hope someone will take up on\n1:22:10 that is just to let the model reason a\n1:22:12 bit you know give it the context and\n1:22:14 just before you get to the part where\n1:22:16 there is a new answer that you do self\n1:22:18 distillation over give it the\n1:22:20 opportunity to do some chain of thought\n1:22:22 understand a bit you know what is the\n1:22:24 thing you need to take out of this\n1:22:27 additional information yeah and I'm sure\n1:22:29 it will work I'm sure it will improve\n1:22:30 things and again I'm sure we'll also\n1:22:33 see a paper doing that in the next few\n1:22:34 months. Someone would take up\n1:22:37 on that.\n1:22:38 >> This is good. Um also like um uh um I\n1:22:44 don't know like the quality aspect\n1:22:46 of the demonstration and like um uh\n1:22:50 what's your thought on that? like a um\n1:22:53 let's say you had like 100 okay uh like\n1:22:57 not high signal feedback versus like 10\n1:23:01 very excellent ones. Um I know like the\n1:23:04 like the model likes to have\n1:23:06 multiple exemplars and then like iterate\n1:23:08 on that. 
Um\n1:23:10 which in the current paradigm of\n1:23:12 self-distillation which will they prefer\n1:23:14 here like the high quality ones or\n1:23:16 like the repeated feedback ones?\n1:23:20 So at least I think from the experiment\n1:23:22 I did that like the hundred um you\n1:23:27 know more examples but like medium\n1:23:29 quality will give you better performance\n1:23:33 eventually as long as there is some\n1:23:34 signal that you can extract even just\n1:23:37 telling the model you know here's some\n1:23:39 feedback think again about what you did\n1:23:41 give me a better answer will push you\n1:23:43 forward um and but I'll say that this is\n1:23:47 not really about the objective it's just\n1:23:48 because as we mentioned\n1:23:50 gradient descent learning likes a lot of\n1:23:52 examples likes coverage it helps for\n1:23:54 generalization helps for wider minima\n1:23:57 stuff like that so it has to do a lot\n1:24:00 with the optimizer that we use\n1:24:02 >> yeah I would say I would agree\n1:24:05 with that I think it becomes a\n1:24:08 little bit of a subtle question\n1:24:10 when you consider that the feedback that\n1:24:12 you get or that your student gets it\n1:24:15 need not necessarily come from new data\n1:24:18 that you obtain from the environment. So for\n1:24:20 example in Edan's setting when you have\n1:24:22 expert demonstrations\n1:24:24 like solutions like ground\n1:24:26 truth solutions which may be very\n1:24:28 detailed you could have your model\n1:24:30 sequentially generate attempts train\n1:24:33 with the feedback then generate attempts\n1:24:36 again which may be a bit better but not\n1:24:38 fully solved it yet and then get\n1:24:39 feedback again and so these are\n1:24:41 examples I think where right now you\n1:24:44 would still prefer to get feedback like\n1:24:47 a lot of times Right. 
But the feedback\n1:24:51 does not require in that context another\n1:24:53 data point.\n1:24:54 >> True. True. Oh,\n1:24:56 >> it reminds me of how like um u I was\n1:25:00 learning like um I don't know\n1:25:03 verifiable\n1:25:04 uh element like math or physics like I\n1:25:07 had like this kind of red green yellow\n1:25:09 method where I would just try the\n1:25:12 exercise and if like I completely messed\n1:25:15 it up I would put the red there but I\n1:25:17 will still look at the answer try to\n1:25:19 figure my way out right and then I'll\n1:25:21 just move on and if like I kind of\n1:25:24 didn't get it right. But like I knew\n1:25:26 exactly why I put the yellow there. But\n1:25:28 then all the green ones that I got first\n1:25:31 try, I never do them again. Like I got\n1:25:32 them. So what's the point, right? But\n1:25:34 then I just do iterative passes on like\n1:25:37 on the yellows and then iterative passes\n1:25:40 on the red and I still got like good\n1:25:42 learning signal out of it. And at some\n1:25:44 point there's like the three last ones\n1:25:46 that are still kind of all messed up. Uh\n1:25:49 and in this specific case I know that\n1:25:51 like I need to get better feedback. So\n1:25:53 then went get a better feedback from the\n1:25:55 future whatever. Um but I feel like it's\n1:25:58 kind of the same situation where like\n1:26:00 yeah I got multiple exemplars of\n1:26:03 the same stuff right but with different\n1:26:05 understanding right and while you\n1:26:07 go through like the rest of the\n1:26:09 training data set and then you come back\n1:26:11 to those that you messed up um\n1:26:13 the feedback will\n1:26:15 be different because you're going to do\n1:26:16 it a bit differently. you can learn\n1:26:18 something in this u specific uh uh\n1:26:21 exercise that will help you in those\n1:26:23 that you're struggling. 
Um I feel it's\n1:26:26 like a similar kind of vibe uh on\n1:26:29 this\n1:26:30 >> and it also reminded me of how I\n1:26:32 study for exams for sure.\n1:26:34 >> [laughter]\n1:26:35 >> Um last question. Um I want you to\n1:26:39 think hard here. Like do you think\n1:26:41 there's a setting where like self\n1:26:44 distillation will be um kind of strictly\n1:26:48 poorer than like the standard RLVR or\n1:26:51 GRPO? Like do you think that there's a\n1:26:53 setting here that like self-distillation will\n1:26:56 struggle compared to GRPO?\n1:26:58 >> One setting that is kind of obvious is\n1:27:00 when you have weak models, right? If\n1:27:02 you're trying to improve a 350 million\n1:27:05 parameter model with very weak\n1:27:08 >> in-context learning capabilities, it\n1:27:10 will not work the best. I'll say that\n1:27:11 like um we tried um here in MIT to apply\n1:27:16 self-distillation to robotics to vision\n1:27:19 language action models VLAs and it just\n1:27:22 doesn't work like these models do not\n1:27:24 have in-context learning capabilities\n1:27:26 strong enough um to actually learn while\n1:27:30 you know GRPO even simple REINFORCE can\n1:27:33 improve these models quite a lot so that\n1:27:35 I'll say the obvious uh answer um\n1:27:39 more than that I can say that there are\n1:27:42 cases where I don't know if it will do\n1:27:44 poorer but it's just not really needed if\n1:27:46 you just want to kind of sharpen the\n1:27:48 distribution you know like whatever you\n1:27:51 want to learn is already kind of like\n1:27:54 >> inside your um distribution you get\n1:27:56 let's say always like around 50 60%\n1:27:59 success rate and you just want to push\n1:28:01 the 50 to the 100 on a you know inside a\n1:28:04 group like you sample a group for a\n1:28:06 prompt you get out of like you know your\n1:28:09 like however many like responses you get\n1:28:12 most of them correct and you just want\n1:28:13 to make sure that you don't let's make\n1:28:15 
sure all of them will be correct then\n1:28:17 self-distillation is not really like\n1:28:20 maybe it will work but I'll not say like\n1:28:22 avoid the complication you know right\n1:28:24 >> pure GRPO um because then there is\n1:28:28 nothing to gain from like you know some\n1:28:30 external knowledge additional knowledge\n1:28:32 like really good credit assignment stuff\n1:28:35 like that that where self-distillation\n1:28:37 really shines will not help you that's\n1:28:38 not what you Yeah, I maybe to pick up on\n1:28:41 that point. Um I think like normally in\n1:28:46 RL like there is this trade-off between\n1:28:49 methods that are unbiased but exhibit a\n1:28:51 lot of variance and methods that are\n1:28:54 maybe biased but have much smaller\n1:28:55 variance. And in many ways you can think\n1:28:59 um of self-distillation in that lineage\n1:29:02 of kind of a method that may be a bit\n1:29:05 biased depending on how good your model's\n1:29:08 in-context learning ability is etc but\n1:29:11 which has much less variance in its\n1:29:14 update signal that of course is a\n1:29:16 trade-off right and um I think it\n1:29:20 depends often on like which one is\n1:29:24 better depends on whether you're compute\n1:29:25 bound whether you're data bound\n1:29:27 It depends on how much compute you have\n1:29:30 available in total. And so right now as\n1:29:33 Edan said if you have a\n1:29:36 like a large pipeline and you have\n1:29:39 unlimited compute and you have your data\n1:29:41 environment set up that give you\n1:29:43 like 50% success rate in every\n1:29:46 group. 
you should run\n1:29:47 you should run an unbiased\n1:29:50 method probably and that tends\n1:29:52 to be more stable right\n1:29:54 >> long term I don't it's very hard to\n1:29:58 predict what will happen\n1:30:02 >> okay that makes a lot\n1:30:03 >> but yeah I think that the things which\n1:30:06 like we're or I personally at least I\n1:30:09 can say that I'm most excited\n1:30:12 about are the settings where there isn't\n1:30:15 even a method that can learn from that\n1:30:17 type of data effectively.\n1:30:19 >> So do you give an example? Um\n1:30:22 >> yeah, for example, learning from\n1:30:23 user conversations\n1:30:25 or like learning in this kind\n1:30:27 of online fashion in like OpenClaw-RL\n1:30:30 like where you just interact with your\n1:30:31 environment and your environment returns\n1:30:32 you whatever\n1:30:34 >> and\n1:30:35 >> because then you increase massively\n1:30:37 the amount of uh data that you can shove\n1:30:40 into the model and still make it learn\n1:30:43 in a stable fashion.\n1:30:44 >> Um agree. Yeah.\n1:30:46 >> Yeah.\n1:30:46 >> And if there's one thing that we learned\n1:30:48 over the past two decades in deep\n1:30:50 learning it's that it is all about scale, you\n1:30:52 know,\n1:30:52 >> find ways to shove more and more data\n1:30:55 into your model.\n1:30:56 >> Cool. Fantastic guys. Like this was\n1:30:58 absolutely awesome. Uh really thankful\n1:31:00 for you to have taken the time to come\n1:31:02 and answer all of these questions. And\n1:31:05 that's it for today folks. Uh if you're\n1:31:06 interested in diving more into this\n1:31:08 whole method of self-distillation, check\n1:31:10 out the links in the description. I tried\n1:31:12 to put as many papers that are relevant\n1:31:14 as possible. And if you have any\n1:31:16 question, don't hesitate to shoot them\n1:31:17 in the comment."
}
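
Editor's sketch (not part of the talk): around 1:12:53 the speakers describe tasks whose pass at 64 is below 3%, and note that for the best-of-K baseline the discovery at K metric equals pass at K because the samples are drawn independently with no learning in between. The arithmetic behind that can be sketched as below, under the simplifying assumption (mine, not the speakers') that every sample succeeds independently with one fixed probability `p`. The function names `pass_at_k` and `per_sample_rate` are hypothetical helpers, not from the paper or from any library.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    assuming each sample succeeds with the same probability p."""
    return 1.0 - (1.0 - p) ** k


def per_sample_rate(pass_k: float, k: int) -> float:
    """Invert pass_at_k: the per-sample success rate implied by a pass@k value
    under the same independence assumption."""
    return 1.0 - (1.0 - pass_k) ** (1.0 / k)


if __name__ == "__main__":
    # A task with pass@64 = 3% (the upper bound mentioned in the talk).
    p = per_sample_rate(0.03, 64)
    print(f"implied per-sample success rate: {p:.6f}")

    # With pure repeated sampling (best-of-K), discovery@k is just pass@k.
    for k in (64, 256, 1024, 4096):
        print(f"k={k:5d}  best-of-k discovery@k = {pass_at_k(p, k):.3f}")
```

Under this assumption a pass at 64 of 3% implies a per-sample success rate of roughly 0.05%, which illustrates why a method with no learning signal before the first success has to rely on repeated sampling alone on such tasks.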