Speaker 1 00:00 - 00:26
You need to reach this level of reliability to really make any of these AI tools very useful, and I think we just crossed that probably December last year, at least at OpenAI. Now we can trust these models to do a lot of the work that we are doing. The last few months have been pretty wild. We moved from competitions to usefulness to users, and that's what we are feeling right now. I think most of the time the Bionic is the last mile.
你必须达到这种可靠性水平,才能真正让这些 AI 工具变得非常有用,而我认为我们大概在去年 12 月跨过了这条线,至少在 OpenAI 是这样。现在我们可以信任这些模型去完成我们正在做的大量工作。过去几个月相当疯狂。我们已经从那种竞赛式推进转向了对用户真正有用,而这正是我们当下的感受。我认为大多数时候,Bionic 才是最后一公里。
Speaker 1 00:26 - 00:34
There will always be a lot of space left for this last mile in different verticals. And I would highly encourage people to continue working on that.
在不同的垂直领域(verticals)里,这个“最后一公里”始终都会留下很大的发挥空间。我也会非常鼓励大家继续在这方面努力。
Speaker 2 00:34 - 01:03
Hi, I'm Matt Turk. Welcome to the MAD Podcast. My guest today is Jan Dubois, who co-leads the Post-Training Frontiers team at OpenAI. The recent release of GPT 5.5 was yet another major milestone in AI, and Jan's team helped build it alongside OpenAI's prior top reasoning models, including o3 and GPT 5 Thinking. Before OpenAI, Jan was at Stanford, where he co-authored Stanford Alpaca, the landmark project that kicked off much of the modern post-training research community.
大家好,我是 Matt Turk。欢迎来到 MAD Podcast。今天的嘉宾是 Jan Dubois,他共同领导 OpenAI 的 Post-Training Frontiers 团队。最近发布的 GPT 5.5 是 AI 领域又一个重要的里程碑,Jan 的团队参与打造了它,也参与了 OpenAI 之前一些顶尖推理模型的构建,包括 o3 和 GPT 5 Thinking。在加入 OpenAI 之前,Jan 在 Stanford 工作,他在那里共同发表了 Stanford Alpaca,这个里程碑式项目开启了现代 post-training 研究社区的很大一部分。
Speaker 2 01:03 - 01:29
In this conversation, we go deep on what's actually new in GPT 5.5, why reinforcement learning is moving from math and coding competitions into messy real-world work, why AI progress can feel like a sudden step function, and why continual learning remains one of the big unsolved problems in AI three years after ChatGPT. Please enjoy this fantastic conversation with Jan Dubois. Hey, Jan. Welcome.
在这次对话中,我们会深入讨论 GPT 5.5 到底有哪些真正的新东西,为什么 reinforcement learning(强化学习)正在从数学和编程竞赛走向混乱的现实世界工作,为什么 AI 的进步会让人感觉像突然出现的阶跃函数,以及为什么在 ChatGPT 发布三年之后,continual learning(持续学习)仍然是 AI 中尚未解决的大问题之一。请大家欣赏这场与 Jan Dubois 的精彩对话。嘿,Jan,欢迎。
Speaker 1 01:29 - 01:30
Hi, Matt. Thanks for having me.
你好,Matt。谢谢邀请我来。
Speaker 2 01:30 - 01:57
It's been another wild last few weeks in the world of frontier AI, with the release of GPT 5.5 and of Claude Mythos Preview. So it feels like we have unlocked yet another step function in progress, particularly in cybersecurity and agentic coding. What's the best way to think about this from your perspective? Are things accelerating? What is happening?
随着 GPT 5.5 和 Claude Mythos Preview 的发布,frontier AI 领域过去几周再次变得非常疯狂。所以感觉上,好像我们又解锁了一个新的进步阶跃,尤其是在 cybersecurity(网络安全)和 agentic coding 方面。从你的视角来看,理解这件事的最佳方式是什么?事情是在加速吗?到底发生了什么?
Speaker 1 01:57 - 02:31
Yeah. The last few months have been pretty wild. Internally, we also really feel it, and I think anyone who's coding, basically, is really feeling it right now. I think that's really because of three reasons. The first one is that, even though in my mind the progress is actually pretty continuous, you need to reach this level of reliability to really make any of these AI tools very useful, and I think we just crossed that probably December last year, at least at OpenAI.
是的,过去几个月确实非常疯狂。我们内部也真切感受到了这一点,而且我认为,基本上任何在写代码的人,现在都非常能感受到这一点。我觉得这主要有三个原因。第一个是,尽管在我看来,进展其实是相当连续的,但你必须达到这种可靠性水平,才能真正让这些 AI 工具变得非常有用,而我认为我们大概在去年 12 月跨过了这条线,至少在 OpenAI 是这样。
Speaker 1 02:31 - 03:05
That's where I thought we really crossed that threshold, where now we can trust these models to do a lot of the work that we are doing. So it feels like a step function, even though I think that, in terms of capability, it's actually pretty continuous. So that's the first thing. The second reason is that once you start having models that are really good, you accelerate yourself, especially in terms of coding, given that we all code internally. You accelerate yourself both by having these models train other models and by having them build the tooling that we need as researchers to do our job.
那就是我认为我们真正跨过那个门槛的时候——现在我们可以信任这些模型去完成我们正在做的大量工作。所以它给人的感觉像是一个阶跃函数,尽管我认为,从能力角度来说,它其实是相当连续的。这是第一点。第二个原因是,一旦你开始拥有真正优秀的模型,你就会加速你自己,尤其是在 coding 方面,因为我们内部所有人都会写代码。你会以两种方式加速自己:一是让这些模型去训练其他模型,二是让它们构建我们作为研究人员开展工作所需要的工具。
Speaker 1 03:05 - 03:44
All this acceleration, I think, means that we saw these last few months going faster and faster. The third thing that I think we are feeling is that all of last year, we really built these reasoning models, and we were really pushing a lot on reinforcement learning. And initially, when we had o1, o1-preview, even o3, these models were still optimized for what we call verifiable rewards, things where we actually have access to ground truth. It's easy to test whether you're correct or not. That is, for example, the case in math questions or coding competitions.
我认为,这一切加速都意味着,过去这几个月我们看到事情进展得越来越快。第三点是,我觉得我们切身感受到的是,去年一整年,我们确实构建了这些 reasoning models(推理模型),也确实在 reinforcement learning(强化学习)上投入了很多。最初,当我们有像 o1、o1-preview,甚至 o3 这些模型时,它们仍然是针对我们所说的 verifiable rewards(可验证奖励)来优化的,也就是那些我们实际能够获得 ground truth(真实答案)的任务。判断对错很容易测试。比如,math questions(数学题)或 coding competitions(编程竞赛)就是这种情况。
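To make the idea of a verifiable reward concrete, here is a minimal sketch in Python. It is an illustration only, not a description of OpenAI's training code: the function name `verifiable_reward` and the normalization rule are assumptions chosen for the example. The point is that the reward comes from comparing against a known ground truth, so correctness is cheap to check.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Hypothetical reward: 1.0 if the answer matches the known ground truth, else 0.0."""
    def normalize(text: str) -> str:
        # Assumed normalization: ignore case and whitespace differences.
        return text.strip().lower().replace(" ", "")

    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0


# A math question with a known answer of "42" and three candidate answers.
rewards = [verifiable_reward(answer, "42") for answer in ["42", " 42 ", "41"]]
print(rewards)
```

Real-world tasks usually lack such a clean check, which is why moving beyond math questions and coding competitions is the harder case discussed in the conversation.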
Speaker 1 03:45 - 04:13
And what I think we are realizing now is that we were able to take many of the tools that we built for these verifiable reward cases and use them more generally for reinforcement learning on real use cases. And I think that's really why we're feeling that right now in real-world coding rather than competitions. So we moved from competitions to usefulness to users, and that's what we are feeling right now.
而我认为我们现在正在意识到的是,我们已经能够把为这些 verifiable reward(可验证奖励)场景构建的许多工具,较为普遍地用于真实用例上的 reinforcement learning(强化学习)。而我觉得,这也正是为什么我们现在会在真实世界的 coding(编程)中,而不是在竞赛环境里,如此强烈地感受到这种变化。所以我们是从 competitions(竞赛)走向了对用户有用性,而这正是我们当下所感受到的。
Speaker 2 04:13 - 04:23
Okay. Fascinating. So we're going to unpack a lot of this, particularly on the RL side. For the first thing that you mentioned, reliability, is that engineering? Is that models?
好的,很有意思。所以我们接下来会把这些内容一一展开,尤其是 RL(强化学习)这一侧。先说你提到的第一点,reliability(可靠性),这是 engineering(工程)问题吗?还是 models(模型)问题?
Speaker 2 04:23 - 04:27
Like, what makes a model reliable in the way you meant it?
也就是说,按你刚才的意思,一个模型为什么会是 reliable(可靠)的?
Speaker 1 04:27 - 04:55
It's a little bit of everything. But in general, given that these are agentic models, if you just think about it as, every two minutes, there's a certain probability that they are wrong, then the longer they run, the higher the probability that the final answer is going to be wrong. So it's just something inherent in agentic models. And what we've been pushing a lot on is making sure that we decrease this probability of being wrong every two minutes.
其实两方面都有一点。但总体来说,既然这些是 agentic models(agent 型模型),如果你把它简单理解为:每过两分钟,它就有一定概率出错,那么它运行得越久,最终答案出错的概率就越高。所以这是 agentic models(agent 型模型)本身固有的一个问题。而我们一直在大力推进的,就是确保把这种“每两分钟出错一次”的概率降下来。
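The compounding effect described here can be sketched with a toy model. The assumptions are mine, not the speaker's: each two-minute step fails independently with the same probability, and a single failure makes the final answer wrong. Under those assumptions the chance of a correct final answer is (1 - p) raised to the number of steps. The name `success_probability` is hypothetical.

```python
def success_probability(p_error_per_step: float, minutes: float,
                        step_minutes: float = 2.0) -> float:
    """Toy model: probability that no step goes wrong over the whole run.

    Assumes independent steps with identical error probability, which is a
    simplification made for illustration.
    """
    steps = int(minutes // step_minutes)
    return (1.0 - p_error_per_step) ** steps


# Compare two per-step error rates over a 10-minute, 1-hour, and 8-hour run.
for p in (0.05, 0.01):
    for minutes in (10, 60, 480):
        print(f"p={p:.2f} minutes={minutes:3d} -> {success_probability(p, minutes):.4f}")
```

Even a small drop in the per-step error rate matters a great deal for long runs, because the exponent grows with run time.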
Speaker 1 04:55 - 05:11
That's purely from a model point of view. Of course, there's a lot of reliability work that is also happening on the applied side, and the team at OpenAI has been doing an amazing job on that. But here I'm talking only about the reliability of our models and making sure that we decrease the probability of being wrong.
所以,单从模型角度来看,当然,很多 reliability(可靠性)相关的工作也发生在 applied side(应用侧),而 OpenAI 的团队在这方面做得非常出色。但我这里甚至只是在谈我们模型本身的可靠性,也就是确保我们能把出错概率实质性地降下来。
Speaker 2 05:11 - 05:28
Great. So 5.5, formerly known as Spud, was, as mentioned, a big deal. It is a big deal. And I'm just curious, from the inside, what are you guys the most proud of? What did you find the most challenging?
很好。所以 5.5,也就是之前叫 Spud 的那个模型,正如前面提到的,是件大事。它现在也是件大事。我只是很好奇,从你们内部来看,你们最自豪的是什么?你们觉得最有挑战的又是什么?
Speaker 2 05:28 - 05:34
Give us some color on how you all felt releasing this.
也跟我们多讲一点你们在发布它时的真实感受。
Speaker 1 05:34 - 06:00
We're all really excited about 5.5, to be honest. It is one of these models where everyone in the company was extremely involved in building it, and I think that we really feel it now. We got a lot of attention because of 5.5, and it seems like all the stars were aligned. That doesn't always happen, and it was just a great model for this. So we did feel it.
说实话,我们都对 5.5 真的非常兴奋。它属于那种公司里几乎所有人都深度参与构建的模型,我觉得我们现在确实能感受到这一点。也就是说,5.5 给我们带来了很多关注,看起来就像一切时机都刚刚好。这样的情况并不总会发生,而这次它恰好就是一个非常出色的模型。所以我们确实有这种感受。
Speaker 1 06:00 - 06:43
It's kind of funny, because in general, with every model that is looking really good early on, we all get really excited about it, and then tons of doubts start coming up, because everyone is hyping this thing internally, but actually it's bad at all these other things. Then there's another wave where people start underhyping it, and it kind of goes through waves. How people feel about it internally depends on when we actually ship it. But that's true with most models that we have. So 5.5 was not that different in this case, but the wave maybe had a higher amplitude. People were very excited, then not as excited, and then we shipped it, and people were happy externally.
这其实挺有意思的,因为一般来说,每当一个模型在早期看起来非常不错时,我们都会有一个模型出现,大家先是非常兴奋,然后就会冒出大量怀疑,因为会觉得,哦,大家在内部把这东西炒得很热,但它其实在很多别的方面表现并不好;接着又会出现另一波情绪,人们开始反过来低估它。整个过程会一波一波地起伏。而我们最终什么时候 ship(发布)它,也会影响内部的人当时对它的感受。不过我们大多数模型基本都会经历这个过程。所以 5.5 在这点上并没有特别不同,只不过它的波动幅度可能确实更大一些。也就是说,人们先是非常兴奋,然后又没那么兴奋,最后我们把它发布了,而外部的反馈是大家都很满意。
Speaker 2 06:43 - 06:56
How long does that process take, including the waves of excitement going up and down? I guess it depends on the release and the importance of each release, but is that a few weeks? Is it a few months?
这个过程一般要多久?我是说,包括这种兴奋情绪上下波动的过程。我猜这取决于具体的 release(发布)以及每次发布的重要程度,但大概是几周,还是几个月?
Speaker 1 06:56 - 07:33
It really depends. I cannot talk exactly about what went into 5.5, but it kind of depends on which part of the pipeline is training which parts of the model. We really have different sub-teams, including pre-training, then you have the mid-training stage, and then you have post-training. Usually, the closer you get to products, post-training being the last stage, the faster the iteration cycle is. And the more upstream you are, the slower the iteration cycle is. So it could go from, let's say, months to days, basically.
这确实要看情况。所以我不能具体谈 5.5 到底经历了什么,但这在某种程度上取决于 pipeline(流程)里的哪一部分在训练模型的哪些部分。我们内部确实有不同的子团队,包括 pre-training(预训练)、mid-training(中期训练)阶段,以及 post-training(后训练)。通常来说,越接近产品端,比如 post-training 这种最后一环,迭代周期就越快;越是上游,迭代周期就越慢。所以整体上讲,时间跨度基本可能从几个月到几天不等。
Speaker 2 07:33 - 07:48
5.5 was particularly good on agentic coding, computer use, knowledge work, and early scientific research. How does that work internally? Do different people focus on those different parts? How do you get to that result?
5.5 在 agentic coding、computer use、knowledge work,以及早期 scientific research 方面尤其强。这在内部是怎么实现的?是不是不同的人分别专注于这些不同部分?你们是怎么得到这个结果的?
Speaker 1 07:48 - 08:15
Yeah. We definitely have different teams that are working on specific use cases and are pushing on these use cases. My team specifically is actually the one that takes all these vertical improvements and tries to put them together in the final model. You could see it as a team that is doing kind of a smoothing function. So you have all these improvements, but you need to make sure that the model doesn't feel too spiky, that it doesn't feel different on different verticals.
对,我们确实有不同的团队在针对特定 use case(使用场景)工作,并持续推进这些 use case。具体来说,我的团队其实更像是把这些纵向能力提升汇总起来,再尝试把它们整合进最终模型的团队。你可以把它理解为同时承担某种“平滑函数”的角色:你有很多改进成果,但你需要确保模型不会显得过于尖刺化,不会在不同 vertical(垂直领域)之间呈现出明显不同的体验。
Speaker 1 08:15 - 08:48
And you also need to have some teams that are working on all the horizontal improvements, and that's basically what my team is doing. There are many things like instruction following, function calling, or thinking about how long a model should think for on different problems. Those are very horizontal, and they impact all these use cases. So we have both these more vertical teams and these more horizontal ones, and both are very important to improve the model. And the good thing is that these things can kind of be improved orthogonally.
此外,你还需要一些团队去做那些横向改进,而这基本上就是我团队在做的事。这里面有很多内容,比如 instruction following(指令遵循)、function calling(函数调用),或者思考模型在不同问题上应该“思考”多少。这些都非常 horizontal(横向),而且会影响到所有这些 use case。所以我们既有更偏 vertical 的团队,也有更偏 horizontal 的团队,而这两类团队对于改进模型都非常重要。好的一点是,这些方面基本上可以相对正交地改进。
Speaker 1 08:48 - 09:28
So you might have multiple different teams that are working on certain verticals, and maybe for one model, only half of these teams made integrations in the last run and improved the model on these capabilities, and maybe for the next model it'll be the other half. So that's, at a high level, how it works. Because you also asked about the things that we are really proud of for this model, I would say two things. Number one is the efficiency of the model. We really, really improved the efficiency of the model, and most tasks can be performed, I would say, about 2x faster now with this model.
所以你可能会有多个不同团队分别在做某些 vertical,而对于某一个模型来说,也许这些团队里只有一半在最后一轮真正完成了 integration(集成),从而提升了模型在这些能力上的表现;到了下一个模型,也许起作用的就是另一半。这大致就是它在高层面上的运作方式。还有一点我想说,因为你也问到了这个模型有哪些让我们特别自豪的地方,我会说有两点。第一点是模型的效率。我们确实、确实把模型效率提升了很多,而且用这个模型,现在大多数任务基本上都可以做到大约快 2X。
Speaker 1 09:29 - 09:49
So that's great. And the other one, which I already mentioned before, is this alignment of the company, making sure that everyone is working towards the same goal. It really takes the entire company working towards this north star of building one good model on specific timelines. So I'm very, very proud of how that happened.
所以这很棒。另一个我之前已经提到过的点,是公司的这种对齐(alignment),要确保每个人都朝着同一个目标努力。而这确实需要整个公司都围绕着这样一个北极星(north star)来工作:在特定时间线内做出一个好的模型。我对这件事最终是如何实现的,感到非常、非常自豪。
Speaker 2 09:49 - 10:02
Great. And then speaking of efficiency, how do you optimize for that? We're talking about efficiency per token. Are we also talking about latency in serving the model? What part is AI research versus engineering?
很好。那说到效率,你们是怎么优化这个指标的?我们谈的是每个 token 的效率。那我们也在谈模型服务时的延迟(latency)吗?AI 研究和工程各自负责的是哪一部分?
Speaker 1 10:02 - 10:27
So that's what I mean when I say it's the entire company: it really comes from everywhere. It has to come from inference optimizations. It has to come from the model being more efficient in its thinking time. Basically, the usual plot that you should be looking at has the number of tokens that you think for on the x axis and the performance on the y axis.
所以,这就是我说“这是整个公司的事情”时的意思:它真的来自各个方面。它必须来自推理(inference)优化,也必须来自模型本身,让它在“思考时间”上更高效。所以基本上,你每多思考一个 token,通常你应该看的那张图就是:x 轴是你用于思考的 token 数量,y 轴是性能。
Speaker 1 10:27 - 11:03
So these are the test-time scaling curves that we look at. Research basically tries to move this curve to the left, so think less to reach the same level of correctness or higher. Inference also deals with this x axis, but switches it from number of tokens to actual latency. The final thing that people care about is latency on the x axis and performance on the y axis, and this is where everything comes together, and this is really what happened with 5.5. So, yeah, that's why I always say I'm really proud of the company for this one.
所以这就是我们看的那些测试时扩展(test time scaling)曲线。研究的工作,本质上是在把这条曲线往左移——也就是思考得更少,但达到同样甚至更高的正确性。而推理(inference)也处理这个 x 轴,只不过它会把“token 数量”转换成实际延迟(latency)。最终大家真正关心的是:x 轴是延迟,y 轴是性能,而这正是一切汇合到一起的地方,这也正是 5.5 身上真正发生的事。所以,是的,这就是为什么我总说,我对公司在这件事上的表现真的非常自豪。
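The two levers described here can be sketched with a toy test-time scaling curve. Everything below is an assumption made for illustration: the saturating log-shaped `accuracy` function, its constants, and the `efficiency` and `tokens_per_second` parameters are hypothetical and are not measurements of any real model. Research gains are modeled as getting more out of each thinking token, which shifts the curve left. Inference gains are modeled as turning the same token count into less wall-clock time.

```python
import math


def accuracy(thinking_tokens: float, efficiency: float = 1.0) -> float:
    """Toy curve: accuracy grows with the log of effective thinking tokens."""
    effective_tokens = thinking_tokens * efficiency
    return min(1.0, 0.2 + 0.1 * math.log2(1.0 + effective_tokens / 100.0))


def latency_seconds(thinking_tokens: float, tokens_per_second: float) -> float:
    """Convert the x axis from tokens to wall-clock time."""
    return thinking_tokens / tokens_per_second


# Research lever: a model that is 2x more token-efficient reaches the same
# accuracy with half the thinking tokens.
print(accuracy(2000, efficiency=1.0), accuracy(1000, efficiency=2.0))

# Inference lever: the same 1000 tokens served at a higher speed.
print(latency_seconds(1000, tokens_per_second=50),
      latency_seconds(1000, tokens_per_second=100))
```

Combining both levers gives the plot users actually care about: performance on the y axis against latency on the x axis.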
Speaker 2 11:03 - 11:11
Okay. Great. Let's talk about you for a minute. So you are on the Post-Training Frontiers team. You described that team as horizontal.
好的,很好。那我们来聊聊你本人一分钟。你所在的是 post training frontiers team。你把这个团队描述为一个 horizontal(横向)团队。
Speaker 2 11:11 - 11:14
What does the team do in general?
这个团队总体上是做什么的?
Speaker 1 11:14 - 11:30
Yeah. I would say there are three things that we do. In a broad sense, we are in the post-training org, and my team is the Post-Training Frontiers one. So there are three things that my team does. Number one is that we kind of decide what goes into the final run.
是的。我会说我们做三件事。从更宏观的角度来说,我们属于 post training org,而我的团队是其中的 post training frontiers team。所以,我的团队做三件事。第一件事是,我们会决定哪些内容进入最终的 run(训练运行)。
Speaker 1 11:30 - 12:06
As we talked about before, there are many verticals, and someone needs to decide what can go in and what cannot, and also provide the science experiments for people to iterate on something that is going to be representative of the final run. So this is the first thing that my team does. The second thing that my team does is bring everything together and actually do the big run. As you might imagine, we train on a good amount of GPUs, so there's a lot of infra work that is needed, but there's also a lot of ML work needed to put everything together and make sure things work well together. And then the third thing that my team does is horizontal improvements to the models.
就像我们之前谈到的,有很多 verticals(纵向方向),而必须有人来决定什么可以放进去、什么不能放进去,同时还要为大家提供科学实验,让他们可以在某个能够代表最终 run 的对象上进行迭代。这是我团队做的第一件事。第二件事是把所有东西整合起来,并且真正执行那次大型 run。正如你可以想象的,我们会在相当数量的 GPUs 上训练,所以这里既需要大量 infra(基础设施)工作,也需要大量 ML(机器学习)工作,把所有东西拼在一起,并确保它们能够很好地协同工作。然后,我团队做的第三件事,是对模型进行横向改进。
Speaker 1 12:06 - 12:31
Basically, there are some things that these vertical teams will not usually look at too much. For example, the thinking time, as I said before. So how long should the model think for on certain answers? Or instruction following, function calling, things like memory, and general improvements to the model that are really across the stack. So that's what the Post-Training Frontiers team does, and I'm leading that team.
基本上,有些事情这些垂直团队通常不会太多关注。比如我之前说过的思考时间。也就是模型在某些回答上应该思考多久?或者 instruction following(指令遵循)、function calling(函数调用)、memory(记忆)这类东西,以及那些真正贯穿整个技术栈的通用模型改进。所以这就是 Post-Training Frontiers team 在做的事,我在带这个团队。
Speaker 2 12:31 - 12:35
Okay. Great. And what was your journey into OpenAI?
好的,很棒。那么,你是怎么来到 OpenAI 的?
Speaker 1 12:35 - 13:00
Oh, it's a long story, but I'll try to keep it really short. Basically, I did my undergrad in biomedical engineering in Switzerland. I'm from Switzerland. And then I went on an exchange in Canada, and I learned about word2vec. I don't know if you've heard about this algorithm, but it basically takes words, which are something discrete, and puts them in a vector space.
哦,这是个很长的故事,不过我尽量说得很短。基本上,我在 Switzerland 读的是 biomedical engineering(生物医学工程)本科,我来自 Switzerland。后来我去 Canada 做了一次交换,并在那里了解到 word2vec。我不知道你有没有听过这个算法,但它基本上是把词语,也就是某种离散的东西,放进一个 vector space(向量空间)里。
Speaker 1 13:01 - 13:30
A way to think about it is as a plane where words that are more similar to one another will be closer to one another. So it brings these discrete words into some continuous space that is semantically meaningful. I was absolutely blown away by that algorithm. And that's when I decided that I wanted to work on natural language processing and just understanding language. At that time, I was very wrong, but I thought that English NLP was basically solved, or close to being solved.
所以,它相当于把词放进一个平面里来理解:如果词语彼此更相似,它们之间的距离就会更近。也就是说,它把这些离散的词带入某种在语义上有意义的连续空间。这个算法当时让我震撼不已。也正是在那时,我决定自己想做 natural language processing(自然语言处理),想研究语言理解。那时候我的判断其实非常错,但我当时以为 English NLP 基本已经解决了,至少也接近解决了。
Speaker 1 13:30 - 14:04
That was in 2017. So that was right when Transformers started, or actually right before Transformers. So I was very wrong, but I decided that I wanted to work on under-researched languages. Basically, I wanted to improve NLP on languages where we don't have that much data. So I went to work for Grab in Singapore, and I was basically building the natural language processing pipeline for them, working with Khmer, Bahasa, Thai, Vietnamese, and all these different languages.
那是在 2017 年。所以那正好是 Transformers 刚开始的时候,甚至其实还是在 Transformers 之前。所以我当时非常错,但我还是决定要去做那些研究不足的语言。基本上,我想提升那些数据并不多的语言上的 NLP。于是我去了 Singapore 的 Grab 工作,基本上是在为他们搭建 natural language processing pipeline(自然语言处理流水线),处理 Khmer、Bahasa、Thai、Vietnamese,以及各种不同的语言。
Speaker 1 14:04 - 14:18
And then I'm skipping a little bit. I did more academic types of work in different countries, and I ended up at Stanford and did my PhD there. After this, I had a small stint in startups and then went to OpenAI.
然后我稍微跳过一点。后来我又在不同国家做了一些更偏学术类型的工作,最后去了 Stanford,在那里读了 PhD。再之后,我短暂做过一段 startup 相关的工作,然后就去了 OpenAI。
Speaker 2 14:18 - 14:30
Yes. And I remember seeing on your blog or your page a note for quant firms to not reach out to you because you were not interested in hedge fund work.
是的。我记得我在你的 blog 或个人页面上看到过一条说明,说让 quant firms 不要联系你,因为你对 hedge fund 相关工作不感兴趣。
Speaker 1 14:31 - 14:43
Yeah. I always think it's very important for me to think about the positive impact that I'm having in the world, or at least that I'm trying to have. Yes. So that's why this thought is there.
对。我一直觉得,对我来说,思考自己正在给这个世界带来什么样的积极影响,或者至少思考自己试图带来什么样的积极影响,是非常重要的。对。所以这就是我会写下那个想法的原因。
Speaker 2 14:43 - 15:05
Yes. And as we were saying just before we started recording, people may have seen you in the GPT 5 video announcement, where you did this very funny demonstration of an app that was built on the fly to teach your partner how to speak French. So people should go check that out.
对。就像我们在开始录制前刚刚说的那样,大家可能已经在 GPT 5 的视频发布公告里见过你了;你当时做了一个特别有趣的演示:现场临时搭了一个 app,用来教你的伴侣说法语。所以,大家应该去看看那个。
Speaker 1 15:06 - 15:17
Exactly. That was a fun one. GPT 5 was not that reliable, so I was a little bit stressed that it wouldn't work, but it ended up working.
没错。那个很有意思,真的很有意思。GPT 5 当时其实没那么可靠,我还有点紧张,怕它跑不起来,不过最后还是成功了。
Speaker 2 15:17 - 15:19
So this was truly live and...
所以那次确实是真正的 live(现场),而且——
Speaker 1 15:19 - 15:20
Okay. It was truly...
好吧。它确实是——
Speaker 2 15:21 - 15:23
...very rehearsed, but truly live.
经过了很多排练,但也确实是真正的 live(现场)。
Speaker 1 15:24 - 15:37
Actually, right before we did that, in the last rehearsal, it did not work. So I got slightly stressed about that, but, yeah, it seems like the live one ended up working well.
实际上,就在我们上场前、最后一次 rehearsal(彩排)的时候,它是没成功的。所以我当时有点紧张,不过,嗯,最后真正 live(现场)的时候效果还是很好。
Speaker 2 15:37 - 15:42
Yeah. No pressure. But that landed perfectly. Okay. Very cool.
对。完全没有压力啊。不过,那个效果确实落得非常漂亮。好,太酷了。
Speaker 2 15:42 - 16:19
Alright. So let's get into some of the things we alluded to in the intro. We started effectively talking about reasoning, and I'm curious what reasoning means in 2026 that's any different from, you know, a conversation we could have had about o1 or o3. In particular, one of the claims of 5.5, and also my experience as a user, is that it's particularly good with messy data, which seems to imply that it needs to reason through ambiguity more. What has changed?
好,那我们来聊聊我们在开场时提到的一些事情。我们其实已经开始谈 reasoning(推理)了,我很好奇,到了 2026 年,reasoning 到底意味着什么,和我们当年讨论 o1 或 o3 时相比,有什么不同。尤其是,关于 5.5 的一个说法(也是我作为用户的实际体验)是它在处理 messy data(杂乱数据)方面特别强,这似乎意味着它需要更多地在模糊性中进行推理。到底发生了什么变化?
Speaker 1 16:20 - 17:23
What I would say is that o1 and o1-preview were really, really breakthroughs in the research community about having models that can think, where the longer they think, the higher the likelihood that they are correct. So that was really a breakthrough, but initially, if you look at old blog posts, you would mostly see math evals and maybe coding competitions, things where it is really easy to test whether you're correct or not. That also gives you some suggestion about how we were training some of these models. And how I see all of last year, especially the end of last year and the beginning of this year, is that we were able to take these algorithms that work with verifiable rewards, things where we can say you're correct or you're not, to the messy real world, and really optimize for the utility that we provide to users, like making them more productive. So I think that's what really changed.
我想说的是,o1 和 o1-preview 在研究社区里确实是非常非常重大的突破,因为它们让我们看到了模型可以“思考”,而且它们思考得越久,答对的概率通常就越高。所以那的确是一个突破。但在最初,如果你去看以前的 blog posts,看到的大多会是数学评测,也可能是编程竞赛这类任务,也就是那些非常容易检验对错的事情。这其实也能在某种程度上反映出我们当时是如何训练其中一些模型的。而我对去年,尤其是去年年底到今年年初这段时间的理解是:我们已经能够把这些原本适用于 verifiable rewards(可验证奖励)、也就是我们可以明确判断“你对了还是错了”的算法,带到混乱得多的现实世界中,真正去优化我们为用户提供的 utility(效用),比如让他们更高效、更有生产力。我认为这才是真正发生变化的地方。
Speaker 2 17:23 - 17:30
Okay. So it's the post-training reinforcement learning part, largely?
好的。所以主要是 post-training(后训练)里的 reinforcement learning(强化学习)这部分在起作用?
Speaker 1 17:30 - 18:20
Yeah. I would say that's a big part of it, but there's also another big part. The first thing is that, of course, when you develop a new method, the method is kind of fragile, not that reliable, and hard to productionize, so this part also improved a lot. But then it's also that we had a tool that we could start optimizing for different things. Initially, when we were developing this tool, we were making a lot of simplifying assumptions about the real world. Now we are removing these simplifying assumptions, and at least in post-training, we are able to really optimize user utility and make sure that these models are useful and that the tasks we are looking at are useful. That's also why current evals look much more realistic.
对。我会说这当然是其中一个很大的部分——不过还有另一个重要部分。第一点是,当你开发一种新方法时,这种方法一开始通常比较脆弱,可靠性也没那么高,基本上很难真正 productionize(投入生产环境),所以这方面后来也改进了很多。但同时更关键的是,我们手上有了一个可以开始针对不同目标去优化的工具。最开始开发这个工具时,我们其实对现实世界做了很多简化假设。现在我们正在移除这些简化假设,至少在 post-training(后训练)阶段,我们已经能够真正去优化 user utility(用户效用),并确保这些模型是有用的,而且我们所关注的任务本身也确实有用。这也是为什么现在的 evals(评测)看起来现实得多。
Speaker 1 18:21 - 18:33
I mean, if you think about GDPval, or even if you look at SWE-bench Pro or SWE-bench, these look way more realistic than, let's say, some Codeforces or coding competitions that we were looking at with o1.
我的意思是,如果你想想 GDPval,或者哪怕看看像 SWE-bench Pro 或 SWE-bench 这样的东西,它们看起来都要比我们当时用 o1 去看的那些 Codeforces 或编程竞赛之类的任务现实得多。
Speaker 2 18:33 - 18:51
Mhmm. And still on the topic of reasoning, what's ultimately the difference between 5.5 Thinking and 5.5 Pro? Is that just more test-time compute, more tokens, and more time invested in solving a problem?
嗯,嗯。还是继续说 reasoning(推理)这个话题,5.5 thinking 和 5.5 pro 到底最终有什么区别?是不是就只是投入了更多 test-time compute(测试时算力)、更多 token,以及更多解题时间?
Speaker 1 18:51 - 19:33
Yes. Basically, it's just a question of how much test-time compute we pour into the model, or into this entire system that we're shipping. We've seen again and again that the longer the model thinks for, the better the answers we get. The problem is that these curves that we're talking about are definitely not linear, and there's some plateauing effect. They kind of look logarithmic in some sense, depending on the eval. So you can pour in two times more compute and actually only get small performance gains.
对,基本上就是这个问题:我们往模型里,或者更准确地说,往我们正在交付的整个 system(系统)里,投入多少 test-time compute(测试时算力)。我们一次又一次看到,模型思考得越久,得到的答案通常就越好。问题在于,我们这里谈的这些曲线显然不是线性的,会有某种平台效应,从某种意义上看它们有点像对数曲线,当然也要看具体是哪种 eval(评测)。所以你可能多投入两倍 compute(算力),最后实际上只得到一点点性能提升。
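A small numeric sketch shows why doubling compute buys little on a roughly logarithmic curve. The `toy_score` function and its constants (a base of 50 points and 5 points per doubling) are assumptions invented for this illustration, not figures from any eval. Under a log curve, each doubling of compute adds the same fixed increment, so the compute cost of every extra point keeps growing.

```python
import math


def toy_score(compute: float) -> float:
    """Hypothetical eval score that grows with the log of test-time compute."""
    return 50.0 + 5.0 * math.log2(compute)


previous = None
for compute in (1, 2, 4, 8, 16):
    score = toy_score(compute)
    gain = None if previous is None else score - previous
    print(f"compute={compute:2d}x score={score:.1f} gain_from_doubling={gain}")
    previous = score
```

Going from 8x to 16x compute costs eight more units than going from 1x to 2x, yet both steps add the same number of points in this toy model.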
Speaker 1 19:34 - 20:10
I personally don't use Pro that much because I'm pretty impatient, so I don't like waiting for that long. I know that the probability of being correct definitely improves, but it doesn't improve enough for me to use it. But there are some people who use Pro and who really love it, especially for academic research. I know a lot of mathematicians in particular who are using it, and that's because they just have this running in the background for maybe one hour or two hours, and they don't really need to iterate quickly with the model. And Pro is really good for that.
我个人其实不怎么用 Pro,因为我真的很不喜欢等待——我挺没耐心的,所以不喜欢等那么久。而且我也知道,它答对的概率确实会提高,但还没有高到足以让我去用它。不过确实有一些人会用 Pro,而且非常喜欢它,尤其是在 academic research(学术研究)里。特别是我知道有很多数学家在用它,因为他们基本上可以把它放在后台跑上一两个小时,而不太需要和模型进行非常快速的迭代。对于这种场景,Pro 就非常合适。
Speaker 2 20:10 - 20:32
I'd love to reconcile this with what you were mentioning earlier about efficiency per token. So is the idea that you would be able to think longer, but also be more efficient, and therefore solve the task better? How do the time aspect and the efficiency aspect interact?
我想把这点和你前面提到的 per-token(每个 token)效率联系起来理解。所以你的意思是,模型既可以思考更久,同时也可以更高效,因此把任务完成得更好吗?也就是说,这里面“时间”这个维度和“效率”这个维度,二者之间究竟是怎么相互作用的?
Speaker 1 20:32 - 21:22
Yes. So if you go back to the plot that I was talking about, where on the x axis we have latency and on the y axis we have performance: when we say that we improve efficiency, we're basically moving this curve more and more to the left, so we spend less time to achieve the same performance. What Pro does is extend this curve. It says, I'm going to think for much longer, but I will have a higher likelihood of being correct. Every iteration of the Pro model also moves to the left, so it also becomes more and more efficient. The important part is that there will always be tasks where you just want to maximize the probability of correctness and you don't really care about latency. For example, if I start a job before going to sleep, the model has eight hours.
是的。所以如果回到我刚才提到、我当时在想的那张图,横轴是 latency(延迟),纵轴是 performance(性能),那么当我们说不断提升 efficiency(效率)时,本质上是在把这条曲线不断向左移动,也就是我们变得更高效了,或者说,我们花更少的时间就能达到同样的性能。而 Pro 所做的,是把这条曲线向外延展。也就是说,它会“思考”更久,但它得出正确答案的概率也会更高。不过,pro model 的每一次迭代也都会把曲线继续向左推,所以它同样也会变得越来越高效。重要的是,总会存在一些任务,在这些任务里你只想尽可能最大化正确率,而根本不在乎 latency。比如说,如果我在睡觉前启动一个任务,我的意思是,model 有整整八个小时。
Speaker 1 21:22 - 21:28
It should just think for as long as it can. And this is kind of what Pro gives you.
就像,它应该能想多久就想多久。而这某种程度上就是 Pro 提供给你的东西。
Speaker 2 21:28 - 21:41
And in layman's terms, what does that mean practically? Or how does that work practically? If the model goes in the wrong direction, then it would interrupt itself earlier? Is that one of the axes?
那么,用外行能懂的话来说,这在实际中到底意味着什么?或者说,实际运作起来是怎样的?如果 model 走错了方向,它会更早地打断自己吗?这是其中一个维度吗?
Speaker 121:42 - 21:46
So for the efficient Okay. So there's two things. Are are you asking for the efficiency? What does it mean? Speaker 121:42 - 21:46
好的,关于 efficiency 这块。这里其实有两件事。你是在问 efficiency 吗?它到底是什么意思?
Speaker 221:46 - 21:48
Yeah. The for the efficiency. Yeah. Yeah. Yeah. Speaker 221:46 - 21:48
对,问的就是 efficiency。对,对,对。
Speaker 221:48 - 21:54
Yes. Largely for the efficiency. I'm just curious how reasoning gets more powerful. Speaker 221:48 - 21:54
是的,主要是关于 efficiency。我只是好奇,reasoning(推理)是怎么变得更强的。
Speaker 121:54 - 22:40
Yes. That's a good question. Let me give you maybe a metaphor from humans. If you have someone who's an expert in a certain domain and you compare them to some undergrad who is starting in that domain, the undergrad doing that task will probably take one day, two days, and will have to think through a lot of the possibilities and investigate, because they never did that kind of problem, while someone who's an expert in that field will usually just know what direction to take, and will not spend the time investigating 10 different directions, because they know that there's one that is more likely to be correct. So this is the type of efficiency that we're talking about. Speaker 121:54 - 22:40
是的,这是个好问题。我可以用人类作个比喻。你找一个在某个领域里的 expert(专家),再把他和一个刚进入这个领域的 undergrad(本科生)相比。那个 undergrad 去做这项任务时,很可能要花一天、两天,还得把很多种可能性都想一遍、调查一遍,因为他以前没做过这类问题;而一个在那个领域里的 expert,通常会直接知道该往哪个方向走,也不会花时间去研究 10 个不同方向,因为他知道其中有一个方向更可能是正确的。所以,这就是我们所说的那种 efficiency。
Speaker 122:40 - 23:10
It's basically models where we optimized more on real world problems. As a result, they were kind of trained to figure out with a higher likelihood which paths of reasoning are more likely to be correct. So this is one part of efficiency. There's also what you suggested, which is that part of it is the model knowing when it's going down the wrong path. But this is also something that the model can be trained for with reinforcement learning. Speaker 122:40 - 23:10
本质上,这类 model 更多是围绕 real world problems(现实世界问题)做了优化。因此,它某种程度上被训练成了能以更高概率判断出哪些 reasoning paths(推理路径)更可能是正确的。这是一部分 efficiency。还有一部分就是你刚才提到的:model 知道自己什么时候走上了错误的路径。不过,这同样也是我们可以、也是 model 可以通过 reinforcement learning(强化学习)来训练获得的能力。
Speaker 123:10 - 23:22
It's like knowing, like, that seems like not a great path. Let me backtrack and let me go and test something else. And if you train the model less, it might realize that it's in the wrong path much later. Speaker 123:10 - 23:22
这就像是知道,嗯,这条路看起来不太好。那我就回退一下,去测试别的东西。如果你对模型训练得更少,它可能要到更晚才会意识到自己走错了路。
Speaker 223:22 - 23:43
Okay. All right. So it seems like a lot of this goes back to reinforcement learning and post training. So let's talk about how the different components of modern AI systems work. So let's talk about pre training, mid training, and then post training and spend more time on post training since it's so important. Speaker 223:22 - 23:43
好,明白了。所以看起来这里很多东西都要回到 reinforcement learning(强化学习)和 post training(后训练)。那我们来谈谈现代 AI 系统的不同组成部分是如何工作的。我们来说说 pre training(预训练)、mid training(中期训练),然后是 post training,并且在 post training 上多花一些时间,因为它非常重要。
Speaker 223:44 - 24:19
Starting with pre training first, at a high level, and realizing that you may or may not be able to talk about how things are done or what happened in the context of 5.5 specifically. The big narrative of last year was that pre training was hitting a wall and was not going to yield much progress. That seems to not be the case at all in 2026. Can you walk us through some ideas for what is happening in pre training and why it's progressing now in a way that people hadn't predicted last year? Speaker 223:44 - 24:19
先从 pre training 的高层面开始,同时也考虑到你可能未必能谈太多关于这些事情具体是怎么做的,或者 5.5 这个特定背景下到底发生了什么。去年的一个主流叙事是,pre training 已经撞墙了,不会再带来太多进展。但到了 2026 年,这看起来完全不是事实。你能不能给我们梳理一下,pre training 里到底发生了什么,为什么它现在还能继续推进,而且是以一种去年大家没有预料到的方式在进步?
Speaker 124:19 - 25:28
For pre training, I can't talk in a lot of detail about what's happening internally, besides that the team has really been doing a lot of good work, and our models are really getting better and better. One thing that I do want to highlight, since we're talking about efficiency: if you have larger models, the amount of thinking time, so the number of tokens they will think for, will usually decrease. And the way that you can think about it is that, metaphorically, the model already thinks through its weights when it generates a certain token, so you can decrease the number of tokens that it needs to generate for thinking by increasing the size of the model that you are training. Oftentimes, if you just increase the model size, if you basically pre train larger models, you will get better efficiency, and the good thing with larger models is that they can be parallelized better at inference time. Speaker 124:19 - 25:28
关于 pre training,我不能非常详细地谈内部到底在发生什么。除此之外,团队确实一直在做很多很好的工作,我们的模型也确实在变得越来越好。比如说,在讨论 efficiency(效率)时,我想强调一点:如果你有更大的模型,它所需要的 thinking time(思考时间),也就是它在思考时会用掉的 tokens,通常会减少。你可以这样理解:打个比方,模型在生成某个 token 时,其实已经通过它的 weights 在“思考”了,所以通过增大你训练的模型规模,你就可以减少它为了思考而需要生成的 tokens 数量。很多时候,如果你只是增大模型规模,也就是预训练更大的模型,你会获得更好的效率;而且更大模型的好处是,它们在 inference(推理)时通常也能被更好地并行化。
Speaker 125:28 - 26:21
So you might think: you actually generate fewer tokens, but with a larger model, so you might actually decrease the overall efficiency of the system. This is not true, because the larger the model is, the more chances you have to actually optimize, basically, for inference on GPUs, so you will be able to make the overall system more efficient. So that's one thing I wanted to say: larger models are actually giving you a lot of efficiency. Otherwise, in terms of pre training, I think it's very interesting. I actually also thought maybe two years ago that pre training was kind of hitting a wall, and when we see, for example, if we talk just about Anthropic, I mean, Mythos seems like clearly just a much bigger model when you look at the cost. Speaker 125:28 - 26:21
所以尽管你可能会觉得,确实,生成的 tokens 更少了,但模型更大了,因此系统整体效率可能反而下降。事实并非如此,因为模型越大,你越有机会真正把 inference 在 GPUs 上优化到更高效,所以你实际上能够让整个系统变得更高效。这就是我想说的一点:更大的模型其实会带来很多效率收益。除此之外,就 pre training 而言,我觉得这件事非常有意思。其实我自己大概两年前也觉得 pre training 有点撞墙了;而且比如说,如果我们只谈 Anthropic,我是说,Mythos 从成本上看显然就是一个大得多的模型。
Speaker 126:23 - 26:49
The cost of the model, usually that's how you know, by the way, that it's a bigger model: you just look at the cost per token. And clearly, they are getting very good performance just by increasing the size of the model. So I think the field, or at least part of the field, was surprised about that. There were a lot of conversations about hitting data walls, and it seems like we did not quite hit it. Speaker 126:23 - 26:49
模型的成本,通常这也是你判断的方法之一。顺便说一句,是不是更大的模型,你只要看每个 token 的成本就知道了。而他们显然只是通过增大模型规模,就拿到了非常好的表现。所以我认为这个领域里,至少其中一部分人,对此是感到惊讶的。之前有很多关于撞上 data wall(数据墙)的讨论,但看起来我们并没有真正撞上它。
Speaker 126:49 - 27:02
So the larger the model is, the more data it needs to ingest to be trained. And it seems like different companies kind of found different ways to overcome the fact that we don't have that much data on the internet. Speaker 126:49 - 27:02
所以模型越大,训练时就需要摄取越多数据。而看起来,不同公司某种程度上都找到了不同的方法,来克服这样一个事实:互联网上并没有那么多可用数据。
Speaker 227:03 - 27:12
Is the next frontier or the current frontier for data multimodal data? Is it synthetic data? Speaker 227:03 - 27:12
数据的下一个前沿,或者说当前的前沿,是 multimodal data(多模态数据)吗?还是 synthetic data(合成数据)?
Speaker 127:12 - 27:55
I think synthetic data can probably work well in a data limited regime. I think multimodal is an interesting one. I definitely cannot talk about what we do internally, but I used to work on multimodal representation learning back in the day, and I always thought that it would really help your reasoning abilities if you have a lot of multimodal data. And I still think this, but for example, if you look at Anthropic models, they tend to not be that good on multimodal, and they are still really smart. So it seems that it's not as necessary as at least I would have thought in the past. Speaker 127:12 - 27:55
我认为 synthetic data(合成数据)在数据受限的场景里很可能会表现很好。我觉得 multimodal(多模态)是个很有意思的方向。我当然不能谈我们内部在做什么,但我以前做过 multimodal representation learning(多模态表征学习),而且我一直觉得,如果你拥有大量 multimodal data(多模态数据),它会非常有助于提升推理能力。我现在仍然这么认为,但例如,如果你看 Anthropic 的模型,它们在 multimodal 上往往并没有那么强,但依然非常聪明。所以看起来,至少没有我过去以为的那么必要。
Speaker 127:56 - 28:20
I still believe that once we go to embodied agents, embodied AI, you will learn a lot about the world, and you will kind of improve general intelligence and usefulness to users by learning how the world interacts with itself. But at least looking, for example, at Anthropic models, it seems like they don't need that much multimodal data to have a strong model. Speaker 127:56 - 28:20
我仍然相信,一旦我们走向 embodied agents(具身 agent)、embodied AI(具身 AI),你就会学到很多关于世界的东西,而且通过学习世界如何与自身交互,你会在 general intelligence(通用智能)和对用户的实用性上得到提升。但至少从一些例子来看,比如 Anthropic 的模型,似乎它们并不需要那么多 multimodal data,也能成为很强的模型。
Speaker 228:21 - 28:33
And by embodied intelligence, you mean potentially robotics. And so if you use a video that shows how gravity works and how a robot evolves in space, then presumably that would be more useful. Is that the thought? Speaker 228:21 - 28:33
你说的 embodied intelligence,意思就是也可能包括 robotics。也就是说,如果你使用一段 video,展示重力如何运作,以及一个机器人如何在空间中运动,那么按理说这会更有用。你的意思是这样吗?
Speaker 128:34 - 29:23
Yes. The intuition that I think many people had, and I definitely had for a long time, is that it's hard to understand the world only through text. It's hard to understand what physics is without really seeing it. For example, you can't understand gravity without really seeing things falling. And when you look at our models, I mean, they kind of understand gravity without having seen that, but it still seems not obvious. Like, it still seems like they would get it more, and like they are still kind of missing some common sense aspects. So I do feel like we will improve the common sense of our models by having them interact in the real world. But we're still pretty far from that, I think. Speaker 128:34 - 29:23
是的。我认为很多人都曾有过、而且我自己也长期相信的一种想法或直觉是:仅仅通过 text,很难理解这个世界。而且,很多东西确实很难只靠文字去理解,比如如果你不真正看到物体下落,就很难理解 gravity。可当你看我们的模型时,我的意思是,它们某种程度上在没见过这些的情况下也理解了 gravity,但这件事仍然并不显然。也就是说,感觉它们本来应该还能理解得更多,而且它们现在仍然缺少一些 common sense(常识)层面的东西。所以我确实觉得,让模型在真实世界中进行交互,会提升它们的 common sense。但我认为我们离那一步还很远。
Speaker 129:23 - 29:29
And by we, I mean, just generally the academic community and the AI community seems pretty far from Speaker 129:23 - 29:29
我说的“我们”,指的是更广义上的 academic community 和 AI community;看起来大家距离那个目标都还相当远。
Speaker 229:29 - 29:42
Yeah. And while we're on the topic, a quick detour that leads us to the concept of world models. Taking your OpenAI hat off, are you bullish on world models? Speaker 229:29 - 29:42
对了,既然说到这个话题,稍微岔开一下,这也会把我们带到 world models(世界模型)这个概念。先不站在 OpenAI 的立场上,你看好 world models 吗?
Speaker 129:42 - 30:49
World models in the sense that you can try to replicate or simulate things, basically work in an environment that is simulated: yes. The problem is simulations are always going to be really hard and are not going to be truthful. So I think there will always need to be at least a little bit of training that happens in the real world, to make sure that the model realizes these mismatches between the simulated world and the real world. And I think we as a field have a tendency of optimizing something that is simulated or not quite realistic past the point where this is useful. So that's something that I think we should always be careful with. We spend a lot of time and effort on optimizing something simulated and not quite realistic, and it's great at the beginning, but at some point, once you start optimizing too much for something, it's not representative of the real world, and people continue doing that just because that's what they've been doing for a long time. Speaker 129:42 - 30:49
如果这里说的 world models,是指你可以尝试去复制或模拟事物,基本上在一个被模拟出来的 environment(环境)里工作,那么是的,我看好。问题在于,simulations(模拟)总是会非常困难,而且不可能完全真实。所以我认为,始终都需要有一定程度、哪怕只是一点点、发生在真实世界中的训练,以确保模型能够意识到 simulated world 和 real world 之间的这些不匹配。我还觉得,我们这个领域有一种倾向,就是把一个模拟出来的、或者不那么真实的东西优化到超过它真正有用的程度。这是我认为我们必须始终小心的事情:我们会花大量时间和精力去优化一个模拟的、并不完全真实的对象,起初这当然很好,但到了某个节点,一旦你对某样东西优化过头了,它就不再能代表真实世界;可人们还是会继续这么做,只因为他们已经这么做了很长时间。
Speaker 130:49 - 31:05
So I just think people need to realize when to stop that. I don't work with these types of synthetic environments as much, just because I don't work on embodied AI. So I don't know if we're there yet. Speaker 130:49 - 31:05
所以我只是觉得,人们需要知道什么时候该停下来。我自己并没有那么多地接触这类 synthetic environments(合成环境),只是因为我并不做 embodied AI。所以我也不确定我们是否已经到了那个阶段。
Speaker 231:05 - 31:10
Okay. Great. Alright. So going back to pretraining, mid training, post training. Let's talk about mid training. Speaker 231:05 - 31:10
好。很好。好,那回到 pretraining、mid training、post training。我们来聊聊 mid training。
Speaker 231:10 - 31:17
It's it's maybe something that people have heard about a bit less. The term comes up a bit less. What is it and why is it important? Speaker 231:10 - 31:17
这可能是大家相对较少听到的一件事,这个术语出现得也没那么频繁。它到底是什么,为什么重要?
Speaker 131:17 - 32:00
Mid training is, as you might realize from the name, just this idea of something that sits between pre training and kind of the post training part of the pipeline. And really, the idea is if you have high quality data that is more representative of what you really want in your final model, you should overtrain on that data. Taking a step back here, pre training, what is it? Pre training is basically trying to learn everything about the world by learning everything from the Internet, at a high level. The problem is that most things on the Internet are not really useful. Speaker 131:17 - 32:00
Mid training 的意思其实就是:它处在 pre training 和——从名字你也能听出来——pipeline(流程)里的 post training 之间。核心想法是,如果你有高质量数据,而且这些数据更能代表你最终希望模型具备的能力,那你就应该在这些数据上进行过采样式训练、更多地训练。退一步说,pre training 是什么?Pre training 本质上是在一个很高的层面上,通过学习 Internet 上的一切,来尽可能学习这个世界上的一切。问题在于,Internet 上的大多数东西其实并没有那么有用。
Speaker 132:01 - 32:19
If you think, for example, about Wikipedia or, like, GitHub, which is, like, coding data, it just seems like there's way more information in there than in some random forums. Yeah. Some random forums that maybe don't have that much information. Or, for example, ads. There's also lots of ads on the Internet. Speaker 132:01 - 32:19
比如说,如果你想想 Wikipedia,或者 GitHub 这类 coding data(代码数据),会觉得里面的信息显然比一些随机论坛多得多。对,一些随机论坛可能并没有那么多信息量。再比如 ads(广告)。Internet 上也有大量广告。
Speaker 132:19 - 32:44
Like, you probably don't wanna train too much on that. But in pretraining, you train on everything, and in mid training, we basically overweight this type of high quality data that we think is more useful for training the final model. And I can't talk about what is happening internally here, but it's something that is definitely happening in the whole academic community right now, and all the open source models have this stage of mid training. Speaker 132:19 - 32:44
这类内容你大概率并不想在训练时占太大比重。但在 pretraining 中,你是对一切都进行训练;而在 mid training 中,我们本质上就是给这类我们认为对最终模型训练更有用的高质量数据更高的权重。至于我们内部具体在做什么,我不能展开讲,但可以说,这绝对是现在整个 academic community(学术界)都在做的事情,而且所有 open source models(开源模型)都有这个 mid training 阶段。
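The overweighting described here can be pictured as changing the sampling weights of a data mixture. A minimal Python sketch follows; the source names and weight values are hypothetical, chosen only for illustration, and nothing in the conversation says how any lab actually sets its mixture.

```python
import random
from collections import Counter

# Hypothetical sources and weights (assumptions, for illustration only).
PRETRAIN_WEIGHTS = {"wikipedia": 1.0, "github": 1.0, "forums": 1.0, "ads": 1.0}
MIDTRAIN_WEIGHTS = {"wikipedia": 5.0, "github": 5.0, "forums": 1.0, "ads": 0.1}


def mixture_fractions(weights):
    """Normalize raw weights into the fraction of samples each source gets."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


if __name__ == "__main__":
    print("pretraining mix: ", mixture_fractions(PRETRAIN_WEIGHTS))
    print("mid training mix:", mixture_fractions(MIDTRAIN_WEIGHTS))
    print(Counter(sample_sources(MIDTRAIN_WEIGHTS, 1000)))
```

With these made-up numbers, the curated sources take a larger share of the mid training batches while the low-value source almost disappears, without being removed entirely.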
Speaker 232:44 - 32:52
Great. Post training: let's start at a high level by defining what that is. So there's reinforcement learning, but that's not the only part of post training. What else is there? Speaker 232:44 - 32:52
很好。说到 post training,我们先从高层面定义一下它是什么。这里面有 reinforcement learning(强化学习),但那并不是 post training 的全部。除此之外还有什么?
Speaker 132:52 - 33:39
It kind of depends how you define the term and where you put the boundaries. In my mind, post training, I'll take it in a very broad sense, which includes all the reinforcement learning and the training for our reasoning models. It's just the idea of going from something that knows everything about the world to something that is useful to people. Pre training, the metaphor that I like giving is you go into the library and you have a lot of books about everything, and in theory, you can find all the information that you want in the library. But it's much more useful to talk to an expert who has learned these books, who you can ask questions to, and they can answer, and they can understand what you're actually looking for. Speaker 132:52 - 33:39
这有点取决于你怎么定义这个术语,以及你把边界画在哪里。在我看来,post training 可以从一个非常宽泛的意义来理解,它包括所有 reinforcement learning,以及针对 reasoning models(推理模型)的训练。它的核心就是:把一个“知道世界上一切”的东西,变成一个“对人有用”的东西。对 pre training,我常用的比喻是:你走进一座图书馆,里面有大量关于各种主题的书,从理论上说,你可以在图书馆里找到你想要的所有信息。但更有用的是去和一位真正读过这些书的专家交流,你可以向他提问,他能回答你,而且他们还能理解你真正想找的到底是什么。
Speaker 133:40 - 34:04
So this is kind of the goal of post training at a very high level: making something that is useful to users and is easier to interact with. So there are multiple stages. I'll talk mostly, well, I'll talk only about things that are happening outside of OpenAI and kind of the usual stages. There's usually some SFT that is happening. Speaker 133:40 - 34:04
所以,从一个很高的层面看,post training 的目标,就是做出对用户有用、也更容易交互的东西。因此这里面有多个阶段。我主要会讲,准确说,我只会讲 OpenAI 之外正在发生的事情,以及一些常见阶段。通常会先有一些 SFT(监督微调)。
Speaker 234:04 - 34:06
Which is supervised fine tuning. Speaker 234:04 - 34:06
这就是 supervised fine tuning(监督式微调)。
Speaker 134:06 - 34:36
Supervised fine tuning. Yes. Supervised fine tuning, and that's actually what, early on, most of the models that were post trained were doing: only supervised fine tuning. The idea is that if you have humans that can give you the desired final answer, so, yeah, if you have humans that can give you the gold answer, you can basically clone the behavior of the human. So this is what we call behavior cloning. Speaker 134:06 - 34:36
Supervised fine tuning(监督式微调)。对,supervised fine tuning(监督式微调)。实际上,早期大多数经过 post training 的模型,做的基本都只有 supervised fine tuning(监督式微调)。它的核心想法是:如果有人类能够给出你想要的最终答案,也就是,如果你有人类能给出 gold answer(标准答案),那么你基本上就可以克隆这种人类行为。所以这就是我们所说的 behavior cloning(行为克隆)。
Speaker 134:36 - 35:10
The problem with this is that you will never get better than what your ground truth gives you. And humans are actually pretty limited in many, many senses, so you will never, like, overcome the human labelers that you're working with. The reinforcement learning stage goes from behavior cloning to really, like, optimizing rewards. So the idea is, I don't know what the ground truth is. I don't know what the perfect answer is, but here's how I would say whether the answer is correct or not. Speaker 134:36 - 35:10
这里的问题在于,你永远不可能比你的 ground truth(真实标注)更好。而人类在很多很多方面其实都相当有限,所以你永远无法超越与你合作的那些 human labelers(人工标注者)。reinforcement learning(强化学习)这个阶段,则是从 behavior cloning(行为克隆)走向真正去优化 reward(奖励)。它的想法是:我不知道 ground truth(真实答案)是什么,我也不知道完美答案是什么,但我可以告诉你,我会如何判断一个答案是不是正确。
Speaker 135:10 - 35:46
And here are the things that I want in the answer. And what you do is you start optimizing, you start having a model that tries to get more reward, basically optimizing this reward function more, that's what we call it. And it goes beyond what you currently have, what humans can do, or at least what the humans that you're working with can do. So these, I would say, are the two big stages. Then in reinforcement learning, it depends on which models are being trained; at least in the open source community, it seems that there are different ways of doing that. Speaker 135:10 - 35:46
以及,我希望答案里包含哪些东西。然后你要做的,就是开始优化:让模型去争取获得更多 reward(奖励),本质上就是去更充分地优化这个 reward function(奖励函数),我们就是这么称呼它的。这样一来,它就能超越你当前已有的水平,超越人类能做到的,至少是超越你正在合作的那些人类所能做到的。所以我会说,这是两个大的阶段。然后到了 reinforcement learning(强化学习)这里,具体又取决于你在训练哪一类模型;至少在 open source community(开源社区)里,看起来做法是有不同路线的。
Speaker 135:46 - 36:15
Reinforcement learning when you have verifiable rewards. So reinforcement learning where it's really easy to say whether something is correct or not, and you can really kind of have a binary reward for this, and that goes back to how we talked about o1 and o1-preview in the past. And then you have reinforcement learning without verifiable rewards, where maybe I could do pairwise comparisons. I can say this answer is better than this other one, but I don't really know. I cannot quite say this is the perfect answer. Speaker 135:46 - 36:15
一种是 reinforcement learning(强化学习)配合 verifiable rewards(可验证奖励)的情况。也就是那种很容易判断一件事对不对的强化学习,你几乎可以给出一个 binary reward(二元奖励),这也能对应到我们之前谈过的 o1 和 o1-preview。另一种则是 reinforcement learning(强化学习)without verifiable rewards(无可验证奖励),在这种情况下,也许我可以做 pairwise comparisons(成对比较),我可以说这个答案比另一个更好,但我并不真正知道,我没法明确地说这就是完美答案。
Speaker 136:17 - 37:10
So of course, like, it's a continuum and there's everything in between, but I would say these are like the three high level things to think about when you think about post training in general. How people are usually doing it in the open source world is that they take SFT, they clone the behavior that you can collect online or from humans, and then once it's already at a pretty good level, they just do this reinforcement learning to go beyond what we currently have. Because if you just started from reinforcement learning, it would be very inefficient. Because the problem with reinforcement learning is that you have to stumble across the right answer, basically. Because how reinforcement learning works is you sample many times, essentially, from the model that you're training, and you say this one is correct, this one is not, and you say do more of the one that is correct. Speaker 136:17 - 37:10
所以当然,这其实是一个 continuum(连续谱),中间还有各种各样的情况;但我会说,当你从整体上思考 post training(后训练)时,可以重点考虑这三个高层次的东西。人们在 open source(开源)世界里通常的做法是:先做 SFT(supervised fine tuning,监督式微调),把你能从线上收集到的、或者从人类那里收集到的行为先克隆下来;然后等模型已经达到相当不错的水平之后,再做 reinforcement learning(强化学习),去超越我们当前已有的能力。因为如果你一开始就直接做 reinforcement learning(强化学习),效率会非常低。原因在于,reinforcement learning(强化学习)的问题是:你基本上必须先“误打误撞”碰到正确答案。因为 reinforcement learning(强化学习)的工作方式,本质上是:你从正在训练的模型里做很多次采样,然后你说,这个是对的,这个是不对的,再告诉模型,多去做那些对的。
Speaker 137:10 - 37:21
So you have to stumble across the right solution. So you're much better off first getting as close as possible to the best you can do, and this is this behavior cloning, and then doing reinforcement learning. Speaker 137:10 - 37:21
所以你必须先“撞上”正确的解法。因此,更好的策略是,先尽可能让模型接近你所能做到的最佳水平。这一步就是 behavior cloning(行为克隆),然后再做 reinforcement learning(强化学习)。
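The "stumble across the right answer" point can be illustrated with a toy experiment. The sketch below is an assumption-laden stand-in, not anyone's training code: the "model" is just a softmax over 1,000 candidate answers, the reward is binary, and the update is plain REINFORCE. A cold start rarely samples the correct answer, while a warm start (standing in for behavior cloning) already gives it some probability. All names and constants are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(logits, correct, steps=200, samples=8, lr=0.5, seed=0):
    """Sample answers, reward only the correct one, reinforce it.

    Returns the final probability assigned to the correct answer.
    """
    rng = random.Random(seed)
    logits = list(logits)
    n = len(logits)
    for _ in range(steps):
        probs = softmax(logits)
        grad = [0.0] * n
        for a in rng.choices(range(n), weights=probs, k=samples):
            reward = 1.0 if a == correct else 0.0  # binary, verifiable reward
            if reward == 0.0:
                continue  # a zero reward contributes no gradient here
            # d log pi(a) / d logit_i = 1[i == a] - probs[i]
            for i in range(n):
                grad[i] += reward * ((1.0 if i == a else 0.0) - probs[i])
        logits = [l + lr * g / samples for l, g in zip(logits, grad)]
    return softmax(logits)[correct]


def start_logits(n, correct, boost):
    """Uniform logits, with `boost` added to the correct answer."""
    logits = [0.0] * n
    logits[correct] = boost
    return logits


if __name__ == "__main__":
    cold = reinforce(start_logits(1000, 0, boost=0.0), correct=0)
    warm = reinforce(start_logits(1000, 0, boost=5.0), correct=0)
    print(f"cold start: p(correct) = {cold:.4f}")
    print(f"warm start: p(correct) = {warm:.4f}")
```

In this toy setting the cold-start run barely moves because it almost never sees a rewarded sample, which mirrors the argument for doing behavior cloning before reinforcement learning.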
Speaker 237:21 - 37:29
Does reinforcement learning create new capabilities or does it make the model better at existing capabilities? Speaker 237:21 - 37:29
reinforcement learning(强化学习)会创造新的能力,还是只是让模型在已有能力上表现得更好?
Speaker 137:29 - 38:18
It's really hard to say because pre training, when it's trained on all of the internet, arguably already has all capabilities in it. So it would be even hard to answer this question scientifically because arguably everything is already there. What I would say is that if you look at models that we were training or that we were post training like two years ago in the open source world, For example, I worked on one of them, Alpaca, where we used 50,000 examples for SFT. And like now, when you look at reinforcement learning from models like Kimi or from DeepSeek models, it seems that they are closer to 1,000,000 data points. So definitely people scaled up a lot the reinforcement learning stage. Speaker 137:29 - 38:18
这其实很难说,因为 pre training(预训练)在模型基于整个互联网进行训练时,按某种说法就已经把所有能力都包含进去了。所以要用科学方式回答这个问题也会很难,因为可以说一切本来就已经在那里了。我想说的是,如果你看我们两年前在 open source 世界里训练、或者做 post training(后训练)的那些模型,例如我参与过的 Alpaca,当时我们用 50,000 个样本做 SFT(监督微调)。而现在,如果你看像 Kimi 或 DeepSeek 这类模型中的 reinforcement learning(强化学习),规模似乎已经接近 1,000,000 个 data points(数据点)。所以很明显,人们把 reinforcement learning 这一阶段的规模大幅提升了。
Speaker 138:20 - 38:53
From this, it seems that they've learned new capability, like this reasoning aspect, this fact that you can check your answer and try to improve it, so you can really think for longer to get a more correct answer. All this to say that arguably everything is already in pre training, but we were definitely able in the last one year and a half, even in the open source world, to have more capabilities after reinforcement learning than we used to before. Speaker 138:20 - 38:53
从这一点看,似乎它们学会了新的 capability(能力),比如这种 reasoning(推理)方面的能力:你可以检查自己的答案,并尝试改进它,因此你确实可以思考更久,从而得到更正确的答案。总之,可以说一切本来都已经在 pre training 里了,但可以肯定的是,在过去一年半里,即使是在 open source 世界中,我们也确实能通过 reinforcement learning 获得比以前更多的能力。
Speaker 238:53 - 39:20
I heard several times that reinforcement learning is pretty finicky and hard to scale, and part of the reason why we as an industry didn't do reinforcement learning as part of the initial LLM progress curve was precisely that: it was hard to make work. What is hard about scaling RL? Is that a question of datasets, of knowing where the rewards are? Is it that or something else? Speaker 238:53 - 39:20
我好几次听说 reinforcement learning 非常 finicky(脆弱、难调),而且很难 scale(扩展规模);我们整个行业一开始没有把 reinforcement learning 纳入 LLM(大语言模型)早期发展曲线的一部分,原因之一恰恰就是这一点——它很难真正跑通。RL(强化学习)在 scale 上到底难在哪里?这是 dataset(数据集)的问题、是要知道 reward(奖励)在哪里的问题,还是别的什么问题?
Speaker 139:20 - 40:04
I would say most people who did not work on reinforcement learning in the academic and, like, research community up to two years ago probably thought reinforcement learning just doesn't work and is, like, too finicky to work with. I used to be that type of person, and actually when I saw ChatGPT come out, they had this blog. I was not at OpenAI at the time. I saw this blog that says that they use reinforcement learning, and my first thought was I can do the same without reinforcement learning, because this is just an overcomplicated method, and this is actually the project that we started working on with Alpaca, which was exactly: let's try to reproduce that only using SFT, just by doing this behavior cloning. Yeah. And, like, for example, Yann LeCun famously gives this metaphor of, like, oh, reinforcement learning is just the cherry on top. Speaker 139:20 - 40:04
我会说,大多数在两年前之前没有在学术界或研究社区做过 reinforcement learning 的人,可能都会觉得 reinforcement learning 根本行不通,或者说它太 finicky,难以实际使用。我以前就是这种人。事实上,当我看到 ChatGPT 刚出来时,他们发了一篇 blog;那时我还不在 OpenAI。我看到那篇 blog 说他们用了 reinforcement learning,我的第一反应是:不用 reinforcement learning 我也能做出同样的东西,因为这只是一个过度复杂的方法。而这实际上正是我们后来在 Alpaca 上开始做的项目:试着只用 SFT,通过这种 behavior cloning(行为克隆)来复现它。对。还有,比如 Yann LeCun 就很有名地打过这样一个比方:reinforcement learning 只是蛋糕上的樱桃。
Speaker 140:04 - 40:57
So I think that was really, like, the intuition that most people had. It seems that after crossing a certain scale of models that know basically everything about the world, with what we call, like, good priors about the world, reinforcement learning just started to work. And this is not only with LMs. Robotics seems to be entering the same stage, where they're realizing that it used to be very finicky, but now that we use models that already know everything about the world, it actually learns pretty well. Now, to answer your question about what is still complicated with reinforcement learning, one is an infra aspect, so just, like, systems in general. In reinforcement learning you have, at a very high level, to sample, as I said before, many answers and say what is correct and what is not. Speaker 140:04 - 40:57
所以我觉得,这确实就是当时大多数人的直觉。看起来,一旦跨过了某个规模门槛,模型基本上已经“知道”关于世界的一切,并且具备我们所说的关于世界的 good priors(良好先验),reinforcement learning 就开始奏效了。而且这不只是发生在 LM(语言模型)上。Robotics(机器人学)似乎也在进入同样的阶段:他们意识到,它以前确实非常 finicky,但现在因为我们使用的是已经“知道”世界中几乎一切的模型,它实际上学得相当好。现在回答你关于 reinforcement learning 还有什么仍然复杂的问题:其中一个是 infra(基础设施)层面。所以从很高的层次上说,reinforcement learning 本质上就是要像我前面说的那样,对很多答案进行 sample(采样),然后判断什么是正确的、什么是不正确的。
Speaker 140:59 - 41:42
Like this sampling is just very expensive and you have to do it at scale. The other issue that also in the open source world people are seeing right now is that when we are training more agentic systems, you only know whether you're correct at the end of your very long rollout. So you get very little information per token of whether you were correct or not, and it's hard to say it's hard to basically do attribution. It's hard to say what part of your entire answer was the one that led you to be incorrect. So that's more of an issue on the machine learning side. Speaker 140:59 - 41:42
这种 sampling(采样)本身就非常昂贵,而且你还必须在大规模下去做。另一个问题是,现在 open source 世界里的人也正在看到:当我们训练更 agentic(具备 agent 特征、自主行动)的系统时,你往往只有在很长的 rollout(展开过程)结束时,才知道自己是否做对了。所以对于每个 token(词元)来说,你能获得的“自己到底对没对”的信息非常少,也就很难做 attribution(归因)。很难说清,在你整段回答里,到底是哪一部分导致了最终出错。所以这更多是 machine learning(机器学习)层面的问题。
Speaker 141:43 - 42:01
The ideal world in machine learning is when I can say exactly like, this thing was good, do more of that. And the problem again with these agentic systems and reinforcement learning with agentic systems is that you don't really know which part was good or not until you arrive at the end. That's another big issue for reinforcement learning. Speaker 141:43 - 42:01
在 machine learning 的理想世界里,我可以非常明确地说:这个东西是好的,那就多做一点这个。问题在于,还是这些 agentic systems(agent 系统)以及面向 agentic systems 的 reinforcement learning:在你走到最后之前,你其实并不知道到底是哪一部分是好的、哪一部分是不好的。这是 reinforcement learning 的另一个大问题。
Speaker 242:01 - 42:17
What's the current frontier of reinforcement learning? It seems like there's a jungle of acronyms like GRPO and, other techniques. What, what are you using? What are you excited about? What do you think is promising? Speaker 242:01 - 42:17
现在 reinforcement learning 的前沿是什么?看起来好像有一大片 acronym(缩写词)丛林,比如 GRPO 以及其他技术。你们在用什么?你对什么感到兴奋?你觉得哪些方向有前景?
Speaker 142:17 - 43:09
So I can't talk about what we're using, but, like, for example, in the open source world, GRPO seems to be working very well. People used to have different methods like PPO and DPO, and people seem to have really converged to this one. The big difference with other methods is that you do this simple method that I told you about: sampling as many answers as possible, and you say which one is correct. So in some way, GRPO is a very simplistic method. And in general, we saw over and over again in machine learning that the simplest method that you can scale up in terms of compute usually is the one that ends up working the best. And that is kind of what is happening here, at least in the open source world. Speaker 142:17 - 43:09
所以我不能讲我们在用什么,但比如说,在 open source 世界里,GRPO 似乎效果非常好。人们以前会用不同的方法,比如 PPO 和 DPO,而现在看起来大家确实已经收敛到这一种了。它和其他方法的一个很大区别是:你用我刚才说的那种简单方法,尽可能多地 sample(采样)答案,然后判断哪个是正确的。所以某种意义上,GRPO 是一种非常简洁的方法。总体来说,我们在 machine learning(机器学习)里一次又一次地看到,最简单、而且能在 compute(算力)规模上扩展的方法,通常最终效果最好。而这大致就是这里正在发生的事,至少在 open source 世界是这样。
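A rough sketch of the group-relative scoring step usually associated with GRPO in the open literature: sample a group of answers to the same prompt, score each one, and normalize each reward against the group's mean and standard deviation. The speaker gives no formula, so the normalization, the `eps` constant, and the function name are assumptions, and the policy update itself (clipping, KL term) is left out.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group.

    rewards: one scalar per answer sampled for the same prompt,
             for example 1.0 if verified correct and 0.0 otherwise.
    Returns (reward - group mean) / (group std + eps) for each answer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical samples for one prompt: two correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If every sample gets the same reward there is no learning signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

Correct samples get a positive advantage and incorrect ones a negative advantage, so the only ingredients are many samples and a way to score them, which fits the "simple method that scales with compute" remark.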
Speaker 243:09 - 43:27
As you described some of the challenges, a question crossed my mind. You know, you often hear that AI systems are not built, they're grown. Would you characterize it that way as well? In your day to day life, what part is science versus craft, or trying multiple things and then just keeping what works best? Speaker 243:09 - 43:27
你刚才描述了一些挑战,这让我想到一个问题。你知道,人们经常说 AI 系统不是 built(建造)出来的,而是 grown(生长)出来的。你也会这样描述吗?在你的日常工作里,哪一部分更像 science(科学),哪一部分更像 craft(手艺),或者说是反复尝试很多东西,然后把效果最好的留下来?
Speaker 143:27 - 44:08
Yeah, that's a great question. I think how it usually works is that it starts as craft. People just try out many things, and they start building a mental model of what works and what doesn't, and over time we move from this craft land to more science. Science, or more scientific approaches, are rarely the ones that first end up working. It's very rare that you take a really scientific approach and you say, like, this is the optimal thing to do, and you do it and it just works. Speaker 143:27 - 44:08
是的,这是个很好的问题。我觉得通常的过程是,先从 craft 开始。人们就是去尝试很多东西,并开始建立一种 mental model(心智模型),知道什么有效、什么无效。随着时间推移,我们会从这种 craft 的地带走向更 science 的阶段。真正起作用的,往往是那些先“做出来”的东西,而不是一开始就来自某种很科学的方法。很少会出现这样一种情况:你采取一种非常科学的方法,说“这就是最优做法”,然后照着做,它就直接成功了。
Speaker 144:08 - 44:40
Like, there's some sense of alchemy. People just have, like, a good flair for something and they make it work, and then other people, or that same person, start trying to improve what we are doing by being very scientific. And I would say this happens over and over in machine learning. So first craft, then science, and both are really important, but they're different stages of the pipeline. In terms of engineering, this is definitely something that is always necessary. Speaker 144:08 - 44:40
比如说,人们做事总有一点 alchemy(炼金术)的感觉。有人就是对某件事有很好的直觉,然后把它做成了;接着,其他人,或者这个人自己,才开始用非常科学的方式去改进我们正在做的东西。我会说,这种情况在 machine learning 里一次又一次地发生。所以先是 craft,再是 science,两者都非常重要,只是它们处在 pipeline(流程)的不同阶段。至于 engineering(工程)方面,这当然是始终都必不可少的。
Speaker 144:41 - 45:09
So I would say most researchers have moved to being relatively good at, I wouldn't say good engineers, but good at working in complex systems and figuring out what they need to try out. The systems and the infra that we have have become more and more complicated. So definitely the work required changes over time.
所以我会说,大多数 researcher(研究者)现在都已经比较擅长这一点了——至少,我不会说他们是优秀的 engineer(工程师),但他们很擅长在复杂系统里工作,并搞清楚自己需要尝试什么。我们拥有的系统和 infra(基础设施)已经变得越来越复杂了。所以,所需要的工作类型肯定也会随着时间而变化。
Speaker 245:09 - 45:36
Fascinating. Alright. So still on reinforcement learning, and circling back to some of the things you said at the beginning. If I want to make my model better at computer use or agentic coding or whatever domain, then I would spend a particular amount of time doing reinforcement learning specifically for computer use, putting together a dataset, and then coming up with rewards. Is that how it works?
很有意思。好,我们还是留在 reinforcement learning(强化学习)这个话题上,也回到你一开始提到的一些内容。所以如果我想让我的模型在 computer use、agentic coding,或者其他任何领域上表现更好,那我是不是就要专门花一部分时间,为 computer use 这件事做特定的 reinforcement learning,整理相应的 dataset(数据集),然后设计 rewards(奖励)?事情是这样运作的吗?
Speaker 245:36 - 45:42
Like, you just pick one problem and you do reinforcement learning specifically for it?
比如说,你就是挑一个问题,然后专门为它做 reinforcement learning?
Speaker 145:42 - 46:17
To be clear, I talk more about reinforcement learning because this is the part I know best, and it's what I've worked on and pushed on for a long time. We talked about mid training before. All these things are also extremely important, and you can improve the model in different parts of the pipeline. As I said before, the closer you are to the final stage of the model, usually the smaller the scale of the training becomes, so you can iterate fast on that, because now you can iterate in terms of days rather than months.
先说明一下,我之所以更多谈 reinforcement learning,也是因为这是我最了解的部分。这也是我长期一直在推动、一直在做的工作。我们之前谈过 mid training,这些东西同样都极其重要,而且你可以在 pipeline 的不同环节去改进它。正如我之前说的,离模型最终阶段越近,training(训练)的规模通常就越小,因此你可以在那上面快速迭代,因为这时你迭代的单位是几天,而不是几个月。
Speaker 146:18 - 46:46
Usually, people start from this fast iteration loop, and then they go deeper, and they make bigger changes across the entire stack. So this is not to say that only reinforcement learning matters. I'm really not saying that. It's just that that's where people will start making changes, and then those will permeate, and we will go deeper into the stack. So this is how it works, and in the open source world, it's very much like that too.
通常人们会先从这种快速的迭代循环(iteration loop)开始,然后再往更深处走,对整个 stack(技术栈)做更大的改动。所以这并不是说只有 reinforcement learning(强化学习)才重要。我真的不是这个意思。只是说,这往往是人们开始做改动的切入点,然后这些改动会逐步渗透,并继续向 stack 的更深层推进。事情大致就是这样运作的,在 open source(开源)世界里也非常像这样。
Speaker 146:46 - 47:05
I think you see way more post-trained models than you see new pre-trained bases, and you see way more improvements in the algorithm, and that's why we talked about, I mean, GRPO, DPO, PPO. There are so many XPOs, and that's because people can iterate really quickly on this final stage of the pipeline.
我觉得,你看到的 post-trained models(后训练模型)远远多于新的 pre-trained base(预训练基座模型);你看到的算法改进也更多。这也是为什么我们会谈到,比如 GRPO、DPO、PPO。有这么多各种各样的 XPO,就是因为人们可以在这条 pipeline(流程)的最后阶段非常快速地迭代。
Speaker 247:06 - 47:24
And the jagged nature of those models, does that come from this approach of picking this problem and that problem, so that the model is going to be excellent at those problems but not as good at other problems? Or is that a more fundamental characteristic of AI models?
那这些模型那种 jagged(参差不齐、不平滑)的特性,是不是来自这种做法:挑这个问题、挑那个问题来优化,所以它会在这些问题上表现得特别出色,但在别的问题上就没那么好?还是说,这是 AI 模型更根本的一种特征?
Speaker 147:24 - 48:04
There's definitely some of that. For sure, if you optimize more on specific types of problems, you will be better in that setting. I would say my intuition is that it's less about the exact problems that you're optimizing on, and more about the class of problems that you're optimizing on. So for example, if you are really good at math competitions, your model will probably be pretty good at coding competitions. So it's not about the domain, it's more about the skills that are necessary, the way to think, and these horizontal capabilities that you need for performing these tasks.
这方面肯定是有一些影响的。毫无疑问,如果你更多地针对某些特定类型的问题做优化,那么在那种场景下你就会更强。但按我的直觉来说,关键不太在于你优化的具体是哪一道题,而更多在于你优化的是哪一类问题。举个例子,如果你在数学竞赛上特别强,那你的模型大概率在编程竞赛上也会相当不错。所以问题不在于 domain(领域)本身,而更多在于完成这些任务所需要的技能、思维方式,以及那些横向能力。
Speaker 148:05 - 48:21
And that's what I think you are usually seeing. When some model is really bad at something, it's actually bad at that in any domain, in any language. So you have to think, yeah, about this domain and then this generalization of this domain, not necessarily per domain capability. Speaker 148:05 - 48:21
而我觉得你通常看到的就是这种情况。当某个模型在某件事上真的很差时,它其实会在任何 domain(领域)、任何语言里都在这件事上表现很差。所以你的思考方式应该是:关注这个 domain(领域),以及这个 domain 的泛化,而不一定是按单个 domain 去理解能力。
Speaker 248:21 - 48:56
So speaking of generalization, there's been that clear evolution from math and coding success to now starting to cover different areas. So that's the whole GDPval thing where, across the economy, different areas are being evaluated in terms of model performance. Sort of the same question. Is that the result of overall model progress? Or is that a deliberate, okay, now we're going to take this part of the economy and build a dataset for it and do mid training and do post training?
说到 generalization(泛化),之前确实能看到一个很清晰的演进:从数学和编程上的成功,开始扩展到不同的领域。这也就是整个 GDPval 那件事,也就是在整个经济范围内,用模型表现来评估不同领域。还是类似的问题:这主要是整体模型进步的结果?还是一种有意识的做法:比如,好,现在我们要拿经济中的这一部分,专门为它构建 dataset(数据集),做 mid-training(中期训练),再做 post-training(后训练)?
Speaker 248:56 - 49:04
How does that progress work from those very specific domains to generalizing to the rest of the world? Speaker 248:56 - 49:04
这种进展是如何从那些非常具体的领域,逐步走向对世界其余部分的泛化的?
Speaker 149:04 - 50:02
It's definitely something that we actively push on. I think people are realizing, I mean, us and also other companies, that we are moving towards this world where we want to really make products that are useful and improve productivity of people and help people in their day-to-day life. So I think there's a very active move toward deciding what are the domains that we should be prioritizing. Now that we know we have an algorithm that we can apply in different places, what we are constrained by is more collecting the right data and having people who really care about a certain problem work on that problem. But there are not that many people who can do these things, so you really need to prioritize. So yeah, it's a very active, proactive kind of approach here.
这绝对是我们在主动推进的事情。我觉得人们——包括我们,也包括其他公司——都在意识到,我们正在走向这样一个世界:我们真正想做的是有用的产品,去提升人们的生产力,帮助人们的日常生活。所以我认为,现在有一个非常积极的转向,就是去决定哪些 domain(领域)应该被优先投入。既然我们已经知道自己有一种可以应用到不同地方的算法,那么现在更大的约束反而是:收集合适的数据,让那些真正关心某个问题的人去做那个问题。但能够做这些事的人其实并没有那么多,所以你确实必须设定优先级。所以,是的,这里采取的是一种非常主动、前瞻性的方式。
Speaker 150:02 - 50:25
And in general, I would say the performance of the model really depends on, like, the number of people who care about the final output of the model and who are looking at that model. So if they start looking more on specific verticals, these verticals will improve really quickly. But again, we don't have that many of these people that can do these things. Speaker 150:02 - 50:25
总体来说,我会说,模型的表现确实取决于这样一些人的数量:他们关心模型最终输出,并且会实际去看这个模型。所以,如果他们开始更多关注某些特定 verticals(垂直领域),这些领域就会非常快地改进。但话说回来,能做这些事的人并没有那么多。
Speaker 250:26 - 50:53
But to unpack something that you alluded to, I think a minute ago, do models actually generalize more now, especially from a reinforcement learning perspective? So is making a model very, very good at domain A or B likely to make the model better at C, regardless of the amount of effort you put into developing rewards for domain C?
不过,展开说一下你刚才暗示的那一点,我想是一分钟前提到的:模型现在是否真的更会泛化了,尤其是从 reinforcement learning(强化学习)的视角来看?也就是说,把一个模型在 domain A 或 B 上做得非常非常好,是否很可能会让这个模型在 C 上也变得更好,而不管你为 domain C 的 reward(奖励)设计投入了多少开发 effort(工作量)。
Speaker 150:53 - 51:25
So I think there are different axes of generalization. One, there's an algorithmic generalization, and that's really, can I take the algorithm that I developed for domain A and use it for domain B? And again, even talking about the open source world, it really seems that people are able to do that. They take GRPO, they apply it in many different places, and it just works. So that generalization seems to be relatively good, which is why we're seeing a lot of progress.
所以我认为,泛化有不同的维度。其一是 algorithmic generalization(算法层面的泛化),也就是,我为 domain A 开发出来的算法,能不能拿去用于 domain B?再说回 open source 世界,看起来人们确实已经能做到这一点了。他们拿 GRPO,把它应用到很多不同地方,然后它就是能工作。所以这种泛化似乎相对来说是不错的,这也是我们看到大量进展的原因。
Speaker 151:25 - 51:50
Otherwise it would be hard to make progress. Then there's the generalization of the model that is trained on one particular dataset. And this is what I was alluding to before: at least my mental model is that generalization happens in terms of capability. Like, if the capability is the same, you will see generalization across domains. Again, like across different languages, like in coding.
不然的话,进展会很难实现。然后还有一种泛化,是模型在某个特定 dataset(数据集)上训练之后的泛化。而这就是我前面提到的,至少在我的 mental model(心智模型)里,泛化是按 capability(能力)发生的。也就是说,如果能力是一样的,你就会看到跨 domain(领域)的泛化。比如说,多种不同语言,或者 coding(编程)。
Speaker 151:50 - 53:06
Like, you can optimize for C++ coding, to have a good C++ model, with very little RL in C++, partly because this pre-trained model has seen all of C++, and so it already kind of understands the basics of that language. So that type of generalization definitely happens. The generalization that I think is harder is when we don't have these horizontal capabilities. So I'll give you one concrete example: a model that is very intelligent in terms of being correct on competitions. I usually take that example because it's somewhat circumscribed, like math competitions or coding competitions. From a human perspective, people who are good at these things are usually just smart, and someone might think, at least, that if they are smart they can do other things too. But that is really not true, and that type of generalization doesn't really hold, because in many of the expert domains where we need humans working, the world is very messy, while these coding competitions and math competitions are extremely well specified.
比如,你可以针对 C plus plus coding 做优化,做出一个很好的 C plus plus 模型,即使在 C plus plus 上的训练非常少,部分原因是这个 pre-trained model(预训练模型)其实已经见过几乎所有 C plus plus 的内容,所以它某种程度上已经理解了这门语言的基础。因此,这种类型的泛化肯定会发生。我认为更难的泛化,是那些我们并不具备这种 horizontal capabilities(横向能力)的情况。我举一个具体例子。如果我的模型在 competitions(竞赛)里追求正确性方面非常聪明,我经常举这个例子,因为它的范围比较受限,比如数学竞赛、编程竞赛。从人的角度看,擅长这些事情的人通常就是聪明,而如果他们聪明——至少有人会这么认为——那么他们也应该能做别的事情,但事实并非如此,而且这种类型的泛化确实并不成立,因为许多需要人类在 expert domains(专家领域)中工作的事情,世界是非常混乱的,而这些编程竞赛和数学竞赛则是极其明确规定好的。
Speaker 153:06 - 53:47
And you need to have the capability of understanding underspecified tasks, understanding how to deal with the messy world, and understanding what resources you even need to answer the question. Like, if you look at a math competition, you usually have everything in the prompt. You have five lines or maybe 15 lines, and that's all the information that you need to answer the question. In the real world, if I'm a consultant, if I work in finance, I need to go on the Internet. I need to find and extract different information just to understand the situation, before doing any of the reasoning, just to be able to do that reasoning.
你需要具备这样一种能力:理解那些定义不充分的任务,理解如何处理这个混乱的世界,以及理解为了回答这个问题你到底需要哪些资源。比如,如果你看数学竞赛,通常你需要的一切都已经在 prompt(提示词)里了。可能就是五行,或者十五行,里面已经包含了回答这个问题所需的全部信息。但在现实世界里,如果我是 consultant(顾问),如果我在 finance(金融)领域工作,我就得上网。我得去寻找并提取各种不同的信息,只是为了先理解情况;在做任何推理之前,先要做到这一点,之后才谈得上进行推理。
Speaker 153:47 - 54:18
And this type of horizontal capability is the thing that doesn't usually... You generalize if you have that horizontal capability, but in many cases, we don't have that horizontal capability. Yeah, that's why we actually see hallucination in every domain. When you have hallucination in LMs, if a model is really bad at saying that it doesn't know, that usually happens in every single domain. You won't have one domain where the model is extremely calibrated about its knowledge and another domain where it's not.
而这种 horizontal capability(横向能力)通常才是关键——如果你有这种横向能力,你就会泛化;但很多情况下,我们并没有这种横向能力。对,这其实也是为什么我们会在每个领域里都看到 hallucination(幻觉)。当 LMs(语言模型)出现 hallucination 时,如果一个模型非常不擅长承认自己不知道,这通常会发生在每一个领域里。你不会看到这样一种情况:模型在某一个领域对自己的知识校准得极其准确,而在另一个领域却不是这样。
Speaker 254:18 - 54:29
And as a quick detour, is hallucination also a reinforcement learning problem where you reward the behavior to say I don't know when it occurs? Speaker 254:18 - 54:29
顺便快速岔开一下,hallucination(幻觉)是不是也属于 reinforcement learning(强化学习)问题,也就是在模型应该说“我不知道”的时候,对这种行为给予 reward(奖励)?
Speaker 154:30 - 55:37
John Schulman has a great presentation about that, I think from like one or two years ago, where he was saying that if you do behavior cloning, so this SFT that we talked about before, you will basically reward and optimize for hallucination, or at least you could. What will happen is that your model doesn't know about something, but now you say that the right answer is to say that something. So I'll be very concrete. If the model doesn't know about a paper, and now in a ground truth answer given by a human you say, here's where I got the information, and then you cite that paper, what you're actually optimizing the model to do is cite something that, as far as it knows, doesn't exist, because it doesn't know that that paper exists. So John Schulman had this great presentation saying SFT is going to force hallucination, while in reinforcement learning, given that, as I said, you sample from the model in the first place, it's extremely unlikely that it samples something that it doesn't know and that is correct.
John Schulman 有一个关于这件事的很棒的演讲,我记得大概是一两年前的。他当时说,如果你做 behavior cloning(行为克隆),也就是我们之前谈到的这种 SFT,那么你基本上就是在奖励并优化 hallucination(幻觉),或者说你有可能会把模型优化到更容易产生 hallucination。因为会发生这样的情况:如果模型本来不知道某件事,但你现在却告诉它,正确答案就是要那样说。我举个非常具体的例子。如果模型不知道某篇 paper,而你给它的答案——也就是由人类提供的 ground truth answer——里写着“这是我获取信息的来源”,然后还引用了那篇 paper,那么你实际上是在优化模型去引用一个它并不知道存在的东西,因为它根本不知道那篇 paper 存在。所以 John Schulman 有一个很精彩的演讲,大意是说,SFT 会迫使模型产生某种 hallucination;而在 reinforcement learning(强化学习)里,鉴于正如我刚才所说的,你一开始就是从模型本身进行采样,它极不可能采样出某个它并不知道、但又恰好正确的东西。
Speaker 155:37 - 56:04
That's extremely unlikely, so you will never reward that behavior. You will only sample things that it doesn't know and that are incorrect, and then you will kill that behavior. So hallucination, at least the intuition that people have, is that it can come, for example, from SFT, and it can come from this post-training pipeline. But if you have a good reinforcement learning pipeline, that shouldn't happen too often.
这种情况极其不可能,所以你永远不会奖励那种行为。你只会采样到那些它不知道、而且还是错误的内容,然后你会把这种行为压掉、消灭掉。所以对于 hallucination,至少人们的直觉是,它比如说可能来自 SFT,也可能来自这种 post-training pipeline(后训练流程)中的这部分。但如果你的 reinforcement learning pipeline 做得好,这种事就不应该太常发生。
Speaker 256:04 - 56:24
And going back to generalization as well. Are there examples where actually getting better at one domain makes the model worse at the rest? A little bit like what you were saying: some people are very good at math, some people are very good at English. Pretty often, they're not the same people.
还有,回到 generalization(泛化)这个话题。有没有这样的例子:实际上,当模型在某一个 domain(领域)上变得更好时,它在其他方面反而变差了?这有点像你刚才说的,有些人很擅长 math,有些人很擅长 English。很多时候,这并不是同一批人。
Speaker 156:24 - 56:53
In domains, usually not. What will happen, though, is you will make decisions about which domain to optimize for. If you optimize for one domain, you will be able to optimize less for another one. It's not necessarily that optimizing for one thing will make the other one worse. It's just that, as a result, you can optimize less for the other one, because you're compute constrained, you're data constrained, and you also have your human bottleneck in terms of that work.
在具体 domain 上,通常不会。不过会发生的是,你必须根据我们要优化哪个 domain 来做决策。如果你优化某一个 domain,你能分配给另一个 domain 的优化空间就会更少。倒不一定是说,优化一件事会让另一件事变得更差。只是结果上,你没法对另一件事优化那么多,因为你受 compute(算力)限制、受 data(数据)限制,而且在这类工作上还受到 human bottleneck(人工瓶颈)的限制。
Speaker 156:53 - 57:50
What does happen is that you can have a negative kind of generalization, like bad generalization or negative transfer, more for these horizontal aspects of the model. I'll give you a very concrete example: explicit instruction following versus implicit instruction following. We often hear, for example, that OpenAI models tend to be really good if you tell them exactly what you want, but as a result, sometimes we also hear that they're less good if you are not as specific about what you want. For example, if I say, "Change this file," and I make a typo in the file name, a model that is extremely good at explicit instruction following will change the wrong file, the one named by the typo, but humans would probably realize that you made a typo.
不过确实会发生的是,在模型这些更“横向”的方面上,你可能会出现负向的 generalization,比如糟糕的泛化或者 negative transfer(负迁移)。我举一个非常具体的例子:explicit instruction following(显式指令遵循)和 implicit instruction following(隐式指令遵循)。如果我有一个模型——比如我们经常听到别人评价 OpenAI 的模型时会说——只要你把自己想要什么讲得非常准确,它们往往就特别强;但作为结果,我们有时也会听说,如果你的表达没那么具体,它们反而没那么好。比如,如果我打错字了,我说“Change this file”,而我在 this file 这个地方打错了,那么一个在 explicit instruction following 上极其优秀的模型,就会去改那个写错名字的文件;但人类大概率会意识到你是打错字了。
Speaker 157:52 - 58:05
As a result, there are cases where this explicit instruction following goes against this, like, implicit instruction following. So you will have cases where, basically, these horizontal, capabilities go against each other. Speaker 157:52 - 58:05
因此,确实会有一些情况,explicit instruction following 会和这种 implicit instruction following 相冲突。所以你会看到,基本上这些“横向”的 capabilities(能力)之间会彼此对冲。
Speaker 258:05 - 58:26
And maybe to close on this whole, reinforcement learning, conversation. So is your sense that as we progress from being excellent at coding and excellent at math and move to the rest of the economy, do you think that the rest of the economy is a tractable problem? Do you think we can get to the same level of performance ultimately? Speaker 258:05 - 58:26
也许作为我们这一整段关于 reinforcement learning 讨论的收尾。你的感觉是不是这样:当我们从“非常擅长 coding、非常擅长 math”继续推进到经济中的其他部分时,你觉得经济中其余部分是一个 tractable problem(可处理的问题)吗?你觉得我们最终能达到同样水平的表现吗?
Speaker 158:26 - 58:43
Yes. But. I was like, yes, we can. I don't think there's anything like really deeply special about these domains where we cannot optimize, and where we couldn't get the same with other domains. The but is for at least two reasons. Speaker 158:26 - 58:43
能。但是——我本来想说,能,我们可以做到。我不觉得这些领域有什么特别深层、特别特殊的地方,以至于我们无法优化,或者无法在其他领域达到同样的效果。这个“但是”至少有两个原因。
Speaker 158:43 - 59:31
The first one is that most of the people working on these models are pretty good at coding, and they really care about coding because that's what they use as their daily drivers, and there's nothing better than the user also being the one who trains the model, because then they understand the issues. For me, for example, it's very hard to really understand what we should change on the legal aspects of the model if I don't understand anything about the legal domain. So that's one thing. The other thing that you will often hear about, and I mentioned it briefly before, is this kind of verifiable rewards. There are domains where it's easier to say whether something is correct or not.
第一个原因是,大多数从事这些模型工作的人都很擅长 coding,而且他们非常在意 coding,因为那正是他们日常主力使用(daily driver)的场景。没有什么比“用户本身也是训练模型的人”更好的了,因为这样他们就真正理解其中的问题。比如对我来说,如果我对 legal(法律)领域一无所知,就很难真正理解我们应该如何调整模型在法律相关方面的表现。这是一点。另一点是你会经常听到的,我之前也简要提过,就是这种 verifiable rewards(可验证奖励)。有些领域更容易判断一件事到底是对还是不对。
Speaker 1 | 59:32 - 1:00:08 For example, in the case of cyber, like you mentioned before, cyber capabilities in models have been improving a lot, and this is because in cyber, it's extremely easy to say whether you are correct. Like, is the cyber issue that you found a real issue or not? It's very easy to test. And so there are domains where reinforcement learning is just easier to apply, but there's nothing, I would say, in the capacity of the model that is preventing the model from being as good at legal and medical and other domains.
例如,在 cyber(网络安全)这个场景里,就像你之前提到的,模型在 cyber 能力上提升了很多,这是因为在 cyber 领域,判断对错极其容易。比如,你找到的 cyber 问题到底是不是真问题?这一点非常容易测试。所以有些领域本来就更容易应用 reinforcement learning(强化学习),但我不认为是模型能力本身限制了它在 legal、medical(医疗)以及其他领域做到同样优秀。
Speaker 1 | 1:00:09 - 1:00:19 So the short answer is we know less about these domains, and definitely there are some domains that are easier to optimize for in reinforcement learning.
所以简短的回答是:我们对这些领域了解得更少;而且,确实有一些领域在 reinforcement learning 中更容易被优化。
Speaker 2 | 1:00:19 - 1:00:29 Great. Let's talk about evals for a minute. That's a hugely important topic. Maybe to start, why is it so hard to evaluate a model in the first place?
很好。我们来聊聊 evals(评估)这个话题。这是一个极其重要的主题。也许先从这里开始:为什么评估一个模型本身会这么难?
Speaker 1 | 1:00:30 - 1:01:16 Evaluation has become harder and harder as models become better, and that's because the tasks that we ask of the model become more and more general and more and more open ended. Now I may just say, build me a website that does x, while in the past I would just be like, hey, is there a specific bug in this implementation that you have? And it's much easier to say whether there's a bug, because I can have a human that says here are all the bugs that you have, and then you can apply that automatically. While the website one is very hard, because it's hard to know what the optimal answer is when there are many good answers.
随着模型变得越来越强,evaluation(评估)也变得越来越难。这是因为我们交给模型的任务越来越通用,也越来越 open-ended(开放式)。现在我可能只会说,给我做一个实现某个功能的网站;而以前,我可能只是说,嘿,这个实现里有没有某个特定 bug?而判断有没有 bug 就容易得多,因为我可以提取出来、可以明确知道问题,也可以让一个人类告诉你“这里是你所有的 bug”,然后你就能自动化应用这个标准。但“做一个网站”这类任务就很难知道什么才是最优答案,因为好的答案有很多种。
Speaker 1 | 1:01:16 - 1:02:00 There are many good ways of building a certain website. This open-ended nature of the tasks really makes evals harder. There's also another issue, that models on specific axes are becoming better than the majority of humans, and so we have fewer and fewer humans that can actually evaluate these models on particular axes, so that's definitely a constraint. Another one, to be honest, is kind of cultural. Most people want to improve the model, and they think that the best way to do that is training the model, when in reality finding issues and making sure that we can quantify improvements is just as important, if not more important, but there's always this cultural gap.
构建某个网站本来就有很多好的方式。模型这种 open-ended 的特性,确实让 evals 更难。还有另一个问题:模型在某些特定维度上已经比大多数人类更强了,所以真正能在这些特定维度上评估模型的人越来越少,这显然是一个限制。还有一个原因,老实说,有点是文化上的。大多数人都想提升模型,他们认为最好的方式就是去训练模型;但实际上,发现问题并确保我们能够量化改进,同样重要,甚至可能更重要。不过这中间一直存在一种文化上的落差。
Speaker 1 | 1:02:01 - 1:02:30 That was especially true, I would say, in the academic world up to, like, two years ago, when evals were always fixed, benchmarks were always fixed, and even datasets were kind of always fixed, maybe, let's say, four years ago. And there was a mentality shift of, like, okay, data is actually critical, and now there are a lot of people working on data. And I think evals are still not quite there. Everyone knows that it's important, but people don't really understand how impactful it could be to work on evals.
我会说,这一点在 academic world(学术界)里尤其明显,至少在两年前还是这样:那时 evals 往往是固定的,benchmarks(基准)往往是固定的,甚至 datasets(数据集)在更早些年——比如四年前——也基本是固定的。后来大家的观念发生了转变,开始意识到:数据其实至关重要。现在已经有很多人在做数据相关工作了。但我觉得 evals 这件事还没有真正跟上。大家都知道它重要,但人们并没有真正理解,做 evals 这件事可能会产生多么大的影响。
Speaker 1 | 1:02:30 - 1:02:45 So actually, for my first project at OpenAI, when I just came in, I was like, I want to work on data and evals, because I know that this is the thing that no one is working on. And as a result, I know that it's super impactful to work on that. And, yeah, the tide is shifting, but not fast enough.
所以其实,我刚加入 OpenAI 时做的第一个项目,就是想做 data 和 evals,因为我知道这是几乎没人做的事情。因此我也知道,做这件事的影响会非常大。是的,风向正在变化,但变化得还不够快。
Speaker 2 | 1:02:46 - 1:03:01 And the pace of progress in model as a judge and AI evaluating AI, is that moving as fast? Is that a distinct part of research, or is that fundamentally the same idea or the same techniques?
那么,在把 model 当作 judge,以及用 AI 评估 AI 这件事上,进展速度也一样快吗?这是一个独立的研究方向,还是说它在根本上是同一个想法、同一套技术?
Speaker 1 | 1:03:01 - 1:03:34 It's really fundamentally the same method. Also, most of the things that we do in evals, especially now that we have reinforcement learning, could just be applied nearly exactly as is during training. So that's another reason, actually, why evals are so complicated: every time you build an eval, you actually build a way to build training datasets. So now you're going to optimize on that training dataset. Well, even if it's not that eval, it's going to be the same type of data, and now you're going to do super well because we have this generalization of capabilities that I was telling you about.
从根本上说,这其实是同一种方法。基本上没有什么不同,而且我们在 evals(评测)里做的大多数事情,尤其是现在有了 reinforcement learning(强化学习)之后,几乎都可以原封不动地应用到训练阶段。所以这其实也是为什么 evals 会这么复杂的另一个原因:因为你每做出一个 eval,本质上也就在构建一种生成训练数据集的方法。接着你就会去优化这个训练数据集。即使不一定是专门针对那个 eval,数据类型也会是同一类,然后模型就会表现得特别好,因为我们具备了我刚才跟你说的那种能力泛化。
Speaker 1 | 1:03:34 - 1:04:19 You will learn that on that other dataset, and now you'll become really good at that eval, and that eval will become obsolete really quickly. So that's also an issue with evals. But, yeah, to come back to your question, model as a judge is really important, and I think it's probably one of the most important things, because as we get better models, we have this self-reinforcing loop, this capability flywheel where better models become better teachers for other models. And this is really important for training, but then you can also do the same thing for evaluation. So a lot of my team works on that, and I think it's really critical to work on this model-as-a-judge kind of framework.
你会在另一个数据集上学到这种能力,于是就会在那个 eval 上变得非常擅长,而那个 eval 也会很快过时。所以这也是 evals 的一个问题。不过,回到你的问题上,model as a judge 真的非常重要,我认为它大概是最重要的事情之一。因为随着模型变得越来越强,我们会形成一种自我强化的循环,也就是一种能力飞轮:更好的模型会成为其他模型更好的老师。这对训练非常重要,但同样的事情你也可以用在评估上。所以我的团队里有很多人在做这个,我认为围绕这种 model as a judge 的 framework(框架)去开展工作是非常关键的。
Speaker 2 | 1:04:21 - 1:04:40 Okay. Fantastic. Alright. So as we get towards the end of this conversation, I'd love to zoom out a bit and get your sense for where things might be heading. Obviously, it's incredibly hard to make predictions on AI years out, but let's call it the next twelve, eighteen, maybe twenty-four months.
好的,太棒了。那随着这场对话接近尾声,我想把视角稍微拉远一点,听听你对未来走向的判断。显然,要对 AI 几年后的情况做预测是极其困难的,但我们不妨把范围限定在接下来的 12、18,或者 24 个月。
Speaker 2 | 1:04:40 - 1:04:49 Is your sense that things are going to continue progressing, or are we heading towards something that could feel more like a discontinuity?
你的感觉是,事情会继续这样逐步推进,还是说我们正走向某种更像“不连续跃迁”的状态?
Speaker 1 | 1:04:49 - 1:05:18 In terms of progress, as I was saying before, I think it's always continuous. Now, the feeling of discontinuity will happen. It did happen three or four months ago with coding, and I think that will happen now in every other domain. Most people are not yet feeling the capability and the usefulness of our models the same way that coding and software engineering are feeling it right now. So this will definitely permeate, I think, through many other verticals.
就进展而言,像我之前说的,我认为它总体上一直是连续的。不过,那种“不连续”的感受还是会出现。三个月前,或者四个月前,在 coding(编程)领域就发生过这种情况;而我认为,接下来这种感受也会出现在其他各个领域。现在,大多数人对我们模型的能力,以及这些模型的实用价值的感受,还不像 coding 和 software engineering(软件工程)领域此刻这么强烈。所以我认为,这种影响肯定会渗透到许多其他 verticals(垂直行业)中。
Speaker 1 | 1:05:19 - 1:05:48 Now in terms of capability bumps in, let's say, verticals that we're already looking at, I think it will be more continuous, and there will rarely be big discontinuities. Most of them are local discontinuities, but you zoom out and it just feels pretty smooth. It's not always like this, but that has been the case most of the time, and I definitely cannot predict when the next big discontinuity will be.
至于能力上的跃升,如果看的是我们已经在关注的那些 verticals(垂直领域),我认为它会更连续,不会出现真正巨大的不连续。大多数所谓的不连续,其实都只是局部的不连续;但如果你把视角拉远来看,整体感觉始终还是相当平滑。当然也不总是这样,不过大多数时候确实如此,而我肯定也无法预测下一次大的不连续会在什么时候出现。
Speaker 2 | 1:05:48 - 1:06:09 What is your sentiment on this general concept of, accelerating loops in AI? So whether that's continual learning to make models more current and able to learn faster to this broader concept of AI building AI, like in an increasingly automated way, fact versus fiction. And what are you excited about?
你怎么看 AI 中这种“加速循环”的总体概念?比如 continual learning(持续学习),让模型保持更强的时效性并且学习得更快;再到更广义上的 AI building AI,也就是以越来越自动化的方式由 AI 来构建 AI。这到底更接近事实还是科幻?以及,哪些方面最让你兴奋?
Speaker 1 | 1:06:09 - 1:06:48 I'm extremely excited about continual learning. I think we haven't quite cracked it. I mean, we have, like, Codex memories, and that is helpful, but it's definitely not the end state. I have a friend who always tells me about another type of plot that we should be looking at, which is x axis time, y axis the utility that you provide to users, or basically the usefulness of the models. And right now, actually, most models at day zero, if you just drop them in a company, are arguably more useful than most new employees.
我对 continual learning(持续学习)非常兴奋。我觉得我们还没有真正把它攻克。我的意思是,我们现在有点像 codex memories 这样的东西,这确实有帮助,但肯定还不是最终形态。我有个朋友总会跟我提,我们还应该关注另一种图:x 轴是时间,y 轴是你给用户提供的 utility(效用)。或者说,本质上就是模型的 usefulness(有用性)。而现在其实大多数模型,如果你在第零天直接把它们放进一家公司里,可以说它们已经比大多数新员工更有用。
Speaker 1 | 1:06:48 - 1:07:35 So they start higher at t zero, but then across time they are mostly constant. They don't really learn company knowledge, and they don't really learn to be more efficient over time at the things that they are doing, while humans learn really quickly. What is important is the integral, the area under these curves, and as a result, I think humans are still more useful in many cases. That's why what we need from continual learning is to make this curve monotonically increasing over time, and basically make models more and more useful the longer they work in a certain environment. So I'm extremely excited about it. I'm actually surprised that we're not quite there yet.
所以它们在 t zero 时起点更高,但随着时间推移,基本上是恒定的。它们并不会真正学会公司的知识,也不会随着时间推移在自己所做的事情上变得更高效;而人类学习得非常快。重要的是这些曲线下面积所代表的积分面积,因此我认为,在很多情况下,人类依然更有用。这也是为什么我们需要做到 continual learning:让这条曲线随着时间单调上升,基本上让模型在某个环境里工作得越久,就变得越有用。所以我对此非常兴奋。实际上,我也很惊讶我们到现在还没有真正做到这一点。
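To make the area-under-the-curve argument concrete, here is a small sketch that compares a flat utility curve (a model that starts high but does not learn) with a rising one (a new hire who starts low but improves). All numbers and function names are made up for illustration; nothing here is measured data.

```python
def area_under_curve(utility, days):
    """Sum daily utility over `days` days, a discrete stand-in for the integral."""
    return sum(utility(t) for t in range(days))

def model_utility(t):
    # Hypothetical: useful from day zero, but flat over time.
    return 5.0

def human_utility(t):
    # Hypothetical: starts lower, improves a little every day.
    return 1.0 + 0.1 * t

short_horizon = (area_under_curve(model_utility, 30), area_under_curve(human_utility, 30))
long_horizon = (area_under_curve(model_utility, 100), area_under_curve(human_utility, 100))
# With these made-up numbers, the flat curve leads over 30 days
# and the rising curve leads over 100 days.
```

The crossover point depends entirely on the assumed starting levels and learning rate; the sketch only shows why a curve that keeps rising eventually wins on total utility.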
Speaker 1 | 1:07:37 - 1:08:00 Three years ago when ChatGPT came out, I remember I was doing a startup with friends, and we were thinking about working on continual learning and personalization and memories in general. We were like, OpenAI is going to do that in the next six months. They have all the data, they're going to figure it out, and they have all the users, and their models are going to learn super quickly from users. And three years later, I don't think we're there yet.
三年前 ChatGPT 刚出来的时候,我记得我在和朋友一起做一家 startup,当时我们就在考虑做 continual learning、personalization(个性化)以及更广义上的 memories(记忆)。我们当时想,OpenAI 接下来六个月肯定就会把这个做出来。他们有所有数据,他们会把它搞定,而且他们有所有用户,他们的模型会非常快地从用户那里学到东西。但三年过去了,我觉得我们还没有走到那一步。
Speaker 2 | 1:08:01 - 1:08:06 And quickly, in layman's terms, what is the fundamental difficulty?
那快速用外行能懂的话来说,根本性的困难是什么?
Speaker 1 | 1:08:06 - 1:08:40 It's a good question. I actually don't quite know, to be completely honest with you. I don't quite know why it's taking us that long to figure it out. It's the type of domain where I think if we really put enough resources behind it, we would figure it out. Of course, when we talk about this memory inside of a company, there are big questions about permissions, and there are a lot of questions about privacy and what you can share and what you cannot across models, across users, sorry.
这是个好问题。老实说,我其实并不完全知道。我不太清楚为什么我们花了这么久还没把它解决。这类领域我觉得如果我们真的投入足够多的资源,是可以搞明白的。当然,当我们谈论公司内部的这种 memory(记忆)时,就会有很大的 permissions(权限)问题,也有很多关于 privacy(隐私)的问题,比如哪些东西可以分享,哪些不可以,在模型之间、用户之间——抱歉,是在用户之间。
Speaker 1 | 1:08:40 - 1:08:49 But for a single user, even for a single user, we're not quite there, and I don't quite know why. At least at the high level that I can talk about, I don't know why.
但即便是针对单个用户,哪怕只是单个用户,我们也还没真正做到,而我也不太知道为什么。至少在我现在这个能谈的高层次上,我不知道原因。
Speaker 2 | 1:08:50 - 1:09:22 Yeah. What you bring up is I think really interesting for AI builders and investors and startups, which is this question of the models getting increasingly smarter within an enterprise. And in particular, there's this whole tension between what the models are able to do and what a lot of people have built around the model. So, you know, a year or two ago, it was RAG. These days, it's all about harnesses for agents.
对。你提到的这一点,我觉得对 AI builder(AI 构建者)、investor(投资人)和 startup 来说都非常有意思,也就是模型在企业内部是否会变得越来越聪明这个问题。尤其是,这里面一直存在一种张力:一边是模型本身能做什么,另一边是很多人围绕模型额外搭建出来的东西。比如你知道,一两年前流行的是 RAG;这些天大家谈的都是给 agents(智能体)配套的 harnesses(控制框架/运行框架)。
Speaker 2 | 1:09:22 - 1:09:35 And a lot of people are wondering whether the models are going to end up eating the harness, whether the harness is just a temporary thing. From your perspective, what do you think happens?
而且很多人都在想,模型最终会不会把 harness 吃掉,也就是 harness 会不会只是一个临时性的东西。从你的视角看,接下来会发生什么?
Speaker 1 | 1:09:35 - 1:10:24 Yeah. I think harnesses can really improve the capability of a model right now. I think given that we're seeing this really fast progress in terms of capability, I personally wouldn't push that much on the harness unless the harness is something for a very concrete goal that you're trying to achieve right now. Certain companies, if they are focused on a specific vertical, they want to go from this 80% reliability to 85%, and harnesses will give them that. I think that's very important, but they need to do it while knowing that they will have to retune that harness in the future, and I think that's totally fine.
对。我觉得 harness 现在确实能显著提升模型的能力。考虑到我们正在看到能力层面非常快速的进展,就我个人而言,除非这个 harness 是为了你当下想实现的一个非常具体的目标,否则我不会在 harness 上投入太多。某些公司如果专注于某个特定 vertical(垂直领域),它们想把可靠性从 80% 提高到 85%,而 harness 能帮它们做到这一点。我觉得这非常重要,但它们需要在做这件事时清楚:未来还得重新调这个 harness,而我觉得这完全没问题。
Speaker 1 | 1:10:26 - 1:11:23 If you try to have a general harness that will sustain over time, I don't think that will work. Harnesses for specific domains, as a short-term thing that you need to do: I think there will always be so much you can do in harnesses, and if anything, I think everyone should do more of that if they have a specific problem in mind, because we're leaving so much on the table without a good harness. Arguably, if we froze the models that we have right now and you really worked on the harness, and maybe we also spent more time training with a great harness, I think people would really feel the AGI in every single domain, or could already feel that in every single domain. But given that we're not freezing it and we're going to continue training better and better models, I think we don't really understand what the final harness will be, and it will always change.
如果你想做一个能长期持续适用的通用 harness,我觉得那行不通。针对特定领域的 harness,作为你短期内必须做的事情,我认为永远都会有很多可做的空间;如果真要说,我觉得只要心里有一个具体问题,每个人都应该在这方面做得更多,因为没有好的 harness,我们其实浪费了很多潜力。甚至可以说,如果我们把现在手头这些模型冻结住,然后你真的把精力放在 harness 上,也许再花更多时间用一个很棒的 harness 来训练,我觉得人们会在每一个领域都真正感受到 AGI,或者说已经能在每一个领域感受到它了。但鉴于我们并不会把模型冻结住,而是会继续训练越来越好的模型,我觉得对于 harness,我们其实并不真正知道最终形态会是什么;而且它也会一直变化。
Speaker 2 | 1:11:23 - 1:12:09 Same question about applications. So we alluded to your progress in different verticals, and that was, you know, GDPval in general, but also Tau2-bench Telecom, which does complex customer service workflows, and then progress against finance agents, 88.5% of internal investment banking modeling tasks, and then 51.1% on Office QA Pro. So bit by bit, you're doing more and more of this. So do you think people should be building applications anymore, or ultimately, as we get closer to AGI, is all of this going to be part of the model capabilities?
关于应用(applications)我也有同样的问题。我们刚才提到过你们在不同 verticals(垂直领域)里的进展,比如总体上的 GDPval,还有 Tau2-bench Telecom(它处理复杂的客户服务工作流);然后还有在 finance agents(金融 agent)上的进展:完成了 88.5% 的内部投资银行建模任务,以及在 Office QA Pro 上达到了 51.1%。所以你们是在一点一点做越来越多这类事情。那么你觉得,人们现在还应该继续构建 applications 吗?还是说,随着我们越来越接近 AGI,这一切最终都会成为模型能力本身的一部分?
Speaker 1 | 1:12:10 - 1:12:59 There's so much space for external companies or startups pushing on specific verticals. I think there's so much space for that. The reason why is because a lot of people think about "intelligence," in quotation marks, and raw capability as being the real bottleneck, but I don't think that's true. I think most of the time, the bottleneck is the last mile. It's making sure that the model has the right permissions and has access to the right connectors and things like this. We are going to be very focused on the general aspect, and I think there are other companies that should be focused more on the verticals and on getting maximum value out of what we currently have.
在推动特定 vertical(垂直领域)这件事上,外部公司或者 startups 还有非常大的空间,我觉得这方面空间非常大。原因在于,很多人会把带引号的“智能”和某种原始能力看成真正的瓶颈,但我不觉得这是事实。我认为大多数时候,瓶颈其实在最后一公里。也就是要确保模型拥有正确的权限,或者也能访问到正确的 connectors(连接器)以及诸如此类的东西。我们会非常专注于这种更通用层面的事情,而我认为其他公司则应该更多专注于 verticals,并把我们当前已有能力的价值最大化。
Speaker 1 | 1:13:00 - 1:13:21 So I think there will always be a lot of space left for this last mile in different verticals, and I would highly encourage people to continue working on that. And maybe one day when we stop making horizontal progress, which I don't think is anytime soon, maybe we will stop focusing on that. But, yeah, that's not what we're doing now.
所以我觉得,在不同 verticals(垂直领域)里,这种最后一公里的问题始终都会留下很大的空间,我会非常鼓励大家继续做这方面的工作。也许有一天,当我们不再取得 horizontal progress(横向进展)时——不过我觉得那一天短期内还不会到来——也许我们才会不再聚焦这件事。但对,至少现在我们不是这么做的。
Speaker 2 | 1:13:22 - 1:13:31 Okay. Well, that feels like a very optimistic note to end on, at least for the startup ecosystem. Thank you so much, Jan. This was terrific. Really enjoyed it.
好的。至少对 startup ecosystem(创业生态)来说,这听起来像是一个非常乐观的收尾。非常感谢你,Jan。这次交流太棒了。我真的很享受。
Speaker 2 | 1:13:31 - 1:13:33 Thank you so much for spending time with us.
非常感谢你抽时间和我们聊。
Speaker 1 | 1:13:33 - 1:13:35 Great, thanks Matt!
太好了,谢谢你,Matt!
Speaker 2 | 1:13:36 - 1:13:56 Hi, it's Matt Turk again. Thanks for listening to this episode of the MAD Podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already or leaving a positive review or comment on whichever platform you're watching this or listening to this episode from. This really helps us build a podcast and get great guests. Thanks and see you at the next episode!
嗨,又是 Matt Turk。感谢你收听这一期 MAD Podcast。如果你喜欢这期内容,而你还没有订阅的话,我们会非常感激你考虑订阅,或者在你观看或收听这一期节目的平台上留下积极的评价或评论。这真的能帮助我们把这个 podcast 做起来,并邀请到很棒的嘉宾。谢谢,我们下期节目再见!