🎙 Podcast: The MAD Podcast with Matt Turck · June 4, 2026 · 9,298 words · about 46 minutes

OpenAI's Dan Roberts: Why AI Can Now Make Discoveries

Speaker 1 00:00 - 00:23

One of the things that ChatGPT was able to do was assume it was false. When you go against the grain and do something contrarian like that, you really have to have strong conviction in what you're doing in order to persevere down a really long calculation path. I feel really excited that we will get to really answer a lot of fundamental questions in the field of science that we care about, with the aid of the models or with the models being the driving force. And so that's just really thrilling.

Speaker 2 00:23 - 01:03

Hi, I'm Matt Turck. Welcome to the MAD Podcast. It's been yet another extraordinary few days in AI, with OpenAI, DeepMind, and Anthropic cracking some of the most famous long-unsolved questions in mathematics, known as the Erdős problems, a moment many view as a stunning breakthrough and yet another signal that AI is moving from doing the work we ask of it to autonomously making deep scientific discoveries. To unpack the moment and the fundamental advances in model reasoning that make it possible, I'm excited to welcome Dan Roberts, a top AI researcher at OpenAI who comes from a deep background in theoretical physics and has a particular interest in the intersection of science and AI.

Speaker 2 01:03 - 01:20

In this conversation, we go deep on what reinforcement learning is, why it's the most important paradigm in AI right now, and what's ahead for AI and science. Please enjoy my conversation with Dan Roberts. Hey, excited to do this. Thanks for taking the time.

Speaker 1 01:20 - 01:21

Of course. Very happy to be here.

Speaker 2 01:21 - 01:31

You are the lead of the Foundations of Reinforcement Learning team at OpenAI. So what does that mean? What does the name mean?

Speaker 1 01:31 - 02:01

The larger team that we're on is called Foundations, and we think about reinforcement learning. So, very boring: Foundations of Reinforcement Learning. But the team comes from a mandate of thinking about the science of reinforcement learning. And a long time ago, which in AI-speak is like six months ago, maybe a year, I guess now two years, so before we released o1 and thinking, reasoning models, we were studying this internally.

Speaker 1 02:01 - 02:31

And one of the advantages to being first, or at least to being forced to go first, and spending a lot of resources on scaling things up is that you can empower a group of people to not just work on making the thing work, but work on understanding how it works. And then beyond that, how do we scale? How should we think about scaling reinforcement learning versus scaling pretraining? So what do scaling laws look like? But then going beyond that, what sort of things does this kind of training teach us?

Speaker 1 02:31 - 02:56

What doesn't it teach us? We're very interested in exploratory scenarios at the frontier. How do we either improve or better understand what reinforcement learning is doing? We have all this compute that, famously, we are in the process of acquiring, and we would like to turn that compute into intelligence. And to do that, we need to make thinking models.

Speaker 1 02:56 - 03:08

And somewhere along the way, we interact with that process, usually at the earlier stage, for models that are, you know, not the next model, but things that are like the next model or the next-next model.

Speaker 2 03:08 - 03:16

Great. And, quickly, what was your path to OpenAI? So how did you go from studying physics to being where you are today?

Speaker 1 03:16 - 03:45

I did a PhD in theoretical physics at MIT, thinking about the intersection of quantum gravity and quantum information. I thought a lot about black holes and quantum chaos, the kind of thing where you ask: what if you throw something into a black hole? What happens to the information? Does it come out? If we think about black holes as computers, how fast are they? I was very interested in this fundamental question in theoretical physics, which is: how do you find a quantum theory of gravity?

Speaker 1 03:46 - 04:13

I also got very interested in this interplay between computation and the laws of physics. You know, any computer exists in the universe and behaves according to physical law. So the sort of computations you can do are bounded by the laws of physics, and there's some sort of interesting relationship there. Black holes are pretty interesting because they sort of saturate some conjectured bounds around the processing of information. And from there, I did a postdoc at the Institute for Advanced Study.

Speaker 1 04:14 - 05:03

And at around that time (I'm pretty old now, at least for this field, so that was about 2016), the DQN Atari paper from DeepMind had come out in 2015, and then AlphaGo was in 2016. And I got very excited about the possibility that machine learning, and then deep learning, was a statistical science that lived in a similar framework to the sort of frameworks that we use to, like, study the rest of the universe. And so there's always this question of, you know, how does everything work? This three-year-old's question of "I'm curious about everything." And if you look outward and you care enough, you end up in philosophy, maybe; if you're quantitative, you end up in physics. A very, very crude characterization.

Speaker 1 05:03 - 06:24

An AI that works is very fascinating, or AI systems that work are fascinating, because they are simple examples that do things that humans do. And then if it lives in the same framework that we use to understand everything else, then, you know, you can draw the parallels between how does the universe work and how do I work, or how does intelligence work. And so I got extremely interested in AI and deep learning. Then I went to FAIR, Facebook's AI research lab, at around 2017 to basically try and use the tools from theoretical physics to understand deep learning. Deep learning was supposed to be this really difficult thing that you couldn't understand, and I thought maybe the tools of physics could be helpful. This actually culminated in a book that I wrote with a collaborator, Sho Yaida, who is still a collaborator of mine, now at OpenAI working on the same thing. We wrote this book, The Principles of Deep Learning Theory, that was a culmination of this set of ideas: can we use the statistical ideas for understanding statistical systems? Like the gas in the room: we can characterize it with some simple laws of thermodynamics, the ideal gas law, and maybe we can make similar progress in understanding deep networks.

Speaker 1 06:24 - 07:01

So that was sort of my transition. I also had a startup along the way, and spent some time at Sequoia Capital as an entrepreneur in residence. So there's some tension between: am I a scientist or an entrepreneur? But about two years ago, after thinking about whether I wanted to start another AI company, I realized that the thing that was most exciting right now was what was happening at the frontier: there's some amazing scientific progress happening in AI, and to really get at the questions and understand what's going on, you need to be there and you need to participate. And that meant joining a lab.

Speaker 1 07:01 - 07:04

So I joined OpenAI two years ago.

Speaker 2 07:04 - 07:27

Great, thank you for that. Where do you think we are in the evolution of AI being increasingly able to solve difficult scientific problems? I mean, it's certainly something that we've been talking about as an industry for a while now, but it seems to be accelerating, perhaps just like everything else in AI. But where do you think we are?

Speaker 1 07:27 - 08:13

I think one of the interesting things is that this process is smooth. There's no sharp point, or I don't think there will be a sharp point, where we'll say that systems went from not being useful for the scientific process to being fully fledged scientists; there'll be sort of a gradual shift. If you had to point to one moment, maybe it would be the release of o1 by OpenAI and the paradigm of test-time compute and reasoning. But I'm sure if I tried to make that claim, you could go and look at GPT-4 and see that glimpses of that sort of useful behavior for the scientific process were already present. As a general point, you know, the models are very good at certain types of things that clearly are amenable to making progress in math.

Speaker 1 08:13 - 08:21

They're not open-loop, fully fledged scientists in any domain, although, you know, neither am I. It seems like it's just this really nice gradual process.

Speaker 2 08:21 - 09:05

So it feels like a particularly fun week to be having this conversation, because over the last few days there were a number of different announcements in the general field of AI and mathematics around the Erdős problems. OpenAI came out first with this progress, but almost within a few hours Google DeepMind had a claim as well on different problems, and then Anthropic had some claims. However, from what I understand, the OpenAI approach and the DeepMind approach were very different, and that may be very interesting in terms of what it means for AI as a research scientist.

Speaker 1 09:05 - 09:35

This conjecture, everyone assumed it was true but could not prove it. One of the things that ChatGPT was able to do was assume it was false. And when you go against the grain and do something contrarian like that, you really have to have strong conviction in what you're doing in order to persevere down a really long calculation path. Because there are a lot of choices that you can make along the path. And if you get any of those choices wrong, if your ideas don't work, then you find out that you didn't make any progress.

Speaker 1 09:35 - 10:09

And so you need this really strong persistence. And then you need expertise in this other field, which is algebraic number theory, some sort of generalization of number theory to, you know, things that sort of generalize the integers and the real numbers. If you go down that path really far, you can refute this conjecture. So that was the big result. The big result was that this conjecture of this lower bound for the number of pairs that you can make is false. Not only is it false, it was false due to a really interesting connection to another field of mathematics.

Speaker 1 10:09 - 10:25

And so you would have to be somebody who is aware of this problem as interesting, which sounds like, you know, one area of expertise, and then be an expert in something else, and then also be super contrarian and go down this really long path. And then you would have identified the solution.

Speaker 2 10:25 - 10:34

The OpenAI approach and the DeepMind approach were very different. Do you wanna compare and contrast the two approaches?

Speaker 1 10:34 - 11:25

One of the approaches that GDM takes is to take problems, present them in a formal language called Lean, and then use methods to search for proofs in that language. For problems to be representable, there's this process called auto-formalization, where you take the English version of the problem and you translate it into rigorous formal statements, and then you conduct your proofs there. It's designed so that the proofs can be airtight: no one has to go and check for some hidden assumption or some weird thing. I guess it's usually hidden assumptions or definitions that are not airtight. But in that setting, which is a setting that DeepMind has cared a lot about, they were able to formalize some problems and use their system to prove them. So that's one approach.
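
To make the formal route concrete, here is a toy illustration of what formalization produces. It is not taken from DeepMind's system or from any Erdős problem: the informal sentence "the sum of two even natural numbers is even" is written as a Lean 4 statement, and the proof is then checked mechanically. The names `IsEven` and `even_add_even` are made up for this sketch.

```lean
-- Informal statement: "the sum of two even natural numbers is even."
-- `IsEven` is a definition chosen for this sketch, not a library name.
def IsEven (n : Nat) : Prop := ∃ k, n = 2 * k

theorem even_add_even (a b : Nat) (ha : IsEven a) (hb : IsEven b) :
    IsEven (a + b) := by
  cases ha with
  | intro m hm =>
    cases hb with
    | intro n hn =>
      -- Witness: a + b = 2 * (m + n), by distributing 2 over the sum.
      exact ⟨m + n, by rw [hm, hn, Nat.mul_add]⟩
```

Once the statement is pinned down this way, the checker either accepts the proof or rejects it, which is the "airtight" property described above; the hard part that remains is making sure the formal statement really says what the English one meant.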

Speaker 1 11:25 - 11:54

Another approach is to just take the problem in English, with mathematical expressions as well, but just the English statement of it, which is informal, and understand what is meant by that and solve that in informal language, presenting a proof much like the way a human mathematician would, or a human mathematician who's not using Lean. And then you have to check it. The verification problem is harder because it's not something that auto-checks.

Speaker 2 11:54 - 11:56

And that second approach was OpenAI.

Speaker 1 11:56 - 12:12

Most of the results that we publicize, as far as I can think, are all in the informal setting. We have language models that we've taught to reason at test time, and one of the applications or benchmarks for that is reasoning in mathematics.

Speaker 2 12:13 - 12:35

Okay, great. All right. So let's get into reinforcement learning. To make this broadly accessible, let's start from the top. What is the one-, two-, three-sentence definition of reinforcement learning, and can you perhaps give us a simple non-technical analogy for people to understand?

Speaker 1 12:35 - 12:56

Maybe a simple thing to do would be to give you two examples of how you could try to learn something, you as an individual. And maybe we can take a game, or say a video game. Right? I'm old enough that I played the original 8-bit Mario Brothers, the Super Mario Bros. And so here are two ways you could learn how to play.

Speaker 1 12:56 - 13:15

One way you could learn how to play is your dad takes it out and plugs it in, and he boots up the game, and then he plays for a few hours. And then you just watch him play. That's all you do. So he's demonstrating how to play. And then at the end of that, you know, he's not very nice, so he doesn't let you play.

Speaker 1 13:15 - 13:30

But then he, like, goes and runs outside and does something else. And, you know, you sneak into his room, you plug it in, and you try to play. How good are you going to be? Well, all you've done is try to memorize what he's done. You haven't gotten to push any of the buttons yourself.

Speaker 1 13:30 - 13:52

You haven't gotten to interact with the game yourself. This is sometimes called expert demonstrations, and, you know, you're just trying to memorize what someone else is doing. It's a version of supervised learning, with the supervision being: you just watch what he does and accept that that's the true way of doing a thing. Reinforcement learning would be your dad's like, "Here, why don't you play?"

Speaker 1 13:52 - 14:17

Maybe he shows you once, or maybe he doesn't even need to show you, because the game is beautifully designed to sort of take you from not knowing anything to being able to play expertly, something called a curriculum. But you play. Maybe the first thing you do is you run, you hit the first bad guy, and (this example is dated, but, you know) you lose a life. But then the second time you press a button and you jump. So you're taking actions, and there's an environment that's giving you feedback.

Speaker 1 14:17 - 14:47

And, you know, there's this close connection between the environment, the actions that you can take, and the responses that you're getting. And then the final part is there's a reward. And the reward, you know, can be something that you get pretty often. For instance, every time you do something, there's some score that goes up, or it could be just something that you get at the end. So you play a game of chess, and at the very end, you get a reward, which is you won or you lost.

Speaker 1 14:47 - 15:10

But in the middle, you don't really know how you're doing until the very end. So this is called sparse rewards. I think that's the basic idea, and there are obviously lots of variants here and ways to cope with this, but it's this notion that you interact with an environment, you get a reward, and often it's in a way where you get that sort of feedback, as opposed to just trying to, you know, learn from data that you don't get to interact with.
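
The loop described here (take an action, see how the environment responds, receive a reward that may only arrive at the very end) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not anything from the conversation: the `Corridor` environment, the `run_episode` and `q_learning` helpers, and every hyperparameter are hypothetical, and tabular Q-learning is used only as one textbook RL method.

```python
import random


class Corridor:
    """Hypothetical toy environment: walk from cell 0 to the goal cell."""

    def __init__(self, length=4, max_steps=20):
        self.length = length
        self.max_steps = max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        """Action 1 moves right, action 0 moves left (cell 0 is a wall)."""
        self.pos = max(0, self.pos + (1 if action == 1 else -1))
        self.t += 1
        reached = self.pos == self.length
        done = reached or self.t >= self.max_steps
        # Sparse reward: 1.0 only on reaching the goal, 0.0 everywhere else.
        return self.pos, (1.0 if reached else 0.0), done


def run_episode(env, policy):
    """Roll out one episode and return the list of rewards received."""
    state, done, rewards = env.reset(), False, []
    while not done:
        state, reward, done = env.step(policy(state))
        rewards.append(reward)
    return rewards


def q_learning(env, episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning: learn from your own button presses, not a demo."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(env.length + 1) for a in (0, 1)}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Explore at random sometimes, and whenever both actions look equal.
            explore = rng.random() < eps or q[(s, 0)] == q[(s, 1)]
            a = rng.choice((0, 1)) if explore else int(q[(s, 1)] > q[(s, 0)])
            s2, r, done = env.step(a)
            future = 0.0 if done else gamma * max(q[(s2, 0)], q[(s2, 1)])
            q[(s, a)] += alpha * (r + future - q[(s, a)])
            s = s2
    return q
```

Calling `run_episode(Corridor(), lambda s: 1)` returns three zeros followed by a single 1.0, which is the sparse-reward situation in miniature: nothing along the way says whether the episode is going well until the final step.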

Speaker 2 15:10 - 15:15

And why does it work and why is RL so powerful?

Speaker 1 15:15 - 15:46

It works because of this ability to get feedback from the environment. You can go and learn, you know, if you're doing it right, you can figure out how to learn the things that you don't know. And I also think it's powerful because of this fact that it's much easier to learn when you're learning at the right level for you. Right? So if you want to learn, you know, addition, you shouldn't read a calculus textbook; you want to learn by being able to practice and learn at the right level.

Speaker 1 15:46 - 15:57

If I'm actually making the choices and learning from my own choices, whether they work or not, then I'm able to, like, place it in a better context for the set of things that I understand.

Speaker 2 15:58 - 16:03

Great. And then conversely, what's the catch and how does RL break?

Speaker 1 16:03 - 16:37

The setting where it's very difficult is the setting that I alluded to before, where you don't get much feedback from the environment. You have to take many, many, many, many actions, and then you get maybe "yes, that whole set of actions was good" or "no, it was bad." For instance, you're playing a game of chess and you don't know, you know, until you make all the moves. That has an opponent, so it's maybe complicated. Maybe it's that you are trying to do a homework problem, and it's research level, or, you know, someone gives you a well-defined problem like we give our language models.

Speaker 1 16:37 - 17:03

And it's, you know, a problem that requires days and days of thinking. There are so many choices that you can make along the way. And at the end, if you don't get any feedback at all, if you're just hidden in the woods by yourself scribbling in notebooks, it's very hard to make progress that way, because you don't have any feedback, you don't have any sense. If you get a yes at the end or you get a no at the end, you have no sense for which of the actions that you took, which of the things you did, were good or bad.

Speaker 2 17:03 - 17:16

Okay. Great. Now let's talk about how RL has been applied in the context of large language models. So was the first step, historically, RLHF?

Speaker 1 17:16 - 17:42

Yeah. I think that's probably fair, at least in the broad sense that the first kind of RL that was done on language models was part of this post-training process to turn a model that just tries to predict the next word on the Internet into something that will follow your instructions, be nice to you, you know, or, like, fit the form of a chatbot.

Speaker 2 17:42 - 17:47

So do you wanna define for people what RLHF is and sort of how it works quickly?

Speaker 1 17:47 - 18:19

The basic idea is that you collect data from humans. That's the RLHF: reinforcement learning from human feedback. So you collect data from humans and you train a value function. In the language model setting, you would show people, say, two different completions from a language model and ask them to say which is better. These sorts of comparisons could be used to train a value function, and then you can use that as a reward for a reinforcement learning process.

Speaker 2 18:19 - 18:24

Great. And you do that initially with humans, but then you build that into a reward model?

Speaker 1 18:25 - 18:41

Yeah. So you would train a model for this, because during the training process, you can't just pause your training run to ask some humans for input. Right? That feedback would have way too much latency. So instead, you need a proxy for what a human would say.

Speaker 1 18:41 - 18:47

So you train this model based on the human preference data, and then you can optimize against it, or at least a little bit.
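
A small sketch of the step just described, turning pairwise human comparisons into a model that scores completions. Everything here is an assumption made for illustration: completions are reduced to two made-up features, the reward is linear in them, and the Bradley-Terry form (probability that A is preferred to B is the sigmoid of the reward difference) is a common textbook choice rather than a description of OpenAI's pipeline.

```python
import math


def reward(weights, features):
    """Linear reward model: a weighted sum of hand-made features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, dim, lr=0.5, steps=200):
    """Fit weights so that chosen completions outscore rejected ones.

    pairs: list of (chosen_features, rejected_features) from human raters.
    Maximizes the Bradley-Terry log-likelihood by plain gradient ascent.
    """
    weights = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            p_wrong = 1.0 / (1.0 + math.exp(margin))  # 1 - sigmoid(margin)
            for i in range(dim):
                grad[i] += p_wrong * (chosen[i] - rejected[i])
        weights = [w + lr * g / len(pairs) for w, g in zip(weights, grad)]
    return weights


# Hypothetical features: (follows_instruction, is_rude).
PAIRS = [
    ([1, 0], [0, 0]),  # rater preferred the answer that followed instructions
    ([1, 0], [1, 1]),  # rater preferred the polite answer
    ([0, 0], [0, 1]),
    ([1, 0], [0, 1]),
]

weights = train_reward_model(PAIRS, dim=2)
```

After training, `reward(weights, x)` stands in for the human rater inside the RL loop. The "at least a little bit" caveat matters because the proxy is only trustworthy near the data it was fit on; pushing a policy hard against it can find completions the model scores highly that people would not actually prefer.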

Speaker 2 18:48 - 19:01

One of the famous things in the history of RL is move 37. How do you train a model to encourage it to do that kind of thing and come up with brand-new ways, while being efficient and exploiting known paths?

Speaker 1 19:02 - 19:19

Yeah. So the great thing about Go is that you can just train it. It's a, you know, zero-sum two-player game. You can train in what's called self-play. It plays itself, and it can go from playing randomly to expert play, and it will find whatever the best strategies are.
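
Self-play in a zero-sum two-player game can be sketched on something far smaller than Go. The example below is a stand-in, not how AlphaGo was trained: it uses rock-paper-scissors and regret matching, a simple algorithm whose time-averaged strategies approach an equilibrium in two-player zero-sum games. The function names and the rock-biased starting point are assumptions made for this illustration.

```python
# Row player's payoff; index 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]


def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive accumulated regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / 3] * 3


def action_values(opponent):
    """Expected payoff of each pure action against the opponent's mix."""
    return [sum(PAYOFF[a][b] * opponent[b] for b in range(3)) for a in range(3)]


def self_play(iterations=20000):
    """Two copies of the same learner play each other and update regrets."""
    regrets = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # player 0 starts rock-biased
    totals = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        for me in (0, 1):
            # The game is symmetric, so both players can use the same matrix.
            values = action_values(strategies[1 - me])
            baseline = sum(v * p for v, p in zip(values, strategies[me]))
            for a in range(3):
                regrets[me][a] += values[a] - baseline
                totals[me][a] += strategies[me][a]
    # The time-averaged strategy is the one that approaches equilibrium.
    return [[t / iterations for t in player] for player in totals]
```

Player 0 begins by always throwing rock, and neither player is ever shown a good strategy; the pressure of playing against a copy of itself is what pulls the averaged strategies toward the even mix over rock, paper, and scissors.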

Speaker 1 19:20 - 19:46

So if that means exploring, great. If that means exploiting... actually, I have a funny story about this. So I met Noam Brown in grad school. He went to a different grad school than me, but he wanted to enter MIT's poker bot competition. And he had a poker bot that was the best in the world, but it wasn't something that would compete against humans yet.

Speaker 1 19:46 - 20:14

He had just won in this research competition. He collaborated with me and another friend to enter MIT's poker bot competition. This was great for me, actually, because I learned about some really exciting work in AI and I got very excited about this while I was doing physics. We were playing essentially this kind of self-play equilibrium strategy. There are some nuances, but, you know, essentially, we could not lose, assuming we did not have any bugs in our code.

Speaker 1 20:16 - 20:54

The way this thing worked was that it was a tournament where you would be paired with, say, another person and play them, and then, depending on the amount of points you got in some sort of round-robin setup, they would eliminate the bottom half, and they would keep going until you got to the final table, which would just be, say, you versus the other person. And so there was the award ceremony with the scores, and we didn't know what had happened. But there was someone showing, you know, what everyone's scores over time looked like. And there were, say, 64... I think there were 32 people playing, actually. So it was like a round-of-32 kind of tournament.

Speaker 120:16 - 20:54

这个东西的运作方式是这样的：那是一个锦标赛，你会和另一个人配对进行对局；然后根据你在某种 round robin（循环赛）机制里拿到的积分，他们会淘汰掉排名靠后的半数选手，再继续进行，直到最后来到 final table，也就是比如只剩下你和另一个人对决。到了公布比分、颁奖的时候，我们其实并不知道结果如何。但现场有人展示了一个图，大意是看每个人的分数随时间是怎么变化的。我记得当时好像是 64 人，不过我想实际上应该是 32 人参赛，所以大概是一个 32 人规模的比赛。

Speaker 120:54 - 21:22

And for 30 people, you know, their scores over time were all very negative and going down. And then there was one person whose score was, like, pretty much straight up. And then there was another that was, like, pretty good, but not with a crazy slope. And so, do you wanna guess which one we were? So we were the lower slope, and then there was this other guy that had this crazy slope, who was just, like, completely crushing all the other players.

Speaker 120:54 - 21:22

随着比赛进行，30 个人的分数基本都非常负，而且还在不断往下掉。然后有一个人的分数走势几乎是笔直向上的；还有另一个人的表现也不错，但斜率没有那么夸张。所以你想猜猜我们是哪一个吗？我们是那个斜率较小的，而另外那个人的斜率特别夸张，简直是在彻底碾压其他所有选手。

Speaker 121:22 - 21:42

And then this happened for the round of 16, the round of eight, the round of four. And then in the round of two, it's heads-up, us versus this guy who, over the course of this tournament, had won way more than us overall, like, taking more money from everyone else. And then we crushed him. Because why? Because he was exploiting the weaknesses of everybody else.

Speaker 121:22 - 21:42

接着这种情况在 16 强、8 强、4 强都发生了。然后到了最后两人对决的时候，就是我们和那个家伙单挑；在整个锦标赛过程中，他总体上赢得比我们多得多，也就是从其他所有人那里拿走了更多“钱”。但最后我们把他打爆了。为什么？因为他一直是在利用其他所有人的弱点。

Speaker 121:42 - 22:07

Right? It had some theory of mind to try to figure out, oh, this guy, you know, does this when he bluffs. And so I assume it was very good at, like, exploiting everyone else, but we were just playing the best possible thing that you can do given, you know, so the criterion was not to maximize the amount that you get from anyone else. It was: don't lose. Like, you know?

Speaker 121:42 - 22:07

对吧？它当时需要某种 theory of mind（心智理论）来试着判断，哦，这个人你知道，会在虚张声势时这么做。所以我猜它一定非常擅长，比如说，利用其他所有人的行为模式；但我们当时采取的只是你在已知条件下所能做出的最佳策略，所以标准并不是让你从别人那里拿到最多，而是不要输。对吧，我是说？

Speaker 122:07 - 22:16

And, you know, so it's the best response to anyone's strategy. And so at the end, we had to win, assuming we did it right, and someone else playing the same strategy would tie.

Speaker 122:07 - 22:16

而且，而且，而且，你知道，它就是对任何人的策略的最佳响应。所以到最后，假设我们做对了，我们就必须赢；而如果另一个人采用同样的策略，那结果就会是平局。
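
The contrast between an equilibrium strategy and an exploitative one can be sketched on a much smaller game than poker. The example below uses rock-paper-scissors, which is an assumption made purely for illustration; the strategies and the "weak player" are hypothetical and are not the bots from the story. It shows that the exploiter earns more against a weak opponent, while the equilibrium strategy is the one that cannot lose in expectation.

```python
# Rock-paper-scissors payoff for the row player: +1 win, 0 tie, -1 loss.
# Rows and columns are ordered (rock, paper, scissors).
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]

def expected_payoff(row_strategy, col_strategy):
    """Expected payoff to the row player when both sides play mixed strategies."""
    return sum(
        row_strategy[i] * col_strategy[j] * PAYOFF[i][j]
        for i in range(3)
        for j in range(3)
    )

def worst_case(row_strategy):
    """Expected payoff against the opponent's best pure counter-strategy."""
    pure = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    return min(expected_payoff(row_strategy, p) for p in pure)

equilibrium = [1 / 3, 1 / 3, 1 / 3]  # uniform mix: the equilibrium of this game
exploiter = [0.0, 1.0, 0.0]          # always paper: tuned to beat rock-heavy players
weak_player = [0.8, 0.1, 0.1]        # hypothetical opponent who overplays rock

print(expected_payoff(exploiter, weak_player))    # exploiter profits from the weak player
print(expected_payoff(equilibrium, weak_player))  # equilibrium only breaks even here
print(worst_case(exploiter), worst_case(equilibrium))
```

In rock-paper-scissors the equilibrium strategy merely breaks even against everyone, whereas in a richer game like poker an equilibrium strategy can also profit from opponents' mistakes, which fits the heads-up result described in the story.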

Speaker 222:16 - 22:35

Okay. Fascinating. So just tying this back to the beginning of the conversation about the Erdos problem and solving unsolved math problems. Presumably, the instinct would be that you need a lot of exploration and exploitation. So how does that work in the context of novel scientific discovery?

Speaker 222:16 - 22:35

好的，很有意思。那么把这个话题和谈话一开始关于 Erdos problem 以及求解尚未解决的数学问题联系起来看，直觉上你大概会认为这需要大量的 exploration（探索）和 exploitation（利用）。那么在全新的科学发现语境下，这具体是怎么运作的？

Speaker 122:35 - 23:19

I think math research, or scientific research in general, has a lot of versions of both explore and exploit. To give the recent example, the OpenAI unit distance proof, I think, is very much in the explore setting, where the model was happy to be contrarian and try to disprove this thing that everyone believed. And it has this huge repository of understanding of all of human math. And so it was spending a very long amount of time, I forgot how many hours, but I think we published a version of this chain of thought, like hours and hours of trying different things. So it's clearly in the domain of exploration.

Speaker 122:35 - 23:19

我认为数学研究，或者更广泛的科学研究，一般都同时包含 explore（探索）和 exploit（利用）的很多变体。举最近的例子，OpenAI 的 unit distance proof，我觉得非常属于 explorer（探索者）范式：模型愿意唱反调，去尝试推翻这件大家都相信的事。它依托的是一个对整个人类数学理解的巨大知识库。所以它花了非常长的时间，我忘了具体是多少小时了，但我记得我们公布过一个回溯版本的 chain of thought（思维链），里面是它连续几个小时、好几个小时地尝试不同的方法。所以这显然属于探索的范畴。

Speaker 123:20 - 23:56

A lot of times, though, you can ask these models to compute something that they understand very well, and then that has a different structure and might look a lot like exploit. There's a paper that came out recently, after the OpenAI result, on an unrelated Erdos problem that has something to do with what happens if you have a set and you try to add the set to itself or you try to multiply the set with itself. So, like, take the elements and add them all together, or take the elements individually and multiply them together, and ask how many unique sums or products you get. There's some conjecture around that. And this one was also disproved.

Speaker 123:20 - 23:56

不过很多时候，你也可以让这些模型去计算一些它们本来就非常理解的东西，那样它的结构就不同了，看起来会更像 exploit（利用）。在 OpenAI 那个结果之后，最近又出了一篇论文，讨论另一个不相关的 Erdos problem：大意是，如果你有一个集合，然后你尝试把这个集合和它自身相加，或者把这个集合和它自身相乘。也就是说，比如把元素两两相加，或者把单个元素彼此相乘，然后看你会得到多少个不同的和或积。围绕这个问题有一个 conjecture（猜想）。而这个猜想也被推翻了。
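
The quantities described here, the number of distinct pairwise sums and distinct pairwise products of a finite set, are easy to compute directly. The sketch below is only an illustration of those two counts: the function name and the two example sets are assumptions chosen for this example and are not taken from the paper mentioned.

```python
def sum_and_product_counts(numbers):
    """Return (distinct pairwise sums, distinct pairwise products) of a finite set."""
    values = list(numbers)
    sums = {a + b for a in values for b in values}
    products = {a * b for a in values for b in values}
    return len(sums), len(products)

arithmetic = range(1, 11)                # 1, 2, ..., 10: few distinct sums
geometric = [2 ** k for k in range(10)]  # 1, 2, 4, ..., 512: few distinct products

print(sum_and_product_counts(arithmetic))
print(sum_and_product_counts(geometric))
```

An arithmetic progression keeps the number of sums small but produces many products, and a geometric progression does the reverse; questions in this area ask how small both counts can be at the same time.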

Speaker 123:56 - 24:30

And that was done by humans. And the core idea was, it's like a totally different problem, but there was inspiration from the unit distance one: the idea that you can sort of generalize from it, pick a certain type of numbers that had a certain property that the OpenAI model figured out, and they realized that this applies in this setting. So that's very much an exploit thing. And so I think the process clearly is like this, the actual discovery process.

Speaker 123:56 - 24:30

而且，而且那是人类完成的。核心想法是——这其实是一个完全不同的问题——但它受到了 unit distance 那个问题的启发：也就是你可以某种程度上从那里做泛化，选取一类具有某种性质的数字，而这种性质是 OpenAI 模型发现的，然后他们意识到，像这样的性质也适用于这个场景。所以这非常像一种 exploit（利用）。不过，所以我认为这个过程显然就是这样，真正的发现过程就是这样运作的。

Speaker 124:31 - 24:50

I think normally when you talk about explore-exploit, maybe we're talking about, when training reinforcement learning models, how should we train them? But I think there's this interesting point that in the scientific discovery process, there's really this interplay between exploration and then exploitation in order to push the field forward.

Speaker 124:31 - 24:50

我觉得通常当我们谈 explore 和 exploit 时，也许说的是在训练 reinforcement learning（强化学习）模型时，我们该怎么训练它们。但我认为有一个很有意思的点是：在科学发现的过程中，确实存在这种 interplay（相互作用）——一方面是探索，另一方面是利用，而正是这种配合才会真正把整个领域向前推进。

Speaker 224:50 - 25:15

Switching to RL in modern LLM systems. So there used to be a saying, which I think comes from Yann LeCun, that RL was the cherry on top of the cake, but I think you have argued that things have switched now and that RL is the main part, the cake. Do you wanna just walk us through what you were thinking?

Speaker 224:50 - 25:15

转到现代 LLM（大语言模型）系统里的 RL（强化学习）。以前有一种说法（我觉得这话来自 Yann LeCun），说 RL 只是蛋糕上的樱桃，但我想你一直在主张，现在情况已经反过来了，RL 才是主要部分，也就是蛋糕本身。你愿意带我们过一遍你的思路吗？

Speaker 125:15 - 25:30

Yeah. I said that about a year and a half ago. I had to give a talk that was public, and I couldn't say much. So I decided to invert this meme with the cake and the cherry. RL is really exciting.

Speaker 125:15 - 25:30

对，是的。我大概在一年半前说过那番话。当时我得做一场公开演讲，但很多东西我不能多说。所以我就决定用这个蛋糕和樱桃的 meme 把那个说法反过来表达。RL（强化学习）真的非常令人兴奋。

Speaker 125:31 - 25:46

That's what I'm here talking about. And I think that when you have a lot of compute, you want to turn that compute into intelligence in a way that's useful. And RL is one way of doing it. And we had just started doing it then. And we're gonna do a lot more of it now.

Speaker 125:31 - 25:46

这也是我今天来这里要讲的内容。我认为，当你拥有大量 compute（算力）时，你会想把这些 compute 转化为有用的 intelligence（智能）。而 RL 就是实现这一点的一种方式。那时候我们才刚开始做这件事，而现在我们会在这方面做得更多。

Speaker 225:46 - 25:57

Why did RL start working well? It's not an entirely new concept. It's been tried for many years now. What is different now?

Speaker 225:46 - 25:57

为什么 RL 现在开始表现得这么好了？这并不是一个全新的概念。很多年来人们都在尝试它。现在到底有什么不同？

Speaker 125:57 - 26:28

Yeah. I'm not sure, to be honest, when people say it wasn't working, what that actually means. Like, there was this 2016, '17, maybe even to 2018 period, before the transformer period, where DeepMind was all in on RL, and OpenAI had Dota and the Rubik's Cube and some other exciting results as well. A lot of people were all in on RL, and then there were language models. And the obvious thing to do was scale up the thing that worked, which was pre-training.

Speaker 125:57 - 26:28

是的。老实说，我不太确定，人们说它“以前不行”的时候，具体到底指的是什么。比如在 2016、2017，甚至可能到 transformer 时代之前的 2018，DeepMind 都非常投入 RL，OpenAI 也有 Dota、Rubik's cube，以及其他一些同样很令人兴奋的成果。很多人当时都在全力做 RL，后来 language model（语言模型）出现了。于是显而易见的做法，就是把那个已经被证明有效的东西继续 scale up（扩展规模），也就是 pre-training（预训练）。

Speaker 126:29 - 27:15

And I don't know what people tried for RL. As you pointed out, RLHF was a central thing that came pretty quickly. Originally, it was developed in the context of game environments, like, trying to prevent reward hacking. I think the original paper was about using human feedback to, like, control a character or something like that. But there's an interesting thing to point out here, though, which is that there's this question of how do you get models to think at test time and reason. And there's a reasoning effort at OpenAI that was quite early and spent some time and came up with some algorithms.

Speaker 126:29 - 27:15

至于人们当时到底为 RL 尝试过什么，我也不完全清楚。正如你指出的，RLHF（基于人类反馈的强化学习）很快就成为了一个核心方向。它最初其实是在 game environment（游戏环境）的语境中发展出来的，类似于想通过使用人类反馈来防止 reward hacking（奖励劫持）；我记得最早那篇论文大概是在讲用 human feedback（人类反馈）来控制一个角色之类的东西。不过这里还有一个有意思的点值得指出，那就是一直有这样一个问题：怎么让模型在 test time（测试时）进行思考和推理。OpenAI 很早就有一项关于 reasoning（推理）的工作，投入了一些时间，也提出了一些算法。

Speaker 127:15 - 27:29

I think maybe the simple thing to say is that if you have a powerful enough pre-trained model, then it can start to do well at RL. It can start to, like, use test-time compute to, for instance, solve math problems that it wouldn't otherwise be able to do.

Speaker 127:15 - 27:29

我觉得也许一个简单的说法是：如果你有一个足够强大的 pre-trained model（预训练模型），那么它就会开始在 RL 上表现得不错。它们会开始真正“思考”，会利用 test-time compute（测试时算力）去做事，比如解那些原本它并不能解出的数学题。

Speaker 227:30 - 27:52

There was this viral analysis from earlier this year, February, I think, that claims that RL produces less than one bit of useful information per 10,000 tokens, and then Karpathy called it sucking supervision through a straw. What is your take on this and the overall efficiency of RL?

Speaker 227:30 - 27:52

今年早些时候，大概是 2 月，有一篇传播很广的分析声称，RL 每 10,000 个 token（词元）产生的有用信息还不到 1 bit，随后 Karpathy 把这形容为“sucking supervision through a straw（像用吸管吸监督信号）”。你怎么看这个说法，以及 RL 整体的效率？

Speaker 127:52 - 28:22

If you look at, like, the DeepSeek algorithm, which is a public thing that we can talk about, then you train on sequences that are correct. So whether it's correct or not is maybe one bit of information. So, you know, you can see where that logic comes from. I think the question is, like, is this doing a kind of thing that you can't otherwise do? Like, maybe you would want to give more supervision, but how are you going to do that?

Speaker 127:52 - 28:22

如果你看一下 DeepSeek 的算法（这是一个公开的东西，所以我们可以讨论），那么它训练用的是那些正确的 sequence（序列）。所以，一段序列到底正不正确，也许确实只有 1 bit 的信息量。于是我觉得，你可以理解这种逻辑是怎么来的。我认为真正的问题是：它是不是在做一种你用别的方法做不到的事情？比如，也许你会想提供更多 supervision（监督信号），但你到底要怎么提供呢？
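
The "one bit" reasoning can be made concrete with a small calculation. This is only one plausible reading of where such an estimate comes from, not the method of the analysis being discussed: it assumes a single pass/fail reward per episode and measures its information content with Shannon entropy. The function names and the 10,000-token episode length are assumptions for illustration.

```python
import math

def outcome_bits(p_correct):
    """Shannon entropy, in bits, of a pass/fail reward that is 1 with probability p_correct."""
    if p_correct in (0.0, 1.0):
        return 0.0
    q = 1.0 - p_correct
    return -(p_correct * math.log2(p_correct) + q * math.log2(q))

def bits_per_token(p_correct, tokens_per_episode):
    """Upper bound on reward information per generated token within one episode."""
    return outcome_bits(p_correct) / tokens_per_episode

print(outcome_bits(0.5))            # a coin-flip outcome carries exactly 1 bit
print(bits_per_token(0.5, 10_000))  # at most 1e-4 bits per token
```

A reward that is almost always 1 or almost always 0 carries even less than one bit, so under these assumptions the per-token figure is an upper bound.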

Speaker 128:22 - 28:47

I think it's very clear that these methods have led to a bunch of breakthroughs in terms of the explosion of what the models can do, you know, both in coding and in science. I think broadly, it's about getting models to think at test time, to use test-time compute and do reasoning. And there's clearly a lot of pieces of the RL process that are essential to make that work.

Speaker 128:22 - 28:47

我觉得很明显，这种方法、这些方法已经带来了一系列突破，尤其体现在 model（模型）能力的爆发上——无论是在 coding（编程）还是 science（科学）方面都是如此。总体来说，核心是在 test time（测试时）让 model 去“思考”，去利用 test-time computing（测试时计算）并进行 reasoning（推理）。而且很显然，RL（reinforcement learning，强化学习）过程中的很多组成部分，对让这件事真正奏效是必不可少的。

Speaker 228:47 - 29:32

What's your overall feeling in terms of how far we can go with that current sort of systems model where we have pre-training and then we have RL on top? Somewhat famously last year, there was a conversation with Rich Sutton on the Dwarkesh podcast where his claim, my best attempt to paraphrase it, was that LLMs were not really intelligent and therefore RL was the only way to do it, and pure RL, not LLM plus RL. What is your take on this? I mean, obviously, you're on an RL team at a company that does, you know, both pre-training and RL combined. So what's your take?

Speaker 228:47 - 29:32

你整体上怎么看这种当前的 systems model（系统范式）还能走多远，也就是我们先做 pre-training（预训练），然后再在上面叠加 RL。去年有个相当有名的讨论，是 Rich Sutton 在 Dwarkesh podcast 上说的，如果我尽力转述他的意思，那就是：LLM（large language model，大语言模型）其实并不真正智能，因此唯一可行的路径是 RL，而且是纯 RL，不是 LLM 加 RL。你怎么看这个观点？当然，你自己就在一家同时做 pre-training 和 RL 结合的公司的 RL 团队里。所以你的看法是什么？

Speaker 129:32 - 30:02

Let me tell another story. So before I did my PhD, I spent two years in the UK, and I was at Oxford for one of those years. And I was at a pub, as one does, with, you know, two of my close friends; one was a cognitive scientist, and one was a linguist. And so we had the sort of argument that you do in those situations when you're that age. And so it was something like: physics is the most fundamental of all the sciences, because it explains how the world works and everything is in the world.

Speaker 129:32 - 30:02

我再讲个故事吧。在我读 PhD 之前，我在 The UK 待了两年，其中有一年在 Oxford。然后我像大家那样去 pub（酒吧），当时我身边有两个很亲近的朋友，一个是 cognitive scientist（认知科学家），一个是 linguist（语言学家）。于是我们就进行了那种在那个年纪、那种场合很常见的争论。大意是：像 physics（物理学）这样的学科，才是一切科学中最基础的，因为它解释了世界如何运作，而一切都存在于这个世界之中。

Speaker 130:02 - 30:24

I said this earlier: my computer exists in the world, I exist in the world, we all follow the laws of physics. And then the cognitive scientist said something like, yes, but then you have to, you know, process it. So there's all sorts of cognitive biases about that, and, you know, the way you collect data and learn something. But then the linguist was like, blah blah blah, Wittgenstein. You know, everything goes through language.

Speaker 130:02 - 30:24

我之前也说过，我的 computer（计算机）存在于这个世界里，我也存在于这个世界里，我们都遵循 physics（物理）的规律。然后那位 cognitive scientist 大概说的是：是没错，但接下来你还得去处理这些东西。所以这里面会有各种 cognitive bias（认知偏差），以及你收集 data（数据）、学习某些东西的方式之类的问题。但接着那个 linguist 就开始说，什么 Wittgenstein 之类的，意思是，一切都要通过 language（语言）。

Speaker 130:24 - 30:37

That's the method of communication. That's, you know, the way words mean things is the central thing. And when we wanna talk about the laws of physics, we have to use language. And I sort of feel like he and Wittgenstein, like, that was correct. Right?

Speaker 130:24 - 30:37

语言就是 communication（沟通）的方法。它就是，你知道，词语如何获得意义，这才是核心。然后当我们要谈论 physics（物理）定律时，我们也必须使用 language。我现在有点觉得，他还有 Wittgenstein，说得是对的。对吧？

Speaker 130:37 - 31:02

That's what, or at least the path through AI suggests that that is a correct path. I'm conceding now to Kyle. And if he's listening, he's now a linguistics professor. This whole idea of reinforcement learning that kicked off the previous decade's interest in AI, the sort of grounding, I think, that was needed to make things really work is through language, because everything goes through language. Right?

Speaker 130:37 - 31:02

这就是——或者至少说，AI 走过的这条路径表明——那确实是一条正确的路径。我现在算是向 Kyle 让步了。如果他在听的话，他现在已经是 linguistics professor（语言学教授）了。过去几十年里，reinforcement learning（强化学习）这个概念确实引发了 AI 领域的巨大兴趣；但我认为，当时真正让事情运转起来所需要的那种 grounding（扎根、落地），其实是通过 language 来实现的，因为一切都要经过 language。对吧？

Speaker 131:02 - 31:47

All the Internet, right, it incorporates the grounding of the real world, all of our scientific knowledge, all of our mathematical knowledge, you know, the sum total, basically, of human work is represented on the Internet in language. And so having the model have a prior of language and being able to, like, think in language and then train on top of that, that seems like clearly the right thing to do, and, you know, seems also well grounded in a way that, even before all this, somebody might have argued would make sense. It's like an amazing prior to have, to start with, for an intelligence, because it's very much based on us and our society. I have other disagreements with Rich Sutton, if you want to poke at that.

Speaker 131:02 - 31:47

整个 Internet（互联网）都是这样，对吧？它吸收了真实世界的 grounding，承载了我们全部的 scientific knowledge（科学知识）、全部的 mathematical knowledge（数学知识），以及整个人类——你知道——基本上所有工作的总和，都是以 language 的形式表现在 Internet 上的。所以，让 model 先拥有 language 这个 prior（先验），能够用 language 去思考，然后再在这个基础上继续训练，这显然就是正确的做法。而且，这种做法还有一种很扎实的 grounded（有根基的）感觉：甚至在这一切发生之前，就可能已经有人会论证说，这本来就说得通。对于一种 intelligence（智能）来说，把它作为起点是个惊人的 prior，因为它在很大程度上是建立在我们以及我们的社会之上的。至于我和 Rich Sutton 还有别的分歧，如果你想继续追问的话。

Speaker 231:47 - 31:51

Yes, so just give us one or two quick ones.

Speaker 231:47 - 31:51

好，那就快速说一两个吧。

Speaker 131:51 - 32:18

I have a somewhat contrarian take on the Bitter Lesson: it's not that scale is all you need. You also need to have good ideas to guide the scaling. There's, like, a deeper interplay than just scaling things up. For instance, if you were just trying to scale pre-training, you wouldn't get anywhere near as far as also trying to scale RL on top of pre-training, which is what we do now, and our models are much more powerful for, like, that very good idea, investing in that good idea.

Speaker 131:51 - 32:18

我对 Bitter Lesson（苦涩的教训）有一个有点唱反调的看法：并不是“只要有 scale（规模化）就够了”。你还需要有好的想法来引导这种 scale。这里面有一种比“单纯把东西做大”更深层的相互作用。比如说，如果你只是想把 pre-training（预训练）做大，你能走到的地方，远远不如在 pre-training 之上再把 RL（强化学习）也规模化（这就是我们现在在做的事），而我们的模型也因此强大得多。这很大程度上就是因为那个非常好的想法，并且我们对那个好想法进行了投入。

Speaker 232:18 - 32:20

And that the good ideas come from humans?

Speaker 232:18 - 32:20

所以这些好想法是来自 humans（人类）的吗？

Speaker 132:20 - 32:41

Well, maybe they'll come from AI in the future, but before we had AI, they came from humans. I mean, scaling was also a good idea that came from humans, but there's, like, this interplay: you elicit new phenomena at scale. You try to understand them at that scale, and then that points you in new directions, and then you develop new ideas, and then you try to apply scale to those ideas. So I think it's not just scale, scale, scale.

Speaker 132:20 - 32:41

嗯，也许未来它们会来自 AI，但在我们有 AI 之前，它们确实来自 humans。我的意思是，scaling（规模化）本身也是一个来自 humans 的好想法，但这里面还存在一种互动：在 scale 上会涌现出新的现象。你会试着去理解那些在那个规模上出现的现象，而这又会把你指向新的方向；然后你发展出新的想法，再尝试把 scale 应用到这些想法上。所以我认为，这不是单纯的 scale、scale、scale。

Speaker 232:41 - 33:03

Since you mentioned test-time compute, I think that's something that still puzzles people, which is the whole chain-of-thought thing, which is so magical from a user perspective, whatever you can see. What actually happens during test-time compute that creates those artifacts? What does the model actually do?

Speaker 232:41 - 33:03

既然你提到了 test time compute（测试时计算），我觉得这仍然是很多人困惑的地方，也就是整个 chain of thought（思维链）这件事。从用户视角看，不管你能看到什么，它都显得很神奇。在 test time compute 期间，究竟发生了什么，才会产生那些表象？模型实际上在做什么？

Speaker 133:03 - 33:36

I think it does what you see it do. We lightly rewrite it or summarize it, but it just produces tokens. And those tokens are like a running thought process, just like you might have, or maybe it's more akin to, if you're solving a math problem, the scratch pad, the collection of notes that you have, but it just keeps generating. The cool thing about generating is that, you know, it's a forward pass of the model, so we're using a bunch of computation. So, you know, it's a way of leveraging a lot more computation on a problem than you would before.

Speaker 133:03 - 33:36

我觉得，它做的就是你看到它在做的事。我们会轻微改写或者总结一下，但它本质上就是在不断产生 tokens（词元）。而这些 tokens 就像一个持续进行中的思考过程，类似于你自己的思路；或者更贴切一点，像是你在解数学题时写下的 scratch pad（草稿纸）、一堆笔记，但它就是一直继续生成。生成这件事很酷的一点在于，你知道，那是模型的一次 forward pass（前向传播），所以我们用了大量计算。也就是说，这是一种在某个问题上利用比以前多得多计算量的方式。

Speaker 133:36 - 34:24

So my colleague Noam Brown likes to talk about the Riemann hypothesis a lot. And, you know, wouldn't you want to have a model that runs for years and can resolve that, prove that? If you present it and you want it to produce an answer right away, then it only has the number of FLOPs in a single forward pass to produce one token of its forced answer. But if it gets to answer after thinking, you know, after a long time, it can reuse its weights and produce a final answer that is a function of a much, much larger amount of computation, and the natural way it thinks is in language. It's a language model. And so that's sort of the key insight, that you can cause it to do better just by producing a thought process in token space, in language.

Speaker 133:36 - 34:24

我的同事 Noam Brown 很喜欢谈 Riemann hypothesis。你想想看，难道你不希望有这样一个模型：它可以运行很多年，从而解决、证明那个问题吗？如果你把问题直接呈现给它，并且要求它立刻给出答案，那它就只能依靠单次 forward pass 里的那点 flops（浮点运算次数），马上产出它那个被迫即时给出的答案中的一个 token。但如果它可以在“思考之后”再回答，经过很长时间之后再回答，那它就能反复复用自己的 weights（权重），最终给出一个由大得多、大得多计算量所决定的答案。而且，它自然的思考方式就是用 language（语言）来进行。它是一个 language model（语言模型）。所以这就是那个关键洞见：只要让它在 token space（词元空间）里、在 language 里产出一个思考过程，你就能让它做得更好。
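
The compute argument here can be sketched with back-of-the-envelope arithmetic. The figures below are assumptions for illustration only: the common rule of thumb of roughly 2 FLOPs per parameter per generated token (which ignores attention over the growing context), a hypothetical 100-billion-parameter model, and a hypothetical 100,000-token thought process.

```python
def forward_flops_per_token(num_parameters):
    """Rough forward-pass cost per generated token (assumed ~2 FLOPs per parameter)."""
    return 2 * num_parameters

def answer_flops(num_parameters, thinking_tokens):
    """Compute spent up to the first answer token: one forward pass per thinking token, plus one."""
    return forward_flops_per_token(num_parameters) * (thinking_tokens + 1)

params = 10**11  # hypothetical 100B-parameter model
immediate = answer_flops(params, thinking_tokens=0)
deliberate = answer_flops(params, thinking_tokens=100_000)

print(immediate)                # cost of answering with no thinking
print(deliberate // immediate)  # how many times more compute the delayed answer draws on
```

The same weights are reused on every pass, so under these assumptions the extra compute scales with the number of thinking tokens rather than with model size.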

Speaker 134:24 - 34:47

And this was known before RL, the idea that if you gave a model examples of thinking things out, it would do this before it produced a final answer. Or if you just told it to, then it would do this sort of thing. Right? Going back to this SFT, or supervised learning, versus reinforcement learning analogy that I gave earlier, like, there's a lot of examples on the Internet of people thinking for a long time. And so, like, it's not completely useless.

Speaker 134:24 - 34:47

而且这一点在 RL 之前就已经知道了：如果你要求模型，或者你给模型一些“把事情一步步想清楚”的示例，它在给出最终答案之前就会这么做。或者如果你只是这样告诉它，它也会做出这种事。对吧？回到我之前提到的那个 SFT（监督微调） versus supervised learning（监督学习） versus reinforcement learning（强化学习）的类比，Internet 上本来就有很多人长时间思考的例子。所以，这种东西并不是完全没用。

Speaker 134:47 - 34:50

It can channel that a bit, but RL really, like, brings that out.

Speaker 134:47 - 34:50

它可以稍微把这种能力引导出来一点，但 RL 才是真正把它充分激发出来。

Speaker 234:50 - 35:10

Is what happens during test-time compute RL-related, or created by RL? Because that's effectively what you described earlier when you were defining RL. The model goes in one direction, decides maybe that's not a fruitful one, backtracks, tries something else. Is that correct or not?

Speaker 234:50 - 35:10

在 test-time compute（测试时计算）过程中发生的事情，是和 RL（强化学习）有关，还是由它产生出来的？因为这其实就是你之前在定义 RL 时描述的内容：模型朝一个方向前进，判断那也许不是一条有成果的路，于是回退，再尝试别的东西。这样理解对吗，还是不对？

Speaker 135:10 - 35:40

I think maybe the result of the RL process is that the model can then think at test time, and that's why we have these dials, or various companies and labs have reasoning effort dials. Right? So you've now created a model that will produce a bunch of tokens before it outputs a final answer. And, like, causing that to be good is what RL is doing, or one of the things RL is doing. And so the output of doing RL training is the ability to have a model that thinks.

Speaker 135:10 - 35:40

我想，也许 RL 过程的结果，是模型随后能够在 test time（测试时）进行“思考”，这也就是为什么我们会有这些调节旋钮，或者说各种公司、实验室会有 reasoning effort（推理强度）旋钮。对吧？也就是说，你现在创造出了一个模型，它会在输出最终答案之前先生成一大串 token（词元）。而让这一过程变得有效、变得好，正是 RL 在做的事情之一。所以，进行 RL 训练的产出，就是得到一个会“思考”的模型。

Speaker 235:40 - 36:03

So one of the key questions in the field is whether you can expand and generalize the success that LLM systems have had, in particular in coding and now math, domains where you can sort of verify whether what the model comes up with is correct or not. What is your view on that? And perhaps start by explaining what a verifiable reward is.

Speaker 235:40 - 36:03

所以，这个领域里的一个关键问题是：你能否把 LLM（大语言模型）系统已经取得成功的那套东西扩展并泛化出去，尤其是在 coding（编程）以及现在的 math（数学）这些领域——也就是那些你多少可以验证模型给出的结果到底正确与否的领域。你怎么看这个问题？也许可以先从解释什么是 verifiable reward（可验证奖励）开始。

Speaker 136:04 - 36:43

So a verifiable reward is, in principle, a reward that can't be hacked. So if it's a math problem and the answer is an integer, you just string match the integer, and then you verify that it solved the problem correctly. That abstraction has all sorts of problems with it, but a problem that can't be verified is: is this a good piece of creative writing? There's not something you can sort of string match against, right? That involves questions of taste, and maybe different people answer differently. So maybe it's a distributional kind of thing.

Speaker 136:04 - 36:43

所谓 verifiable reward，原则上就是一种无法被 hack（钻空子／操纵）的 reward（奖励）。也就是说，如果这是一个数学题，而答案是一个整数，那你就只需要对这个整数做 string match（字符串匹配），然后验证它是否真的把题解对了。这个抽象本身有各种各样的问题；而与之相对，那种不能被验证的问题，比如“这是不是一篇好的创意写作？”——这就不存在一个你可以拿来做 string match 的东西，对吧？这会涉及审美判断，而且不同的人可能有不同看法。所以它也许更像是一种 distributional（分布式／分布意义上的）东西。
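
The string-match idea for integer answers can be sketched in a few lines. This is a minimal illustration, not any lab's actual grader: the function names are hypothetical, and taking the last integer in the output as the final answer is an assumption made to keep the example short.

```python
import re

def extract_final_integer(model_output):
    """Return the last integer in the text as a string, or None if there is none."""
    matches = re.findall(r"-?\d+", model_output)
    return matches[-1] if matches else None

def verifiable_reward(model_output, reference_answer):
    """1.0 if the final integer string-matches the reference answer, else 0.0."""
    final = extract_final_integer(model_output)
    return 1.0 if final == str(reference_answer) else 0.0

print(verifiable_reward("12 * 12 = 144, so the answer is 144", 144))  # 1.0
print(verifiable_reward("I think the answer is 143", 144))            # 0.0
```

Even this tiny grader shows why the abstraction is imperfect: a lucky guess earns the same reward as a correct derivation, and an answer written as "0144" would be marked wrong.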

Speaker 136:43 - 36:47

And so there's clearly a big gap between those two things.

Speaker 136:43 - 36:47

所以，这两类东西之间显然存在一个很大的鸿沟。

Speaker 236:47 - 37:01

So do you think there is a path for RL to be truly effective at domains without verifiable rewards? So, you know, consulting, banking, legal, I mean, clearly, there's tremendous progress in those domains, but, like, what is happening?

Speaker 236:47 - 37:01

那么你觉得，对于那些没有 verifiable rewards（可验证奖励）的领域，RL 是否有一条真正有效的路径？比如说，咨询、银行、法律，我的意思是，这些领域显然已经有了巨大的进展，但到底发生了什么？

Speaker 137:01 - 37:09

I definitely think OpenAI will have amazing products that will be relevant in those domains, and some amount of RL will play a role in there.

Speaker 137:01 - 37:09

我非常相信，OpenAI 会推出在这些领域中也高度相关的优秀产品，而 RL 在其中会发挥一定作用。

Speaker 237:09 - 37:23

Does RL generalize, meaning that as you train it against more and more domains, it becomes disproportionately good at learning the next domain?

Speaker 237:09 - 37:23

RL 会泛化吗？也就是说，随着你让它在越来越多领域上接受训练，它在学习下一个新领域时，是否会以不成比例的幅度变得更擅长？

Speaker 137:23 - 38:00

I mean, we wanna make a model that is generally intelligent and push that intelligence as far as possible. And to do that, we want to make everything part of the distribution. And then we also want to make it robust in cases where it encounters things that were not in the distribution. If RL is part of that process, then, you know, but, like, I think there's a vague sense, as I was trying to say earlier, that there's a lot of things that are very fuzzy, but, you know, clearly, the question of generalization in AI is this important, central one. And there's a bunch of examples, I think, that support that the processes can do this.

Speaker 137:23 - 38:00

我的意思是，我们想做一个具备通用智能的 model，并把这种智能尽可能往前推进。为此，我们希望让一切都成为 distribution（分布）的一部分。同时，我们也希望在它遇到不在这个 distribution 里的事物时，依然具备 robust（鲁棒）性。如果 RL（强化学习）是这个过程的一部分，那当然是这样；但我想，正如我之前试图表达的那样，大家隐约都有一种感觉：这里面有很多东西其实都很模糊，不过很显然，AI 里的 generalization（泛化）问题是一个重要而核心的问题。而且我认为，也有不少例子支持这些过程确实能够做到这一点。

Speaker 238:00 - 38:32

So going back to your physics roots, a lot of what we just described, this interplay between pretraining and RL and, you know, all the various bits that we described, those are clearly pretty complex systems, and you were trained in a discipline that is all about studying complex systems. What can physics teach us about how to understand those AI systems that we're currently building?

Speaker 238:00 - 38:32

所以回到你的 physics 背景，我们刚才描述的很多内容——比如 pretraining（预训练）和 RL 之间的相互作用，以及我们提到的各种不同部分——显然都是相当复杂的系统。而你受训于一个专门研究复杂系统的学科。那么，physics 能教会我们什么，帮助我们理解当前正在构建的这些 AI 系统？

Speaker 138:32 - 39:13

I think there's a lot of angles to answering that question. I think maybe the most interesting one, or the most relevant one to how we work currently, and maybe this is a contrarian take, is that the way to think about scaling and, say, scaling laws is not small to big, but big to small. And I'll get to why physics really matters for this in a second. You know, you have the existence of some really big AI system and some weird things happen, and they didn't happen at the small scale. And so we say, like, oh, this, whatever, emerged at scale.

Speaker 138:32 - 39:13

我觉得可以从很多很多角度来回答这个问题。我想，也许最有意思的一个角度，或者说与我们当前工作方式最相关的一个角度——也许这还是个有点唱反调的看法——是：理解 scaling（扩展）以及所谓 scaling laws（扩展定律）的方式，不应该是从小到大，而应该是从大到小。我马上会说到为什么 physics 在这件事上真的很重要。当你面对一个非常大的 AI 系统时，会发生一些奇怪的现象，而这些现象在小规模下并没有发生。于是我们就会说，哦，这是某种“在规模上涌现出来”的东西。

Speaker 139:13 - 39:25

Right? Sometimes people use the word grokking, or there's something discontinuous about the scaling sequence. Or the scaling law is broken. These are things that people might say. But I think I reject that entirely.

Speaker 139:13 - 39:25

对吧？有时候人们会用 grokking 这个词，或者说 scaling 的过程里出现了某种不连续性。又或者说，scaling law 失效了。人们可能会这样说。但我觉得我完全不接受这种说法。

Speaker 139:25 - 39:34

I think it means that you didn't understand something about what you were scaling up. Maybe even going back to the reasoning thing. Like, I don't know if this is true. This is a cartoon. I wasn't at OpenAI at the time.

Speaker 139:25 - 39:34

我认为，这其实意味着你并没有真正理解你正在放大规模的那个东西。甚至可以回到 reasoning（推理）这件事本身。比如说，我也不知道这是不是事实，这只是个漫画式的简化；而且那时候我也不在 OpenAI。

Speaker 139:34 - 40:07

But if you imagine trying to, like, get small models to reason, you know, GPT-1, GPT-2, GPT-3, and then GPT-4, again, as a cartoon, you might say, like, oh, this emerged at scale, and it doesn't happen for the small models. You know, I reject that. Instead, there's some phenomenon that's really exciting that we discovered, like reasoning, or, you know, maybe something bad, like, you know, your model blew up and your earlier models didn't blow up. And your job is to then figure out how to restore smoothness to the scaling sequence.

Speaker 139:34 - 40:07

但如果你设想自己在尝试让小模型学会 reasoning，比如 GPT-1、GPT-2、GPT-3，再到 GPT-4（把它非常卡通化地说），你可能会说，哦，这是在规模上涌现出来的，小模型不会这样。这个说法我不接受。相反，真正发生的是：我们发现了某种非常令人兴奋的现象，比如 reasoning；或者也可能是某种不好的现象，比如你的 model 崩掉了，而更早的模型没有崩掉。接下来你的任务，就是弄清楚如何让整个 scaling 的序列重新恢复平滑。

Speaker 140:07 - 40:37

Go back and make smaller and simpler models or simpler toy examples such that the whole thing is smooth. And if you can do that, if you can figure out what to put into the small thing, then you understand the thing. And then you can move forward. This is exactly what we do in theoretical physics. There's the Standard Model, you know, I have a textbook behind me, where the description of all the forces except gravity would take, even in compact notation, like, the entire page.

Speaker 140:07 - 40:37

回过头去，构造更小、更简单的模型，或者更简单的 toy examples（玩具示例），使得整个过程是平滑的。如果你能做到这一点，如果你能搞清楚该往这个小系统里放入什么，那么你就理解了这个东西。然后你就可以继续往前推进。这正是我们在 theoretical physics（理论物理）中所做的事情。比如 Standard Model（标准模型），你知道，我身后就有一本教材；对除 gravity（引力）之外所有作用力的描述，即使用非常紧凑的 notation（记号）来写，也要占满整整一页。

Speaker 140:37 - 40:48

It's like completely gross. There's, you know, a lot of different particles. Why? You know, who knows? Some of them, there's reasons for, but they're doing all sorts of different things, canceling, whatever.

Speaker 140:37 - 40:48

它看起来就是那种完全复杂得有点恶心的东西。里面有很多很多不同的 particle（粒子）。为什么会这样？我也不知道。有些是有原因的，但它们在做各种各样不同的事情，彼此抵消之类的。

Speaker 140:48 - 41:18

You know, this just happens to be the universe that we live in. But you don't need all of that to, like, study pieces of it. To study electromagnetism, you forget about everything else. Or if you want to study the Higgs phenomenon, you know, which gives mass to some particles, you can study a simplified version of that. And so what we do, and I think one of the key moves, at least in my training in physics, is to take really complicated systems and simplify them. This often gets talked about as physicists just study spherical cows.

Speaker 140:48 - 41:18

你知道，这恰好就是我们所处的宇宙。但你并不需要把这一切都纳入，才能去研究其中的某些部分；比如你想研究 electromagnetism（电磁学）时，就先把其他一切都暂时放下。或者如果你想研究 Higgs phenomenon（Higgs 现象）——它会赋予某些粒子质量——你也可以研究它的一个简化版本。所以我们所做的，以及我认为至少在我的 physics（物理学）训练中一个关键动作，就是把真正复杂的系统加以简化。这常常被说成是物理学家只研究“spherical cows（球形奶牛）”。

Speaker 141:18 - 41:36

And I think that, like, kind of misses the point. Like, if the spherical cow is sufficient to describe the thing that you care about, then you did a good job. And if not, you did a bad job. You don't try to retreat to a setting that's simple enough where you can, like, calculate something. You try to retreat to the setting that's simple enough that contains the thing that you care about.

Speaker 141:18 - 41:36

但我觉得这种说法有点没抓到重点。如果那个 spherical cow 足以描述你关心的事物，那你就做得很好；如果不行，那你就做得不好。你不是退到一个仅仅因为足够简单、以至于你能算出点什么来的设定里。你要退到的是一个足够简单、但仍然包含你所关心之物的设定。

Speaker 141:36 - 41:56

And then you have no idea whether you can make progress there or not. But once you did, you sort of understand what the problem is. And that's, like, a lot of the work in physics. And the same thing is true in AI. You have these crazy, huge systems that have all sorts of interesting phenomena, and, you know, if you think about it the right way, they don't grok.

Speaker 141:36 - 41:56

然后你其实并不知道自己在那里能不能取得进展。但一旦做到了，你某种程度上就理解了问题是什么。而这就是 physics（物理学）里很大一部分工作。我觉得 AI 也是一样。你面对的是这些疯狂、庞大的系统，里面有各种有趣的现象；而且你知道，如果你用对了思路来看，它们并不是无法理解。

Speaker 141:56 - 41:57

There's just this nice continuity.

Speaker 141:56 - 41:57

这里面有一种很美的连续性。

Speaker 241:57 - 42:08

Do you think there could be an equivalent in AI to thermodynamics, meaning, you know, a compact theory that predicts behavior without tracking every individual bit?

Speaker 241:57 - 42:08

你觉得 AI 里会不会有一个相当于 thermodynamics（热力学）的东西？也就是说，有一个紧凑的理论，不需要追踪每一个单独的 bit（比特），就能预测系统行为。

Speaker 142:08 - 42:50

Yeah, Kaplan and McCandlish's OpenAI scaling laws work originally is a version of this, where you throw away, you know, all you know about the network is how many parameters it has and how much data you've trained it on, and you can predict, like, the final loss. I think the missing piece is going from all the individual weights and biases, and how does that add up to the scaling law. I have some very, like, initial work with and there's some other initial work about, like, trying to bridge that connection. But, like, I think that's the missing piece, like, sort of statistical mechanics to thermodynamics of, like, how do these things emerge? But there's definitely a lot of useful effective descriptions of how these systems behave.

Speaker 142:08 - 42:50

会的。Kaplan 和 McCandlish 在 OpenAI 最初关于 scaling laws（缩放定律）的工作，就是这种思路的一个版本：你把大量细节都丢开，你对网络所知道的，基本上只有它有多少 parameters（参数），以及你用多少 data（数据）训练了它，然后你就能预测，比如最终的 loss（损失）。我认为缺失的一块，是如何从所有单独的 weights（权重）和 biases（偏置）出发，得到 scaling law。我自己有一些非常初步的工作，另外也有一些别人的初步工作，在尝试打通这层联系。但我觉得这就是缺失的那一块，有点像从 statistical mechanics（统计力学）到 thermodynamics（热力学）的关系：这些东西究竟是怎样涌现出来的？不过，关于这些系统如何行为，确实已经有很多很有用的有效描述。

Speaker 142:51 - 43:08

I think the other part of your question is, like, is it enough to characterize everything that we care about. Right? There's probably a lot that we care about other than just the final loss function. And so there's more thermodynamics to be worked out in addition to, like, how does the thermodynamics arise from the microscopic description.

Speaker 142:51 - 43:08

我觉得你问题的另一部分在于：这是否足以刻画我们关心的一切，对吧？我们关心的东西很可能远不止最终的 loss function（损失函数）。所以，除了要弄清 thermodynamics（热力学）如何从 microscopic description（微观描述）中产生之外，还有更多关于“热力学”的内容有待建立。
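
As a rough illustration of the scaling-law idea described above (predicting final loss from nothing but parameter count and dataset size), here is a minimal Python sketch. It assumes one parametric form used in the scaling-laws literature, L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D. The function name `predicted_loss` and all four constants are hypothetical placeholders chosen for illustration, not values quoted in this conversation.

```python
import math

# Illustrative constants only: placeholders, not values from the episode.
ALPHA_N = 0.076   # exponent for parameter count (assumed)
ALPHA_D = 0.10    # exponent for dataset size (assumed)
N_C = 6.4e13      # parameter scale constant (assumed)
D_C = 1.8e13      # token scale constant (assumed)


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Return L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D.

    The only inputs are the number of parameters and the number of
    training tokens; no individual weight or bias is consulted.
    """
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


if __name__ == "__main__":
    # Compare two hypothetical model sizes trained on the same token budget.
    for n in (1e9, 1e10):
        print(f"N={n:.0e}, D=1e11 -> loss ~ {predicted_loss(n, 1e11):.3f}")
```

Nothing microscopic enters the function, which is the "thermodynamic" flavor of description being discussed; how such a curve arises from the individual weights is the open piece.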

Speaker 243:08 - 43:26

So at that conference a year ago, you jokingly predicted nine years to Einstein-level AI. Where do you think, all jokes aside, we are on that spectrum of AI creating scientific discovery? I mean, that's where we started the conversation, and I'm curious about where this is going.

Speaker 243:08 - 43:26

所以，在一年前那场 conference（会议）上，你曾半开玩笑地预测过：到 Einstein 水平的 AI 还要九年。抛开玩笑不谈，你觉得在这个光谱上——也就是 AI 创造科学发现这件事上——我们现在大概处在什么位置？我的意思是，这也是我们对话开始时谈到的话题，我很好奇它接下来会往哪里发展。

Speaker 143:27 - 44:18

The joke... maybe it's helpful to deconstruct a joke, as it always is, but the joke was taking the doubling time for the amount of work a system can do autonomously and figuring out how long it would take us to get to a system that can think eight years on its own. Because Einstein spent eight years discovering general relativity, and I projected that out and it was, like, nine years from last year. So, like, I hate making predictions, but I'm pretty sure something will break before that. I mean, in general, we're not just gonna, like, set up a system and let it think autonomously for eight years, if anything, because, like, the systems eight years after will be so much more powerful that it probably doesn't make sense to let a system think for a certain amount. You know, there's this amount of time it takes for the system to improve.

Speaker 143:27 - 44:18

这个玩笑——也许像往常一样，把一个玩笑拆开讲反而有点帮助——说的是：拿一个系统能够自主完成的工作量的 doubling time（翻倍时间）来推算，看看要多久我们才能得到一个能够独立思考八年的系统。因为 Einstein 花了八年时间发现 general relativity（广义相对论），我把这个趋势外推了一下，结果大概是从去年算起九年后。所以，大概就是这么个意思。我不喜欢做预测，但我很确定在那之前肯定会有某些东西先失效。我的意思是，一般来说，我们不会真的把一个系统架起来，然后让它自主思考八年；如果真要说的话，也是因为八年后的系统会强大得多，以至于让一个系统思考固定那么久这件事本身可能就没有意义。你知道，系统完成改进本身是需要时间的。

Speaker 144:18 - 45:13

Then there's the amount of time it's thinking, and, like, probably when those cross, all these scaling laws are going to break in certain ways. I do think that the kind of thing that I was trying to talk about, like, how we as physicists approach problems, that the structure and flavor of that is maybe different than, here's, like, a very well-defined thing and go and do a calculation, which is like what these Erdős problems are. I think probably we'll need to have some ideas to bridge from one to the other. It's not obvious whether it has to be a discontinuous thing or smooth, but, you know, there's part of the scientific process, I think, that the models haven't been imbued with yet. And I'm sure people are thinking about how to do that, you know, like trying to get to what is the right question as opposed to, here's a well-defined thing and go calculate, and some of that involves research taste.

Speaker 144:18 - 45:13

然后还有它实际进行思考所花的时间；而且很可能当这两条曲线相交时，这些 scaling laws（规模定律）会以某些方式失效。我确实觉得，我当时想表达的那种东西——比如我们物理学家是如何处理问题的——它在结构和风格上，也许和“这里有一个定义非常清楚的问题，你去做计算吧”这种模式不一样，而后者有点像这些 Erdos problems。我想，我们大概需要一些想法，来把这两者连接起来。我不认为现在能明显看出这一定会是一个不连续的跃迁，还是一个平滑的过渡；但我觉得，科学过程中的某些部分，模型目前还没有真正被赋予。而且我相信人们正在思考该怎么做到这一点，比如努力弄清楚“正确的问题是什么”，而不是“这里有个定义清楚的问题，你去算”，其中有一部分还涉及 research taste（研究品味）。

Speaker 145:13 - 45:16

That's not an easily verifiable thing.

Speaker 145:13 - 45:16

那不是一件容易验证的事。
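
The extrapolation behind the "nine years" joke can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `years_until_horizon`, the 2-hour starting horizon, and the 7-month doubling time are hypothetical inputs picked for illustration, not the figures used in the talk. Only the eight-year target comes from the conversation.

```python
import math

HOURS_PER_YEAR = 365.25 * 24


def years_until_horizon(current_hours: float, target_hours: float,
                        doubling_time_years: float) -> float:
    """Years needed for the autonomous task horizon to grow from
    current_hours to target_hours, assuming it doubles at a fixed rate."""
    if current_hours <= 0 or target_hours <= 0 or doubling_time_years <= 0:
        raise ValueError("all inputs must be positive")
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_time_years


# Target from the conversation: eight years of autonomous thinking.
target = 8 * HOURS_PER_YEAR

# Hypothetical inputs (assumptions, not from the source).
estimate = years_until_horizon(current_hours=2.0, target_hours=target,
                               doubling_time_years=7 / 12)

if __name__ == "__main__":
    print(f"{estimate:.1f} years")
```

With a fixed doubling time, the answer depends only on the ratio of the target horizon to the current one, so any break in the trend along the way invalidates the whole projection.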

Speaker 245:16 - 45:22

Is that what would convince you that AI is doing genuine original science?

Speaker 245:16 - 45:22

那会是让你相信 AI 正在做真正原创科学的证据吗？

Speaker 145:22 - 45:51

No, I'm convinced, and I think we're going to, like, this is clearly... I think the unit distance problem is a great example. And also just being able to take a position that is contrary and think for an extremely long amount of time, explore lots of different options, and bring to bear the full weight of disparate fields, like, where, you know, it's very unlikely to find a human that has the exact set of skills to solve some of these problems. Right? That's a huge thing.

Speaker 145:22 - 45:51

不，我已经相信了，而且我觉得我们会——这显然已经在发生。我认为 unit distance problem 就是一个很好的例子。除此之外，AI 还能采取一种相反的立场，进行极长时间的思考，探索大量不同选项，并调动彼此分散的各个领域的全部力量；因为你知道，要找到一个恰好具备解决某些问题所需全部技能组合的人类，概率是非常低的，对吧？这是非常非常重要的一点。

Speaker 245:51 - 46:02

How far do you think we are from AI research actually automating itself? Not just AI researchers using AI, but, like, AI autonomously building AI.

Speaker 245:51 - 46:02

你认为我们距离 AI research（AI 研究）真正实现自我自动化还有多远？不是指 AI 研究者使用 AI，而是说 AI 自主地构建 AI。

Speaker 146:02 - 46:41

Yeah, I think it's, again, one of these smooth things where, like, it's already doing pieces of it now, it'll do more in the future. And I know there's strong versions of this that people like to think about, but, you know, I'm not sure that we'll see, like, a really sharp phase transition versus just more and more pieces of... you know, right now a lot of coding that would take people weeks can be done very efficiently with models. So like some of these math discovery problems, there's also versions of this where, for engineering, the models are playing a more central role. And so I think there'll just be more of that. I think that there's a kind of scientific thinking that humans still seem to be very useful for doing.

Speaker 146:02 - 46:41

是的，我觉得这又是那种平滑推进的事情：它现在已经在做其中一部分了，未来会做得更多。我知道，人们喜欢设想这种事情的强版本，但我不确定我们会看到那种非常尖锐的 phase transition（相变），还是说只是越来越多的环节被它接管。比如现在，很多原本要花人类几周时间完成的 coding（编码）工作，模型已经可以非常高效地完成。所以像这些数学发现问题，在 engineering（工程）方面也有类似版本——模型正在扮演更核心的角色。所以我觉得这种情况只会越来越多。不过我也认为，仍然有一种 scientific thinking（科学思维）是人类看起来依然非常擅长、也非常有用的。

Speaker 146:41 - 47:08

And, you know, I don't wanna make specific predictions about when or how. I can imagine, like, you don't wanna be caught on record saying the models won't be good at something because you'll definitely be wrong. Or maybe I should say that, and then the models will be good at that immediately. And so I should pick the things that I want the models to do and say that they'll never do that. I think it's also just hard to make predictions, because I think the way in which people made predictions before, like, the actual ways things shook out often are not in that direction.

Speaker 146:41 - 47:08

而且，你知道，我不想对时间点或具体方式做出明确预测。我能想象的是，你最好不要留下记录说模型不擅长某件事，因为那你大概率一定会错。或者也许我就该这么说，这样模型马上就会立刻擅长那件事了。所以我也许应该挑那些我希望模型去做的事情，然后说它们永远做不到。我觉得预测之所以也很难，是因为以前人们做预测的方式，和事情最终实际展开的方式，往往并不是沿着那个方向发展的。

Speaker 147:08 - 47:41

And so, you know, it's another sort of credit assignment thing. Like, if you have this long chain of things that has to happen for whatever happened, then anything that breaks that chain means your prediction, like, it's just way off. And so, but, you know, I can make a very long distance prediction, you know, for the next six months, like, I think we'll see more of these sorts of math and science breakthroughs. And obviously, we'll turn this sort of thing on AI itself, and the models will get a lot more powerful. And that'll be fun. You could think about, you know, that you could do science of AI and have it feel like doing physics.

Speaker 147:08 - 47:41

所以，你知道，这又是一种 credit assignment（归因）问题。比如说，如果某件事的发生依赖于一长串必须依次发生的环节，那么这条链上任何一个环节断掉，就意味着你的预测会严重偏离。话虽如此，但从另一面看，你知道，我还是可以做一个时间跨度很长的预测，比如对接下来六个月，我认为我们会看到更多这类数学和科学上的突破。显然，我们也会把这类方法用到 AI 本身上，model 会变得强大得多。那会很有意思；你可以想象，去研究 AI 的科学，会让人感觉像是在做 physics。

Speaker 147:41 - 48:18

And that's true. Another really exciting thing is that, like, I entered physics thinking that I would, you know, when you first start learning a field, and maybe you want to commit to it, at least the perspective I had is that, oh, by the time I get to the end, I'll know all the answers, right? All the fundamental questions, obviously, like, this is a journey, and at the end of the journey, it'll resolve. And then, I don't know, maybe it was in grad school or maybe when I switched to AI, I realized, oh, like, some of these questions will stay open maybe forever. Maybe I'll never get to learn the answers, you know, watching older colleagues as well start to retire and realize that they may not get to learn the answers.

Speaker 147:41 - 48:18

没错。另一件非常令人兴奋的事是，我当初进入 physics 时，心里会觉得——你知道，当你刚开始学习一个领域，而且也许想长期投入其中时，至少以我当时的视角来看，会觉得，哦，等我走到最后，我就会知道所有答案，对吧？所有那些根本性问题，显然，这是一段旅程，而在旅程的终点，这些问题都会得到解决。后来，我不确定是不是在读 grad school 的时候，还是在我转到 AI 之后，我意识到，哦，其中一些问题也许会永远悬而未决。也许我永远都学不到这些答案；你知道，看着一些年长的同事开始退休，也会意识到，他们可能也无缘知道这些答案。

Speaker 148:18 - 48:35

But I feel really excited that, you know, we will get to really answer a lot of fundamental questions in the field of science that we care about with the aid, or maybe the models being the driving force. And so, yeah, that's just really thrilling.

Speaker 148:18 - 48:35

但我确实感到非常兴奋，因为，你知道，在 model 的帮助下，或者甚至由 model 成为主要驱动力，我们将真的能够回答科学领域中许多我们关心的根本问题。所以，是的，这实在令人非常激动。

Speaker 248:35 - 48:43

Well, that feels like a wonderful place to leave it. Dan, you gave us plenty to ponder. Really appreciate your spending time with us today. Thank you.

Speaker 248:35 - 48:43

好，那感觉是个非常合适的结束点。Dan，你给了我们很多值得思考的内容。非常感谢你今天抽时间和我们交流。谢谢。

Speaker 148:43 - 48:44

Thanks for inviting me. It was a pleasure.

Speaker 148:43 - 48:44

谢谢邀请我。很高兴参与。

Speaker 248:46 - 49:04

Hi. It's Matt Turck again. Thanks for listening to this episode of the MAD podcast. If you enjoyed it, we'd be very grateful if you would consider subscribing if you haven't already, or leaving a positive review or comment on whichever platform you're watching or listening to this episode. This really helps us build the podcast and get great guests.

Speaker 248:46 - 49:04

大家好，我是 Matt Turck。感谢你收听这一期 MAD podcast。如果你喜欢这一期内容，而你还没有订阅的话，我们会非常感激你考虑订阅；或者也欢迎你在你观看或收听这一期节目的平台上留下正面的评价或评论。这对我们把 podcast 做起来、邀请到优秀嘉宾真的很有帮助。

Speaker 249:04 - 49:05

Thanks, and see you on the next episode.

Speaker 249:04 - 49:05

谢谢，我们下期节目再见。

原文 ↗https://www.youtube.com/watch?v=oWOz2htozfI