BuildSpeak: daily builder digest
🎙 Podcast: Training Data · June 16, 2026 · 6,841 words · about 34 minutes

Simulating Humans at Scale: Simile's Joon Sung Park

Speaker 1 00:00 - 00:35
I am somebody who is quite inspired by science fiction. And when you read science fiction that covers societies that have progressed far enough in their technological maturity, you always see two pillars. You have some version of AGI, and you have some version of simulations that really help guide the society. I do see an opportunity today to really take the first crack at building the simulation. I would not have said that even five years ago, but that is a conviction that we have built up over the years as we've gone deep into this research.
Speaker 2 00:52 - 01:04
Today, we're delighted to have Joon, founder and CEO of Simile. Simile is building an applied AI lab simulating human behavior and societies. And I'm very excited to have you here to discuss what you're building.
Speaker 1 01:04 - 01:05
Same here. Thanks for having me.
Speaker 2 01:05 - 01:13
Okay. Take me back to April 2023, Stanford, California, specifically Smallville, Stanford, California. What was that?
Speaker 1 01:13 - 02:10
So Smallville was a project that we were running at Stanford. The idea was that we had made this observation that large language models can now encode a lot of the human behavior that is embedded in their training data from the web, social media, and so forth, and that if you probe them at the right angle, you can actually get a lot of microbehaviors out of these models. So given a very specific demonstration or description of a situation, you ask what person X would do, and the model would actually generate really interesting behaviors. We found that to be so interesting, and we found that to be the ingredient that we had been waiting for to create really complex agentic behaviors. So Smallville was actually an experiment where we asked: if we push this as far as possible, what would a society created by these agents look like? So we basically created generative agents that pair a generative AI model with memory, planning, and reflection to create this lived experience of agents living in this small town.
Speaker 1 02:10 - 02:29
So Smallville was basically a game town with 25 agents living in it. Individual agents had a persona description, but they would actually wake up in the morning, do their routines, go to work, and have relationships, sort of like people would, and they would actually show emergent phenomena, like having parties and so forth. So that was the experiment that we ran.
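The memory, planning, and reflection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the architecture used in the Smallville experiment: the `Agent` class, its method names, and the `echo_llm` stand-in for a language model are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a language model call: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class Agent:
    name: str
    persona: str
    llm: LLM
    memory: List[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        """Append a new observation to the agent's memory."""
        self.memory.append(event)

    def reflect(self, last_n: int = 5) -> str:
        """Ask the model what recent memories suggest and store the answer."""
        recent = "\n".join(self.memory[-last_n:])
        insight = self.llm(
            f"{self.persona}\nRecent memories:\n{recent}\nWhat do these suggest?"
        )
        self.memory.append(f"[reflection] {insight}")
        return insight

    def plan(self, situation: str) -> str:
        """Ask the model for the next action, given persona and recent memory."""
        context = "\n".join(self.memory[-5:])
        return self.llm(
            f"{self.persona}\nMemories:\n{context}\n"
            f"Situation: {situation}\nWhat would {self.name} do next?"
        )


def echo_llm(prompt: str) -> str:
    """Toy model so the sketch runs offline: returns the prompt's last line."""
    return prompt.strip().splitlines()[-1]


isabella = Agent("Isabella", "Isabella runs a cafe.", echo_llm)
isabella.observe("Valentine's Day is tomorrow.")
isabella.reflect()
next_step = isabella.plan("It is morning at the cafe.")
```

Swapping `echo_llm` for a real model call is where the interesting behavior would come from; the toy version only shows how observations and reflections accumulate in memory before each planning step.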
Speaker 2 02:29 - 02:33
What were the most surprising things to come out of the experiment?
Speaker 1 02:34 - 03:06
So, one of the surprising things was this. The simulation itself actually takes place the day before Valentine's Day. So you actually see one of the agents thinking, "Well, I run a cafe." She's a cafe owner. Her name's Isabella. She goes and thinks, "It would be great if I can do a Valentine's Day party where we invite a lot of friends and customers." So you actually see her on the day before Valentine's Day going around, gathering materials for the party, and telling her customers, "Hey, we're going to have this party.
Speaker 1 03:06 - 03:14
Please come." And on Valentine's Day, you actually see this emergent party form, with all these agents coming to the cafe.
Speaker 2 03:14 - 03:15
Did anyone not get invited?
Speaker 1 03:15 - 03:32
Well, some of the people did get the invitation, but they forgot. That's one thing that did happen. Some of the agents did not explicitly get invited, but we had one agent who got the invite, Klaus, who decided to ask his crush out on a date. So he would actually bring his date along. They would actually have a party at this cafe.
Speaker 1 03:32 - 03:33
So quite surreal.
Speaker 2 03:33 - 03:45
So how'd you end up building this model in the first place? Like, were you studying human psychology and social behavior, or was this coming from the customer back, or was it coming from the technology out?
Speaker 1 03:45 - 04:04
So my particular team has been excited about simulations, and we saw the vision of simulation fairly early on. My career as a researcher at Stanford really started back in 2020. That was the year when GPT-3 was about to come out. It wasn't quite there yet, but it was just about to come out. We started to get its first demos.
Speaker 1 04:04 - 04:55
And in my first year, we wrote this paper called On the Opportunities and Risks of Foundation Models, alongside many of the Stanford researchers, and it was led by one of my co-founders, Percy Liang, who is now the head of the Center for Research on Foundation Models at Stanford. And when we were writing that, the part that I was really focused on was, well, here's a new class of models that we have not seen in the past, models that can be very generalizable in ways we didn't quite have before. And I got into thinking, well, if we can imagine the kind of interaction we can create with these models, what would that be? Many of my colleagues back then were surprised that these models could do classification or simple generation, and that was really incredible to see because these models weren't really taught to do that. But the part that was surprising to me wasn't that these models can do that, because from an interaction perspective, we've known how to do this for a long time.
Speaker 1 04:55 - 05:43
The interesting part was, well, these models can actually encode human behavior. What does that mean if we were to push this as far as possible? Part of the research tradition that I come from includes what we call social computing. And social computing within human-computer interaction really has to do with this idea of how we can build a better technological platform that would enable social interactions and collaboration. One of the most difficult challenges of building a social platform is not necessarily testing the UI/UX of the system. It's more about this: when you have tens of people, millions of people, and down the line billions of people, how do all these people come together to create emergent phenomena that are both good and bad, and how can we design for that scale?
Speaker 1 05:44 - 06:20
And so far, we didn't really have a tool that would enable us to test for that. The only way we test it today is you basically field test it. You release your prototype, see what happens, and sometimes it actually comes at a real cost. Obviously, it's a high cost in terms of human hours and the time it takes. But at the same time, if you have a bad design, say a social media feed that is more likely to propagate a certain negative emotion, then obviously that is something that we want to avoid, and yet this now gets tested in the field. So we wanted to see whether we can actually create a simulation that would let you test for this.
Speaker 1 06:20 - 06:48
So in 2022, which was actually a year before generative agents, we worked on a paper called Social Simulacra, which really was the precursor to the agent paper that we ended up writing. The core thesis was: imagine you're building a subreddit. You're a designer of a subreddit. You want to see what people might do in the subreddit, which is a surprisingly hard task, even for practiced designers. And we basically decided, hey, we have this model, it seems unique, let's use this model to create simulations of the entire subreddit.
Speaker 1 06:49 - 07:30
So you define the goal, you define the moderation strategies, and you populate it with thousands of personas. Back then we didn't call them agents, we called them personas. This is basically the 2022 version of Moltbook, and it's quite interesting that the idea actually came back. When we saw that, we actually got a lot of really important insights out of it. What are the good behaviors? We actually simulated a community where the entire idea was for people to discuss with each other the places to sightsee in Pittsburgh. And all of a sudden, you start to see these personas actually collaborate and discuss, "Hey, XYZ places are amazing, do you want to go on a trip together?"
Speaker 1 07:30 - 07:57
And they actually plan those trips live in the simulated subreddit. So that's how we got excited. We saw the vision, the excitement, and the potential applications fairly early on. But the work that we then had to do was demonstrating how we can go beyond simple personas to create complex agents that can actually think over time, because we want to simulate the longitudinal aspect of our society, and then validating that these simulations are actually accurate in practice.
Speaker 2 07:57 - 08:08
Was there a point of model evolution at which you felt like, "Okay, we're there. The models are good enough for us to actually have a faithful representation of human society"?
Speaker 1 08:08 - 08:27
So, when GPT-3 came out, and Social Simulacra was built with GPT-3, it was very janky. It didn't have any instruction tuning. It did not follow your instructions. So just to have it listen to you and do what you wanted it to do, you had to do some weird tricks with prompting and so forth. But you could actually see the promise.
Speaker 1 08:27 - 08:56
The model actually had encoded a lot of human behavior, and you could actually see the trajectory. And when we had the generative agents paper, it wasn't quite JBT, but we now had instruction tuning, so we could actually build much more complex agents that can reason about their memory. That wasn't really possible when we did Social Simulacra. And since then, of course, the models have improved. So where we are today is that the models, at their foundation level, have reached a point where we can actually imagine building these kinds of applications.
Speaker 1 08:57 - 09:24
Now, there is a part here that I do think is quite interesting. Today, if you look at many of the large language model companies, whether it's OpenAI, Anthropic, or many of the new labs that are getting formed, the models they are creating are models whose north star, I would say, is something like: let's build superintelligent machines. These machines are meant to be rational, and these machines are supposed to be really amazing at technical problems that have an objective answer.
Speaker 2 09:25 - 09:29
So maybe that's not even the best simulation of true human society then.
Speaker 1 09:29 - 10:02
Turns out, people are irrational. We have a lot of subjective values, preferences, and tastes. So you actually start to see a divergence between model size going up and the model's ability to predict and simulate human behavior. So with the current modeling paradigm, we have sort of plateaued in our ability to really simulate humans. It is at a good foundational starting level, but to make it really amazing, we do need the next frontier, one that is more geared towards actually modeling people's diversity.
Speaker 2 10:02 - 10:10
Very interesting. At what point did you realize that, you know, what you did with Smallville could become a company?
Speaker 1 10:10 - 10:44
Right. So, again, the promise of application was something that I was very much inspired by early on, with Social Simulacra and so forth. But the part that I realized over time is that research and a company have very different functions. Research is an amazing vehicle if you want to basically do breadth-first search. You are in a lab surrounded by a really smart set of people, each of the researchers owns a small piece of a thesis, and they go explore, with some of those theses blossoming into amazing research products.
Speaker 1 10:45 - 11:01
But we're not necessarily known for finishing our job. We're not usually the ones to bring that research impact to the real world. A company is a machine for depth-first search. You have a conviction on an area. You find a hill that you want to climb.
Speaker 1 11:01 - 11:45
This is the vehicle that lets you put together resources and an amazing group of people to go after a singular vision without hesitation. And we got that conviction, I would say, about half a year after generative agents. After the original generative agents paper, we got so much inbound interest, initially from social scientists who wanted to run their experiments and all their RCTs on our platform. Then very soon after, many of the Fortune 500 companies saw this demo, and their board members and CEOs, who sometimes visit Stanford, saw it too. They started asking, "Well, we go run all these surveys and experiments, and there are so many research questions about the market that we cannot answer today. Can we run that in simulation?"
Speaker 1 11:45 - 12:24
That started to really intrigue me because it showed a clear line towards real-world impact for research, and we don't always have that kind of opportunity. So that is when we decided we actually wanted to validate that the simulations are accurate. We went out and created simulations of 1,000 people from the US population. We demonstrated that, using our architecture and the models, we can predict people's behaviors 85% as accurately as people replicate their own. When we saw that, we thought, okay, this is something that we feel comfortable providing to our users as a platform for simulating their really important decisions.
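One way to read "85% as accurately as people replicate their own" is as a ratio: the agent's accuracy at predicting a person's answers, divided by how consistently that person reproduces their own answers when asked again. The sketch below works through that arithmetic. The function name and the two input numbers are hypothetical and chosen only to illustrate the calculation; they are not results reported in this conversation.

```python
def normalized_accuracy(agent_accuracy: float, self_replication_accuracy: float) -> float:
    """Agent accuracy expressed as a fraction of the person's own consistency."""
    if not 0 < self_replication_accuracy <= 1:
        raise ValueError("self_replication_accuracy must be in (0, 1]")
    return agent_accuracy / self_replication_accuracy


# Hypothetical inputs: the agent matches 68% of a person's answers, and the
# same person matches 80% of their own answers when re-asked later.
agent_acc = 0.68
self_acc = 0.80
score = normalized_accuracy(agent_acc, self_acc)
print(f"normalized accuracy: {score:.2f}")
```

Normalizing this way treats a person's own test-retest consistency as the ceiling, since an agent cannot reasonably be expected to predict someone better than that person predicts themselves.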
Speaker 1 12:24 - 12:42
So that's when the co-founders got together: myself, Percy, and Michael Bernstein, who was a researcher and my adviser at Stanford. Both of them were actually my advisers. So the three of us had been working together for five years, and now, at this point, nearly six years. But that's when we got together to have the initial conversation of, "Can this be a company?"
Speaker 2 12:42 - 12:58
Got it. Amazing. Maybe walk me through a customer engagement end to end today. Like, who's a canonical customer and which department? When they come to you, what are they asking you, and what product or service do you deliver to them?
Speaker 1 12:58 - 13:39
Right. So maybe I can give an example to make this concrete. CVS has been partnering with Simile for the past, I would say, nearly half a year, and they've been an amazing partner. Our main buyer at CVS is a senior VP who leads human insights. And the original story there was that he basically read my paper that validated the agent simulations and thought, "We have to bring this to CVS, because today we are bottlenecked by the number of questions we can field test, and we are also bottlenecked by truly the physics of human society."
Speaker 1 13:39 - 14:24
It's one thing to run surveys and experiments. It's a totally different thing if, down the line, you actually want to simulate the entire market and map out all the second-order impacts of the decisions you suggest to your leadership. So he'd been looking around for that solution, and his cousin happened to know me and basically told our buyer, Sri, that the authors of the paper were actually looking to start something. So that's how we got connected. And in this particular kind of engagement, usually the way this goes is that our customers are very much used to working with polling companies or panel companies today. They go and basically tell these companies, "X, Y, Z are the populations that we are interested in better understanding.
Speaker 1 14:24 - 15:00
Can we go run a research study on these topics?" That initial stage looks very similar for Simile. Our buyers come and tell us, "We want to better understand XYZ population." Then Simile goes out, and through our partnerships with vendors (we have a strategic partnership now with Gallup, for instance, which is a polling and panel company), we work with our vendors to actually reach out to real humans. So these simulations are grounded in real data. We reach out to those people and collect data that we believe is efficient and generalizable about each person.
Speaker 1 15:00 - 15:26
So imagine you have fifteen minutes. What are the magical questions you can ask these people during that time? We collect that data and use it to create agents, or simulations, of these people that can basically be used to answer a large number of questions that go way beyond the original domain. We load that onto our platform, and it's basically a SaaS product. Our customers come and they can basically ask any questions about the group of people they are interested in.
Speaker 2 15:26 - 15:39
So interesting. It reminds me of autonomous vehicles, where you go and collect a bunch of data from the road and then you're able to augment it with simulation. Is this a similar concept, or are there big differences in what you're doing here?
Speaker 1 15:39 - 16:10
It is a similar concept in the sense that, of course, with self-driving vehicles, you want to create a model that is based on real-world physics, but you also want a model that is generalizable beyond your training data. It needs to generalize to different locations with different weather conditions. It's a very similar concept here: we want to reach out to real people, and for these people, we want to understand something fundamental about them in a way that we can encode into the model.
Speaker 2 16:11 - 16:37
I would have thought that the large language models would be such a good representation of the whole world that you could almost narrow it down. You could tell Claude, "You are a 34-year-old woman living in a bicoastal metropolitan area," and it would be able to have a faithful representation. So I'm actually surprised that you go out to Gallup. Maybe can you just explain why you have to go out and collect any real-world data at all?
Speaker 1 16:37 - 16:53
Yeah. One of the big questions here is the question around the say-do gap. There are things that people say, and then there are things that people actually do. And the gap there is real. And a lot of the large language models are trained on attitudinal data.
Speaker 1 16:53 - 17:25
Fundamentally, it is the things that people have said online that make up a large part of their training data. So one of the things that Simile's simulation platform does is actually close that gap. A lot of the data that we end up collecting is, by nature, behavioral. It also includes data that comes from literally asking questions like, "Just tell me the story of your life." Turns out, if we understand a person's life story, the kind of data you get from it is what we consider to be the long-tail information about this person.
Speaker 1 17:25 - 17:51
It's not about what you've done in this particular moment. It's not about very broad questions like "What's your view on politics?" It's about where you grew up and what some of the difficult decisions you had to make in life were. And what's interesting about this data is that it's an amazing way to build a translational layer between attitudes and behavior. So we combine these kinds of data sets, but fundamentally, that's the gap that we want to close.
Speaker 2 17:51 - 17:53
What sort of behavioral data do you have?
Speaker 1 17:53 - 18:25
So Simile does run a lot of experiments. As for the kind of models that we have trained: for instance, we have a huge repository of RCTs, randomized controlled trials, that were run in social-scientific contexts and around pricing studies. So one of the models that we are training is basically the foundation model of human behavior in quite a literal sense. We have all the behavioral signals from RCTs. Can we actually encode that into the model so that the end outcome is a model that can basically predict the results of any RCT?
Speaker 1 18:25 - 19:07
At the same time, one of the conversations that we keep having with our customers, and that we're very excited by, is this: our customers come in, see that potential, and their mind goes to, "Wow, we have 90,000,000 customers, let's say, here at CVS. How can we leverage this kind of data to create better simulations?" So there's also a conversation around how we can, in a responsible and ethical way, leverage existing data that our customers already hold in house, and then use that to create an augmented version of Simile's model. That, of course, is going to be more fine-tuned and specific to the population of these customers, but that's the kind of data that we will be leveraging.
Speaker 2 19:07 - 19:14
I see. And are you doing these interviews typically by voice? Is it a survey that you fill out? Like, what's the modality?
Speaker 1 19:14 - 19:36
So it's a huge breadth. The quick answer here is it's both. Interviews are fantastic if you want to get the long-tail information about people. In the original study that I conducted back in 2024, we'd literally ask the question, "Tell me the story of your life." Now, the way we do it is we are training our own model.
Speaker 1 19:36 - 20:17
So it's a reinforcement learning loop, but basically imagine the objective function here is: how can you spend the minimum amount of time to get the maximum amount of visibility into this person? So that is one of the things that we do. We are basically training an interviewer that is not really asking for factual information or an experience with a particular platform, but rather for the life stories people have that can be used to train our own model for these agents. And then for the more factual or more discrete-choice questions, surveys, and so forth, these are also very efficient. They are time- and data-efficient because people can fill out many questions in a short period of time.
Speaker 120:17 - 20:27
So for those, we actually do leverage them. For instance, if you want to just have a broad understanding of people's viewpoints on certain topics, certain policies, and things like that.
Speaker 120:17 - 20:27
所以对于这些方法,我们确实会加以利用。比如说,如果你只是想对人们在某些话题、某些政策之类问题上的观点形成一个宽泛的理解。
Speaker 220:27 - 20:37
You describe yourself as an applied AI lab. How do you think about where you want to build your own models versus where you want to rely on other existing models?
Speaker 220:27 - 20:37
你们把自己描述为一家 applied AI lab(应用型 AI 实验室)。对于哪些地方要构建自己的模型、哪些地方要依赖其他现有模型,你们是怎么思考的?
Speaker 120:37 - 21:33
So in terms of building our own model, really the core thesis here is there is an amazing model to be built that really unlocks the diversity of people's values, preferences, and taste in ways that simply a rational model cannot do. So one way I actually pose this: imagine today's models are akin to the CPU of the intelligence unit. It's a single model trained on amazingly rational data that is amazing at solving very complex objective questions. Simile's model is much more akin to developing something that is closer to the GPU of the intelligence unit, where the idea here is we don't actually need a model that is superhuman at Simile. In fact, we want a model that's as human as possible, but we want to make sure that these models at the sort of individual subunits can represent the real viewpoints of different subpopulations.
Speaker 120:37 - 21:33
就构建我们自己的模型而言,这里的核心 thesis(论点)其实是:有一种非常了不起的模型值得被构建出来,它能够以单纯的 rational model(理性模型)根本做不到的方式,真正揭示人们价值观、偏好和品味的多样性。我有时会这样来描述这件事:我们某种程度上是在构建一种系统——可以把今天现有的模型想象成 intelligence unit(智能单元)里的 CPU。它是一个单一模型,在极其理性的数据上训练而成,非常擅长解决非常复杂、目标明确的问题。Simile 的模型则更像是在开发更接近 intelligence unit 中 GPU 的东西。这里的想法是,我们其实并不需要一个在 Simile 上达到 superhuman(超人级)水平的模型。恰恰相反,我们希望这个模型尽可能像人,但我们也希望确保这些模型在个体子单元这一层面上,能够代表不同 subpopulations(子群体)的真实观点。
Speaker 121:34 - 21:50
So where we see that gap, that's when we go develop our own model, but at the same time, we do leverage frontier models, for instance, as a way to coordinate the research. Frontier models are amazing at coming up with a research plan, so that's where those models actually do get leveraged.
Speaker 121:34 - 21:50
所以,当我们看到这种 gap(差距)时,我们就会去开发自己的模型;但与此同时,我们也会利用 frontier models(前沿模型),比如把它们作为协调研究的一种方式。frontier models 在提出研究计划方面非常出色,所以这些模型确实会在这类场景中被用上。
Speaker 221:50 - 22:02
Very interesting. Are people typically coming to you with questions around new product launches, you know, how they should be marketing their companies, pricing, all of the above?
Speaker 221:50 - 22:02
很有意思。人们通常来找你们,是带着哪些问题来的?比如新产品发布、应该如何营销他们的公司、如何定价,还是说以上这些都有?
Speaker 122:02 - 22:23
So it is all of the above. Our customer journey usually does, however, start with very concrete use cases and problems they are trying to solve. Concept testing is a big one. It's also a very straightforward one. So they have a new concept, new product idea, new market message they want to test, and they want to hear from their users what they would think about x y z.
Speaker 122:02 - 22:23
所以确实是以上这些都有。不过,我们的 customer journey(客户旅程)通常还是从非常具体的 use case(使用场景)和他们试图解决的问题开始。concept testing(概念测试)是一个很大的方向,也是一种非常直接的方向。也就是说,他们有一个新概念、新产品想法,或者一条新的市场传播信息想要测试,他们想听听用户会如何看待 x y z。
Speaker 122:24 - 22:58
This is one way for them to quickly test those ideas. And then the promise they quickly see is, well, right now, we're very much in the practice of testing five to 10 different ideas at once, but what does it look like for us to instantly test a thousand different ideas across a thousand different subpopulations? That's the initial vision they see. Then we really get into the nitty-gritty details of, well, where does simulation go from here? They then pretty soon start asking, well, can this be used to do product testing, but not just simply submitting an image?
Speaker 122:24 - 22:58
这是他们快速测试这些想法的一种方式。然后他们很快就会看到其中的 promise(潜力):比如,当前我们非常习惯于一次测试 5 到 10 个不同想法,但如果我们能够立刻在上千个不同 subpopulations 上测试上千个不同想法,会是什么样子?这是他们最初看到的愿景。接着我们就会进一步深入到那些 nitty gritty details(细节问题)里:也就是说,simulation(模拟)接下来会走向哪里?然后他们很快就会开始问:这是否可以用于产品测试,而且不只是简单地提交一张图片?
Speaker 122:58 - 23:28
But imagine basically asking these agents, go experience this product for ten minutes and tell us about what you experienced, what you saw. So you're basically adding a temporal dimension. Then you go into things like multi-agent simulation. Some of our customers very routinely actually ask us to simulate their earnings call. This is actually a use case that surprised me at first, but this is also surprisingly a common ask because of course the CEOs and board members always need to think about, hey, how are we going to design our earnings call?
Speaker 122:58 - 23:28
但你可以想象,基本上就是让这些 agent 去体验这个产品十分钟,然后告诉我们他们体验到了什么、看到了什么。这样你其实就是加入了时间维度。接着你还会进入 multi-agent simulation(多 agent 模拟)这类场景。我们的一些客户其实会非常常规地要求我们去模拟他们的 earnings call。这个用例一开始确实让我很意外,但它又出人意料地常见,因为 CEOs 和董事会成员当然总是需要思考:我们该怎么设计这场 earnings call?
Speaker 123:28 - 23:34
How would the audience react? So that is something that we also do, and this is very much multi agent simulation.
Speaker 123:28 - 23:34
听众会如何反应?这也是我们会做的事情,而且这非常典型地属于 multi-agent simulation。
Speaker 223:34 - 24:05
It seems like there's so many use cases that could potentially be tested once you have a simulated customer population. I'm curious about the value of research and testing in sim versus just, like, let's say you have a new product concept that you want to test. Why not just go run a thousand Facebook ads and you actually get the click-through rates on this stuff? Isn't that real-world data almost more useful than the simulated data on how people might behave that you then correct for with your own models?
Speaker 223:34 - 24:05
感觉一旦你拥有一个模拟出来的客户群体,就会有非常多潜在用例可以被测试。我很好奇,在 sim(模拟)里做研究和测试的价值,与比如说你有一个新产品概念想测试相比,差别在哪里。为什么不直接去投一千条 Facebook ads,然后你还能真正拿到这些内容的 click-through rate(点击率)?那种关于人们实际会怎么行为的真实世界数据,难道不几乎比模拟数据更有用吗?毕竟后者只是关于人们可能如何行为的模拟,而你们还要再用自己的模型去校正。
Speaker 124:05 - 24:44
So it's a great question. And I think to some extent, here the answer has to do with initially scale, and then down the line, truly the new capability that comes because you can simulate the interactions. The scale question here is actually quite straightforward, where, yes, you can absolutely run Facebook ads and Facebook testing, but the kind of experiments that you can run in simulation is actual behavior simulation at scale. Right? So you can basically pull in any number of users, doesn't even have to be bounded by the size of the population that's available on Facebook, and it's also much more representative because only certain groups of people will actually respond to the online experiments.
Speaker 124:05 - 24:44
这是个很好的问题。我认为在某种程度上,这里的答案一开始和规模有关,而再往后,则真正和“因为你可以模拟交互而产生的新能力”有关。这里关于规模的问题其实非常直接:没错,你当然可以投 Facebook ads、做 Facebook 测试,但你能在 simulation(模拟)里运行的实验,是大规模的真实行为模拟。对吧?你基本上可以调入任意数量的用户,甚至不必受限于 Facebook 上可触达的人口规模,而且它的代表性也更强,因为真正会对线上实验作出反应的,其实只是一部分特定人群。
Speaker 124:44 - 25:20
But similarly, the model that we are creating, one of the key promises is that it is representative. We do the hard work of actually getting the representative set of people and then collecting the data that would actually represent them properly. So the scale and representativeness is something that many of our users do not have easy access to. This is actually one of the common asks also that we do get, or common sort of pain points that we have heard, where the question that many of these people have isn't about what questions do we ask these people, but it's about, in the first place, how can we get to the population that we're excited to talk to? That's a huge bottleneck.
Speaker 124:44 - 25:20
同样地,我们正在构建的模型,其中一个核心承诺就是它具有代表性。我们会去做那些艰苦的工作,真正找到具有代表性的一组人,然后收集能够正确代表他们的数据。所以,规模和代表性,是很多用户并不容易获得的东西。这其实也是我们经常听到的一个常见诉求,或者说常见痛点:很多人的问题并不在于“我们该问这些人什么问题”,而在于“首先,我们怎么才能接触到那群我们真正想交流的人?”这本身就是一个巨大的瓶颈。
Speaker 125:21 - 26:08
Then down the line, you can actually really start to imagine, and this is something that our customers and some of the most forward-looking customers are now going into, which is what are all the downstream implications of the decisions that you make? It's not just about whether, imagine you have this particular product, do you like it or do you not like it? Would you pay for this, not pay for this? It's not necessarily just those initial questions that we want to answer and finish, but we want to understand, imagine you're a car company, you launched an electric vehicle in this market, maybe the electric vehicle does really, really well, so we can help you do concept testing around marketing and the product around the electric vehicle. But what does that do to the perception of, let's say, non-electric vehicles? Does it change the market perception?
Speaker 125:21 - 26:08
再往后,你其实就可以真正开始设想——而这也是我们的客户,尤其是一些最具前瞻性的客户,现在正在进入的方向——那就是:你所做决策的所有下游影响到底是什么?问题不只是,假设你有这样一个产品,你喜欢还是不喜欢?你会不会为它付费?我们想回答并结束的,并不一定只是这些最初级的问题;我们还想理解的是,假如你是一家 car company,你在这个市场推出了一款 electric vehicle,也许这款 electric vehicle 表现非常非常好,所以我们可以帮助你围绕它的营销和产品本身做 concept testing(概念测试),但是这会如何影响,比如说,non-electric vehicle 的认知?它会改变市场 perception(认知)吗?
Speaker 126:08 - 26:31
Then what does it mean for the rest of the product line? And how do you balance those kinds of second-order impacts of your decision in a way that is more evidence-based? Today, there's no way to test for this. You can run this in simulation. So really going beyond simply asking one question at a time, but then to think about what are the long-term implications of your decisions is something that our customers are quite excited by.
Speaker 126:08 - 26:31
接着,这对其余产品线又意味着什么?你又该如何以一种更基于证据的方式,去权衡你这个决策带来的二阶影响?在今天,这种事是没法测试的。而你可以在 simulation 里测试它。所以,真正令人兴奋的地方,是不再只是一次问一个问题,而是开始思考你的决策会带来哪些长期影响——这一点让我们的客户非常感兴趣。
Speaker 226:31 - 26:47
I'd love to understand how you think about how predictive your model is in actually simulating real human behavior. I imagine you have a lot of evals on this. I guess, what is your North Star metric? How do you guys do on that? And what do you think is the theoretical limit?
Speaker 226:31 - 26:47
我很想了解,你们是如何看待自己的模型在模拟真实人类行为方面到底有多强的预测性的。我猜想你们在这方面一定做了很多 evals(评估)。那么,你们的 North Star metric(北极星指标)是什么?你们在这个指标上的表现如何?以及,你认为它的理论上限在哪里?
Speaker 126:47 - 27:19
It's a great question. So theoretical limit, and let me just start from there, certainly does exist in the sense that humans are genuinely, there's a lot of randomness, that if you ask me the same question, I'll actually answer the question slightly differently. There is something in that degree of randomness in human behaviour. However, there's a lot of gains in performance that we can have even today in the way we are predicting people. So the measurement that we do is so at the level of population, we measure the distribution of responses if it is more quantitative.
Speaker 126:47 - 27:19
这是个很好的问题。所以先从理论极限说起,这种极限当然是存在的,因为人类本身确实带有很强的随机性——如果你问我同一个问题,我其实每次都会给出略有不同的回答。人类行为里确实存在这种程度的随机性。不过,即便如此,我们今天在“预测人”这件事上的表现仍然还有很大的提升空间。所以我们的衡量方式,其实是在总体 population 层面上进行的;如果是更偏定量 quantitative 的任务,我们会测量回答的分布 distribution。
Speaker 127:19 - 27:57
So we actually measure total variation distance, which basically shows how close are the distributions of the ground truth versus the simulated information. And that is a metric that we run across all the use cases that our customers have, and we have a certain threshold that we believe is good enough for decision making. So a TVD of, let's say, less than 0.15, we believe, is actually quite strong evidence for making a decision. So that is a north star state that we want to hit for this class of use cases that are more quantitative, that's more question and answers. This also does cover RCTs, which is many of the core use cases our customers have.
Speaker 127:19 - 27:57
所以我们实际衡量的是 total variation distance(总变差距离),这个指标基本上反映了真实情况 ground truth 的分布与模拟信息的分布到底有多接近。这是一个我们会在客户所有 use case 上统一运行的 metric(指标),而且我们有一个我们认为足以支持决策的阈值 threshold。比如说,如果 TVD 小于 0.15,我们认为这其实已经是非常有力的决策依据了。所以对于这类更偏定量、更多是问答 question-and-answer 的 use case 来说,这是我们想达到的 north star state(北极星目标状态)。这也包括 RCTs,而这正是客户很多核心 use case 的组成部分。
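To make the metric above concrete, here is a minimal sketch of total variation distance between the real and simulated answer distributions for one survey question. The function name, the example answers, and the choice to treat answers as categorical options are assumptions for illustration; only the metric itself and the 0.15 threshold come from the conversation, and nothing here reflects Simile's actual evaluation code.

```python
from collections import Counter

def total_variation_distance(real_answers, simulated_answers):
    """TVD between two empirical distributions over categorical answers.

    TVD = 0.5 * sum over options of |p_real(option) - p_sim(option)|,
    so 0.0 means identical distributions and 1.0 means no overlap.
    """
    if not real_answers or not simulated_answers:
        raise ValueError("both answer lists must be non-empty")
    real_counts = Counter(real_answers)
    sim_counts = Counter(simulated_answers)
    options = set(real_counts) | set(sim_counts)
    n_real, n_sim = len(real_answers), len(simulated_answers)
    return 0.5 * sum(
        abs(real_counts[o] / n_real - sim_counts[o] / n_sim) for o in options
    )

# Hypothetical data: 100 real respondents vs. 100 simulated agents.
real = ["yes"] * 60 + ["no"] * 30 + ["unsure"] * 10
simulated = ["yes"] * 50 + ["no"] * 35 + ["unsure"] * 15

tvd = total_variation_distance(real, simulated)
DECISION_THRESHOLD = 0.15  # the threshold quoted in the interview
print(f"TVD = {tvd:.3f}, usable for decisions: {tvd < DECISION_THRESHOLD}")
```

In this made-up example the simulated shares are off by 10, 5, and 5 percentage points, which gives a TVD of 0.10 and falls under the quoted threshold.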
Speaker 127:58 - 28:10
Now, there's actually a really interesting question to ask around, well, what about multi agent simulation? What about all the downstream implications that we're going to be simulating? What does the evaluation of those look like?
Speaker 127:58 - 28:10
现在,其实还有一个非常有意思的问题:那 multi agent simulation(多 agent 模拟)怎么办?对于那些我们接下来要模拟的各种下游影响 downstream implications,又该怎么看?这些东西的评估 evaluation 又会是什么样子?
Speaker 228:10 - 28:22
Yeah, and then do errors daisy-chain as you kind of... If this one is 85% accurate and then this agent is telling another agent something... Do you accumulate errors as you go towards multi-agent?
Speaker 228:10 - 28:22
对,那如果一路传递下去,会不会出现 daisy chain errors(链式误差)?比如这个系统有 85% 的准确率,然后这个 agent 又把信息告诉另一个 agent——当你走向 multi agent 时,误差会不会一路累积起来?
Speaker 128:22 - 28:39
Exactly. And one of the core theses here is we basically see two categories of simulations. One is what I would consider to be simulations that converge. The other category is the simulations that diverge. And sometimes they actually coexist, and it's really about what research questions do you have.
Speaker 128:22 - 28:39
完全是这样。而这里的一个核心 thesis(论点)是:我们基本上把模拟分成两类。一类是我认为会收敛 converge 的模拟;另一类则是会发散 diverge 的模拟。有时候这两类实际上会同时存在,关键还是看你的 research question(研究问题)到底是什么。
Speaker 128:39 - 29:14
For questions that converge, it doesn't actually matter if you have a little bit of error. Now, the error obviously cannot be so dramatic that it is completely detached from reality, but you actually are okay even if the errors do compound over time because the pull towards the convergence is strong enough that you'll actually understand where everything would fall. A good example here actually is if you simulate a network of people, then that network will always have a hub that gets formed. This is what network scientists would call a scale-free network, for instance. This is actually what powered Google too.
Speaker 128:39 - 29:14
对于会收敛的问题来说,存在一点误差其实并不那么重要。当然,这个误差不能大到明显脱离现实,但即便误差会随时间叠加,你通常也还是可以接受,因为收敛的拉力足够强,最终你仍然能看清所有东西会落到哪里。这里一个很好的例子是:如果你模拟一个人际网络,那么这个网络最终总会形成一个 hub(枢纽节点)。比如说,这就是 network scientists(网络科学家)所说的 scale-free network(无标度网络)。这其实也是 Google 得以运作的重要基础之一。
Speaker 129:14 - 29:43
One of the core observations of PageRank was it doesn't matter how these networks actually get formulated. You actually see some web pages that get exponentially more links that are attached to them. This is a very fundamental behavior in humans that we also see in simulated networks, and that convergence always happens as long as you are replicating human behavior with a certain threshold accuracy. Now there are then questions that generally do diverge.
Speaker 129:14 - 29:43
PageRank 的一个核心观察就是:这些网络究竟是如何形成的,其实并不重要。你总会看到某些网页获得指数级更多的链接指向它。这是人类当中一种非常基础的行为模式。我们在模拟网络里也会看到这一点,而且只要你能以某个阈值以上的准确率复制 human behavior(人类行为),这种收敛就总会发生。然后,当然也有一些问题通常是会发散的。
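One standard way to see hubs emerge is preferential attachment, where each newcomer links to an existing node with probability proportional to the links that node already has. The sketch below is a textbook-style toy (in the spirit of the Barabási-Albert model), not the generative-agent simulation discussed in the interview; the function name, network size, and seed are all assumptions for illustration.

```python
import random
from collections import Counter

def grow_network(n_nodes, rng):
    """Grow a network one node at a time using preferential attachment.

    `endpoints` lists every node once per link it participates in, so a
    uniform draw from it picks a node with probability proportional to
    its current number of links.
    """
    degree = Counter({0: 1, 1: 1})  # start from two linked nodes
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)  # well-linked nodes are favored
        degree[new_node] += 1
        degree[target] += 1
        endpoints.extend([new_node, target])
    return degree

degree = grow_network(2000, random.Random(0))
hub, hub_links = degree.most_common(1)[0]
median_links = sorted(degree.values())[len(degree) // 2]
print(f"largest hub: node {hub} with {hub_links} links; median node: {median_links}")
```

Rerunning with different seeds changes which node becomes the hub, but a small number of heavily linked nodes appears each time, which is the sense in which this kind of structure "converges".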
Speaker 129:43 - 30:13
It's like your classical questions like, was World War I inevitable or was it not? And there, it is sometimes difficult to run the same simulation over time and get the same exact outcome. Imagine you're running a... This is not something that Simile necessarily is going into right now, but imagine you're running a simulation of an election. Will the same person win the election every time? There are a lot of downstream implications of every single decision that does happen, so it does diverge.
Speaker 129:43 - 30:13
这有点像那些经典问题,比如:第一次世界大战是不可避免的吗,还是并非如此?在这种问题上,要随着时间反复运行同一个 simulation(模拟)并得到完全一样的结果,有时会很困难。设想一下——这未必是 simile 目前正在做的事情——假如你在运行一场选举的 simulation,同一个人每次都会赢得选举吗?每一个实际发生的决策都会带来大量下游影响,因此结果确实会发散。
Speaker 130:14 - 30:35
There, the core evaluation is around confidence. So imagine you run the simulation 100 times. How many of those times do the results come out to be x? And how can we actually use that to basically create a bootstrap resampling to calculate the confidence around the simulations? Those are some of the questions that we do ask.
Speaker 130:14 - 30:35
在这里,核心评估围绕的是 confidence(置信度)。所以你可以想象一下,如果你把这个 simulation(模拟)运行 100 次,其中有多少次结果会是 x?而我们又该如何真正利用这一点,基本上构建出一种 bootstrap resampling(自助重采样),来计算这些 simulation 周围的 confidence?这些就是我们确实会提出的一些问题。
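The "run it 100 times and bootstrap" idea above can be sketched as follows. The list of run outcomes is hypothetical, and the percentile bootstrap with 2,000 resamples and a 95% interval is one common choice rather than anything the speaker specifies.

```python
import random

def bootstrap_share_ci(outcomes, target, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the share of runs ending in `target`.

    Each resample draws len(outcomes) runs with replacement and records the
    share of them equal to `target`; the interval is read off the sorted shares.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    shares = []
    for _ in range(n_resamples):
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        shares.append(sum(o == target for o in resample) / n)
    shares.sort()
    low = shares[int((alpha / 2) * n_resamples)]
    high = shares[int((1 - alpha / 2) * n_resamples) - 1]
    point = sum(o == target for o in outcomes) / n
    return point, low, high

# Hypothetical results of 100 independent simulation runs of the same scenario.
runs = ["A"] * 62 + ["B"] * 38

point, low, high = bootstrap_share_ci(runs, target="A")
print(f"outcome A in {point:.0%} of runs, 95% bootstrap interval [{low:.2f}, {high:.2f}]")
```

A wide interval, or one that straddles 50%, would be a signal that the scenario is in the divergent regime and that the spread of outcomes matters more than any single run.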
Speaker 130:35 - 30:54
And a huge part of this also, of the power of simulation, is then to show when it diverges, to show the diversity of possible outcomes so that people can actually look, understand the cause or mechanism of how we got to those outcomes, and prepare for those futures. So those are some of the implications of divergence in simulations.
Speaker 130:35 - 30:54
另外,这里面还有很大一部分,也就是 simulation 的力量所在,是要展示它何时会 diverge(发散),展示各种可能结果的多样性,这样人们才能真正去看、去理解我们是如何得到这些结果的原因或机制,并为这些可能的未来做好准备。所以这些就是 simulation 中 divergence 所带来的一些含义。
Speaker 230:55 - 31:07
Are there any mathematical descriptions of like why something would converge or diverge? Like, I'm imagining if you have like an average function, maybe you converge, and then if it's like a, you know, you're splitting outcomes to binary, then you might maybe diverge, but Yeah.
Speaker 230:55 - 31:07
有没有一些数学上的描述,可以解释为什么某件事会 converge(收敛)或者 diverge 呢?比如我在想,如果你有一个 average function(平均函数),也许就会收敛;如果是那种、你知道的,把结果分裂成 binary(二元)分支的情况,那么也许就会发散,不过,是的。
Speaker 131:08 - 31:45
So the intuition I think is close, and technically this is also a research topic. So Simile is a company where we do go deep into this research topic, in the sense that I see simulation as a field as akin to developing your day one of inferential statistics. You know, inferential statistics scientists actually had to do a lot of discussion and research over time to decide that p less than 0.05 is actually evidence that is strong enough for science. Simile is working on setting the same kind of threshold and standards for the rest of the field. So those are the intuition.
Speaker 131:08 - 31:45
所以我觉得这种直觉是接近的,而从技术上讲,这本身也是一个 research topic(研究课题)。同样地,作为一家公司,我们确实会深入研究这个课题,因为在我看来,simulation 这个领域有点像是在发展 inferential statistics(推断统计学)的“第一天”。你知道,做 inferential statistics 的科学家其实也花了很多时间讨论和研究,才最终认定 P less than 0.05 实际上是对科学来说足够强的证据。类似地,这项工作也是在为这个领域的其余部分设定同类的 threshold(阈值)和 standards(标准)。所以这就是其中的直觉。
Speaker 131:45 - 31:56
I think that's exactly the right intuition in terms of actually how to make a robust mathematical equation around what's going to happen when, it is a real research frontier for simulations.
Speaker 131:45 - 31:56
我认为,就“如何围绕在什么情况下会发生什么,建立一个稳健的数学方程”这一点来说,你的直觉完全是对的,而这也正是 simulation 领域一个真正的 research frontier(研究前沿)。
Speaker 231:56 - 32:22
Thank you for being nice about my vibe mathing. I'm curious, you know, it seems like there's a lot of Fortune 500s coming to you. I'm wondering whether there are, you know, non-corporate use cases where, you know, there are, like, great mysteries of our society that might become solved. And for example, I'm wondering about economics, you know, central bank decisions. Oftentimes, like, I personally believe that in macro, nobody knows nothing.
Speaker 231:56 - 32:22
谢谢你这么友善地看待我这种 vibe mathing。我很好奇,看起来似乎有很多 Fortune 500 公司在找你们。我在想,是否还存在一些目前尚未出现的企业 use case(用例);或者说,我们社会中的一些重大谜题,也许会因此被解开。比如说,我在想 economics(经济学),比如 central bank(中央银行)的决策。很多时候,就我个人而言,我相信在 macro(宏观经济)这个层面上,没人真正知道什么。
Speaker 232:23 - 32:52
And oftentimes, a lot of the issues come about from human psychology. So to me, macroeconomics is a function of simulating human behavior at scale. I'm thinking even in the venture capital use case, we often debate internally, does value accrue to this company or not? You could run the simulation of all the different layers of the AI stack and almost figure out where durability and value accrues, if you had a kind of perfect simulator of human behavior.
Speaker 232:23 - 32:52
而且很多问题往往都来自 human psychology(人类心理)。所以对我来说,macroeconomics(宏观经济学)本质上是大规模模拟 human behavior(人类行为)的一个函数。我甚至会想到 venture capital(风险投资)的 use case:我们内部经常会争论,价值到底会不会积累到这家公司身上。你可以把 AI stack(AI 技术栈)的所有不同层都拿来运行 simulation,几乎就能弄清 durability(持久性)和 value(价值)会积累在哪里——前提是你拥有某种近乎完美的 human behavior 模拟器。
Speaker 232:53 - 33:02
There's so much more you could do than serving the Fortune 500. Do you agree with that? And then if so, are you serving governments, you know, and the like?
Speaker 232:53 - 33:02
你们能做的事情显然远不止服务 Fortune 500。你同意这一点吗?如果同意的话,那么你们是否也在为 governments(政府)之类的对象提供服务?
Speaker 133:03 - 33:22
Yeah. So it's interesting. When we were still researching in this area, the way I actually got my advisors back then, Michael and Percy, excited about this was I basically told them, Look, if we do this right, there's a Nobel Prize to be won there. And I truly believe that.
Speaker 133:03 - 33:22
是的。所以这很有意思。当我们还在研究这个领域时,我当年真正让我的导师 Michael 和 Percy 对这件事兴奋起来的方式,基本上就是我告诉他们:你看,如果我们把这件事做对了,这里面有一个 Nobel Prize 可拿。而且我至今真的这么相信。
Speaker 133:22 - 34:10
And it's also not surprising in that your classical economics simulations, things like agent-based models, that really pioneered our understanding back in the day, the kind of topics they studied was, how does segregation happen? What are the cause or mechanism for segregation? So scholars like Thomas Schelling would actually build agent-based models that are extremely simple and rudimentary, but that showed something deep about human macro behaviors. And he, of course, went on to win a Nobel Prize. I see the same opportunity here, but in an augmented way, where back in the day, the agent-based models were very much deterministic in some sense, where you basically, in this simulation of, let's say, like, a model of segregation from, like, thirty years ago, each individual agent was simply a red dot or a blue dot.
Speaker 133:22 - 34:10
这也并不令人意外,因为在传统的 economics simulation(经济学模拟)里,像 agent based model(基于 agent 的模型)这样的东西,曾经真正开创了我们对许多问题的理解。那时候他们研究的主题包括:segregation(隔离)是如何发生的?导致 segregation 的原因或机制是什么?所以像 Thomas Schelling 这样的学者,会构建非常简单、非常初级的 agent based model,但它们却揭示了关于人类宏观行为的深刻东西。而他当然后来也获得了 Nobel Prize。我在这里看到了同样的机会,只是是一个增强版。因为当年那些 agent based model 在某种意义上非常 deterministic(确定性),比如在大约三十年前那种关于居住隔离的模拟里,每个 agent 其实就只是一个 red dot 或 blue dot。
Speaker 134:11 - 34:43
And every game iteration, each agent would look around, see how many of its neighbors are of the same color. And if that share goes below a certain threshold, then it will decide to move to a new location. That was it. But now we can actually create real agents that replicate the full richness of individuals and run the same kind of simulations. So the kind of questions that we can ask go beyond simply the commercial use cases. For instance, in the context of macroeconomics, actually, the questions that I actually did get asked from economists were things like, When does bank fraud happen?
Speaker 134:11 - 34:43
在每一轮博弈迭代中,它们会看看自己周围,数一数邻居里有多少和自己是同一种颜色。如果这个比例低于某个阈值,它们就决定搬到一个新位置。仅此而已。但现在,我们实际上可以创建真正的 agent,去复制个体全部的丰富性,并运行同类模拟。所以,我们现在能提出的问题,已经不只是商业应用场景了。比如在 macroeconomics(宏观经济学)的语境里,经济学家确实问过我的问题会是:bank fraud(银行欺诈)究竟什么时候会发生?
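The red-dot/blue-dot rule described above is simple enough to sketch directly. This is a minimal Schelling-style toy, assuming a square grid, an 8-cell neighborhood, 10% empty cells, and a 30% same-color threshold; all of those parameters and function names are illustrative choices, not details given in the interview.

```python
import random

def same_color_share(grid, r, c):
    """Fraction of occupied neighbors sharing this agent's color (1.0 if it has none)."""
    size = len(grid)
    colors = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            nr, nc = r + dr, c + dc
            if (dr, dc) != (0, 0) and 0 <= nr < size and 0 <= nc < size:
                if grid[nr][nc] is not None:
                    colors.append(grid[nr][nc])
    if not colors:
        return 1.0
    return sum(color == grid[r][c] for color in colors) / len(colors)

def make_grid(size, empty_share, rng):
    """Random grid with equal numbers of red and blue agents plus empty cells."""
    n_cells = size * size
    n_empty = int(n_cells * empty_share)
    n_agents = n_cells - n_empty
    cells = ["red"] * (n_agents // 2) + ["blue"] * (n_agents - n_agents // 2)
    cells += [None] * n_empty
    rng.shuffle(cells)
    return [cells[i * size:(i + 1) * size] for i in range(size)]

def step(grid, threshold, rng):
    """One iteration: agents unhappy at the start of the round move to a random empty cell."""
    size = len(grid)
    cells = [(r, c) for r in range(size) for c in range(size)]
    empties = [(r, c) for r, c in cells if grid[r][c] is None]
    movers = [(r, c) for r, c in cells
              if grid[r][c] is not None and same_color_share(grid, r, c) < threshold]
    rng.shuffle(movers)
    for r, c in movers:
        if not empties:
            break
        i = rng.randrange(len(empties))
        er, ec = empties[i]
        grid[er][ec], grid[r][c] = grid[r][c], None
        empties[i] = (r, c)  # the vacated cell becomes available
    return len(movers)

def average_share(grid):
    """Mean same-color neighbor share over all agents, a rough segregation index."""
    size = len(grid)
    shares = [same_color_share(grid, r, c)
              for r in range(size) for c in range(size) if grid[r][c] is not None]
    return sum(shares) / len(shares)

rng = random.Random(0)
grid = make_grid(size=20, empty_share=0.1, rng=rng)
before = average_share(grid)
for _ in range(30):
    step(grid, threshold=0.3, rng=rng)
after = average_share(grid)
print(f"average same-color neighbor share: {before:.2f} -> {after:.2f}")
```

The contrast the speaker draws is that each agent here carries only a color and one threshold, whereas a generative agent would decide whether to move based on a much richer description of an individual.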
Speaker 134:44 - 35:02
Or questions like climate change. One of the sort of core blockers of solving that issue is the collective action problem of many nations. Can we actually simulate that? Or, what are the signals of a democracy that is about to collapse? Can we understand the origin story of the monetary system?
Speaker 134:44 - 35:02
或者像 climate change(气候变化)这样的问题。气候问题之所以难解决,一个核心阻碍就是许多国家之间的 collective action problem(集体行动问题)。我们能不能真的把它模拟出来?再或者,一个 democracy(民主制度)即将崩溃时,会出现哪些信号?我们能不能理解 monetary system(货币体系)的起源故事?
Speaker 135:03 - 35:34
These are the kind of simulations that I do believe ought to be the North Star State of this field. And it is sort of interesting to imagine, like, what that would actually look like in practice, right, because these would involve very large scale simulations with many agents interacting with each other. I do see a future where, today, this is something not the case. Today, a simulation is quick and fast to run. But what about simulation that takes actually $100,000,000 to run once and could take many months to run?
Speaker 135:03 - 35:34
我确实相信,这类模拟才应该成为这个领域的 North Star(北极星式目标)。而且,去想象它在实践中究竟会是什么样子,也挺有意思,对吧,因为这会涉及大规模模拟,其中有许多 agent 相互互动。我确实看得到这样一个未来:今天这还不是现实。今天的 simulation(模拟)通常运行得又快又省事。但如果有一种 simulation,一次运行就真的要花 $100,000,000,而且可能要跑上好几个月呢?
Speaker 135:35 - 35:44
But when we run it, it solves one of the fundamental questions of our society. That I do think is genuinely a very exciting possibility for this field. I agree.
Speaker 135:35 - 35:44
但当我们运行它时,它能解答我们社会的一些根本性问题。我确实认为,这对这个领域来说是真正非常令人兴奋的一种可能性。我同意。
Speaker 235:44 - 35:58
I'm even thinking like politics, for example, could be forever changed. Today everyone has an agenda of how they say some policy change will impact things. Well, why don't we just run the simulation?
Speaker 235:44 - 35:58
我甚至在想,比如 politics(政治)都可能因此被永久改变。今天每个人都会带着自己的 agenda(立场/议程),去宣称某项政策变化会如何影响现实。那我们为什么不直接运行 simulation 呢?
Speaker 135:58 - 36:05
And understand all the downstream implications and not just what's going to happen this year, but what does it mean in the next five to ten years.
Speaker 135:58 - 36:05
然后去理解所有 downstream implications(下游影响),而不只是看今年会发生什么,还要看它在未来五到十年意味着什么。
Speaker 236:05 - 36:14
Exactly. Fascinating. I was gonna close by asking you what makes you excited about the future? Is it what we just talked about or is it something else?
Speaker 236:05 - 36:14
完全同意。很有意思。最后我本来想问你,是什么让你对未来感到兴奋?是我们刚刚谈到的这些,还是别的什么?
Speaker 136:14 - 37:05
I am somebody who is quite inspired by science fiction, and when you read science fiction that covers societies that have progressed far enough in its technological maturity, you always see two pillars. You have some version of AGI, and you have some version of simulations that really help guide the society. I do see an opportunity today to really take the first crack at building the simulation. I would not have said that even five years ago, but that is the conviction that we have built up over the years as we are going deep into this research. And what's exciting is there's a clear use case today that can serve our users, but then there's a lot of innovation that is yet to come that I do think will build up to actually building simulator that's akin to discern human society.
Speaker 136:14 - 37:05
我这个人很受 science fiction(科幻)启发,而当你读那些描写社会在技术成熟度上已经发展到相当高阶段的 science fiction 时,你总会看到两个支柱:某种形式的 AGI,以及某种形式、能够真正帮助引导社会的 simulation(模拟)。我确实觉得,今天我们有机会第一次真正尝试把这种 simulation 建出来。即便五年前我都不会这么说,但这是这些年来随着我们不断深入这项研究而逐步建立起来的信念。令人兴奋的是,今天已经有一个清晰的 use case(应用场景)可以服务我们的用户,但与此同时,未来还有大量尚未到来的创新,而我确实认为,这些创新最终会积累到真正建成一种能够洞察 human society(人类社会)的 simulator(模拟器)。
Speaker 137:05 - 37:39
And one of the things that one of my co-founders, Percy, sometimes says is you look at the greatest scientific innovations, they often start from an amazing measurement. The Hubble telescope really changed the trajectory of how we understand the universe. Simulation can be that for human society. So the thing that does excite me, there's a lot of focus on natural sciences, but how can simulation really unlock our understanding of humanity and social sciences, and how can we actually use it to make our society a better place? That's exciting.
Speaker 137:05 - 37:39
我的 co-founder Percy 有时会说,你去看那些最伟大的科学创新,它们往往都始于一次惊人的 measurement(测量)。Hubble telescope 真的改变了我们理解宇宙的轨迹。simulation 对 human society 也可能起到这样的作用。所以真正让我兴奋的是,大家现在对 natural sciences(自然科学)关注很多,但 simulation 要怎样才能真正解锁我们对 humanity(人性/人类)以及 social sciences(社会科学)的理解?我们又该怎样真正利用它,让我们的社会变得更好?这很令人兴奋。
Speaker 237:40 - 38:05
Totally. I remember reading somebody was excited about, you know, there was a small but breathtaking chance that the field of economics, as we know it, may actually become solved by simulation. And I'd extend that not just to be economics, but kind of everything that deals with human behavior and social sciences, which ultimately is everything around us. Truly. Wonderful.
Speaker 237:40 - 38:05
绝对是。我记得我读到过,有人对这样一种可能性感到兴奋:虽然概率很小,但也令人屏息——我们如今所理解的 economics(经济学)这个领域,或许真的可能被 simulation 解出来。我还想把这个范围从 economics 扩展出去,不只是经济学,而是几乎一切与 human behavior(人类行为)和 social sciences 有关的东西,而归根结底,那其实就是我们身边的一切。确实如此。太精彩了。
Speaker 238:05 - 38:12
Thank you so much for joining today and sharing the story of both Smallville and what you're now up to at Simile. I really enjoyed the conversation.
Speaker 238:05 - 38:12
非常感谢你今天加入我们,分享了 Smallville 的故事,以及你现在在 Simile 正在做的事情。我真的很享受这次对话。
Speaker 138:12 - 38:13
Same here. Thank you for having me.
Speaker 138:12 - 38:13
我也是。谢谢你邀请我。
原文 ↗https://www.youtube.com/watch?v=lfhFmwcESRw