BuildSpeak: daily builder digest

Andrej Karpathy

@karpathy ↗

I like to train large deep neural nets. Previously Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford.

1 latest · 4 total · 3 issues
May 20, 2026 · 1 post →

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

♥ 131.9K · ↻ 10.0K · 💬 7.2K · 5/19 · 15:05 · x.com ↗
May 12, 2026 · 1 post →

This works really well btw: at the end of your query, ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs but vision (images/animations/video) is the preferred output from them. Roughly a third of our brains is a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into the brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral

There are also improvements necessary and pending at the input. Neither audio nor text nor video alone is enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR: The input/output mind meld between humans and AIs is ongoing and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. For what's worth exploring at the current stage, hot tip: try asking for HTML.

♥ 13.0K · ↻ 1.3K · 💬 662 · 5/11 · 16:20 · x.com ↗
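
The tip in the post above (end the query with a request for HTML, then open the generated file in a browser) can be sketched in a few lines of Python. This is only a minimal illustration, not the author's workflow: `ask_llm` is a hypothetical callable standing in for whatever LLM client you use, and the helper names (`with_html_request`, `save_html`, `ask_and_view`) are made up for this example. Only the standard library is used.

```python
import tempfile
import webbrowser
from pathlib import Path
from typing import Callable

# The instruction quoted in the post, appended to the end of the query.
HTML_SUFFIX = "Structure your response as HTML."


def with_html_request(query: str) -> str:
    """Return the query with the HTML instruction appended at the end."""
    return f"{query.rstrip()}\n\n{HTML_SUFFIX}"


def save_html(response: str, path: Path) -> Path:
    """Write the model's response to a file and return the path."""
    path.write_text(response, encoding="utf-8")
    return path


def ask_and_view(
    query: str,
    ask_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns text
    open_browser: bool = True,
) -> Path:
    """Send the query, save the reply as response.html, optionally open it."""
    response = ask_llm(with_html_request(query))
    out = save_html(response, Path(tempfile.mkdtemp()) / "response.html")
    if open_browser:
        # as_uri() needs an absolute path; mkdtemp() returns one.
        webbrowser.open(out.as_uri())
    return out
```

To try it, pass any function that maps a prompt string to a response string as `ask_llm`. A real model may wrap its HTML in a markdown code fence, in which case the fence would need stripping before saving; that step is left out here.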
May 1, 2026 · 2 posts →

This is the quote I've been citing a lot recently.

♥ 31.4K · ↻ 2.8K · 💬 488 · 4/30 · 17:43 · x.com ↗

Fireside chat at Sequoia Ascent 2026 from a ~week ago. Some highlights:

The first theme I tried to push on is that LLMs are about a lot more than just speeding up what existed before (e.g. coding). Three examples of new horizons:

1. menugen: an app that can be fully engulfed by LLMs, with no classical code needed: input an image, output an image and an LLM can natively do the thing.
2. install .md skills instead of install .sh scripts. Why create a complex Software 1.0 bash script for e.g. installing a piece of software if you can write the installation out in words and say "just show this to your LLM"? The LLM is an advanced interpreter of English and can intelligently target installation to your setup, debug everything inline, etc.
3. LLM knowledge bases as an example of something that was *impossible* with classical code because it's computation over unstructured data (knowledge) from arbitrary sources and in arbitrary formats, including simply text articles etc.

I pushed on these because in every new paradigm change, the obvious things are always in the realm of speeding up or somehow improving what existed, but here we have examples of functionality that either suddenly perhaps shouldn't even exist (1,2), or was fundamentally not possible before (3).

The second (ongoing) theme is trying to explain the pattern of jaggedness in LLMs: how it can be true that a single artifact will simultaneously 1) coherently refactor a 100,000-line code base *and* 2) tell you to walk to the car wash to wash your car. I previously wrote about the source of this as having to do with verifiability of a domain; here I expand on this as having to also do with economics, because revenue/TAM dictates what the frontier labs choose to package into training data distributions during RL. You're either in the data distribution (on the rails of the RL circuits) and flying or you're off-roading in the jungle with a machete, in relative terms. Still not 100% satisfied with this, but it's an ongoing struggle to build an accurate model of LLM capabilities if you wish to practically take advantage of their power while avoiding their pitfalls, which brings me to...

Last theme is the agent-native economy: the decomposition of products and services into sensors, actuators and logic (split up across all of 1.0/2.0/3.0 computing paradigms), how we can make information maximally legible to LLMs, some words on the quickly emerging agentic engineering and its skill set, related hiring practices, etc., possibly even hints/dreams of fully neural computing handling the vast majority of computation with some help from (classical) CPU coprocessors.

♥ 3.4K · ↻ 411 · 💬 144 · 4/30 · 17:28 · x.com ↗
BuildSpeak · About this project · BUILT IN PUBLIC · Follow builders, not influencers