🐦 X · Post · Andrej Karpathy @karpathy · May 11, 2026 · 308 words · about 2 min

Andrej Karpathy · @karpathy


This works really well btw: at the end of your query, ask your LLM to "structure your response as HTML", then view the generated file in your browser. I've also had some success asking the LLM to present its output as slideshows, etc.

More generally, imo audio is the human-preferred input to AIs, but vision (images/animations/video) is the preferred output from them. Roughly a third of our brains is a massively parallel processor dedicated to vision; it is the 10-lane superhighway of information into the brain. As AI improves, I think we'll see a progression that takes advantage:

1) raw text (hard/effortful to read)
2) markdown (bold, italic, headings, tables, a bit easier on the eyes) <-- current default
3) HTML (still procedural with underlying code, but a lot more flexibility on the graphics, layout, even interactivity) <-- early but forming new good default
...4,5,6,...
n) interactive neural videos/simulations

Imo the extrapolation (though the technology doesn't exist just yet) ends in some kind of interactive videos generated directly by a diffusion neural net. Many open questions as to how exact/procedural "Software 1.0" artifacts (e.g. interactive simulations) may be woven together with neural artifacts (diffusion grids), but generally something in the direction of the recently viral

There are also improvements necessary and pending at the input. Neither audio nor text nor video alone is enough, e.g. I feel a need to point/gesture to things on the screen, similar to all the things you would do with a person physically next to you and your computer screen.

TLDR: The input/output mind meld between humans and AIs is ongoing, and there is a lot of work to do and significant progress to be made, way before jumping all the way into neuralink-esque BCIs and all that. As for what's worth exploring at the current stage, hot tip: try asking for HTML.
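A minimal sketch of the HTML tip in Python, assuming you already have the model's reply as a string (no particular LLM API is called here). The helper names `with_html_request`, `extract_html`, and `save_html`, the stand-in reply, and the fence-stripping step are illustrative assumptions, not something from the post.

```python
import tempfile
import webbrowser
from pathlib import Path

# The instruction quoted in the post, appended at the end of the query.
HTML_SUFFIX = "Structure your response as HTML."

# Built programmatically so this example contains no literal code fence.
FENCE = "`" * 3


def with_html_request(query: str) -> str:
    """Return the query with the HTML instruction appended at the end."""
    return f"{query.rstrip()}\n\n{HTML_SUFFIX}"


def extract_html(response: str) -> str:
    """Strip a surrounding code fence, if the reply has one.

    Assumption: some models wrap HTML in a fenced block; others return it bare.
    """
    text = response.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    return text


def save_html(response: str, path: Path) -> Path:
    """Write the HTML part of the reply to `path` and return the path."""
    path.write_text(extract_html(response), encoding="utf-8")
    return path


if __name__ == "__main__":
    prompt = with_html_request("Compare TCP and UDP.")
    print(prompt)  # send this to whatever LLM you use

    # Hypothetical stand-in for the model's reply.
    reply = "<!doctype html><html><body><h1>TCP vs UDP</h1></body></html>"

    out = save_html(reply, Path(tempfile.mkdtemp()) / "response.html")
    webbrowser.open(out.resolve().as_uri())  # view the file in the default browser
```

Running the script prints the augmented prompt, writes the stand-in reply to a temporary `response.html`, and opens it in the default browser. Swapping the stand-in for a real reply is the only change needed to try the tip end to end.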


♥ 13.0K · ↻ 1.3K · 💬 662 · x.com ↗

Original ↗ https://x.com/karpathy
