🐦 X · 动态Madhu Guru @realmadhuguru· 2026 年 6 月 7 日· 104 词 · 约 1 分钟

Madhu Guru · @realmadhuguru

SPACE 播放 / 暂停 上一句 / 下一句
A common misconception is that training data is low skill, grunt work - scan some notebooks, mine the internet, create labeled samples. The data required to advance the model frontier is the opposite. Labs need training data for high-economic-value tasks. And most of these tasks outside of SWE have little documentation - it is complex, domain-specific knowledge built over the years, spanning legacy tools that dont talk to each other. That's why we have SWE agents and not knowledge work agents yet. The companies creating this training data, such as Mercor, are doing extremely high-leverage, high-skill work. Critical to moving AI forward. And deeply underappreciated.
一个常见的误解是,training data(训练数据)是低技能的苦力活——扫一些笔记本、在互联网上挖数据、制作带标签的样本。推进 model frontier(模型前沿)所需要的数据恰恰相反。实验室需要的是用于高经济价值任务的训练数据。而这些任务中,除了 SWE 之外的大多数几乎都缺乏文档——它们是多年积累形成的复杂、强领域特定的知识,横跨彼此无法互通的 legacy tools(遗留工具)。这就是为什么我们现在有 SWE agents(软件工程 agent),却还没有 knowledge work agents(知识工作 agent)。像 Mercor 这样创建这类训练数据的公司,做的是杠杆效应极高、技能要求极高的工作。这对推动 AI 前进至关重要,也长期被严重低估。
原文 ↗https://x.com/realmadhuguru