🐦 X · Feed · Swyx @swyx · June 8, 2026 · 261 words · about 1 minute

Swyx · @swyx


@METR_Evals previously on @cognition_labs



It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000+ rubrics covering code quality and anti-cheat checks for the reward hacking plaguing other benchmarks. FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding: Three eras of benchmarks
2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

To me the most beautiful chart: when I requested a special historical run on all extant old models, the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, e.g. @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents", without fearing too much that things go off the rails.

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
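The "2 rerolls vs 6" figure can be sanity-checked with a short sketch. It assumes each attempt is an independent trial that succeeds with the quoted pass rate, which the post does not state and which real agent runs only approximate. The function names `success_within` and `attempts_needed` are illustrative, not part of any FrontierCode tooling.

```python
import math


def success_within(pass_rate: float, attempts: int) -> float:
    """Probability of at least one pass in `attempts` independent tries."""
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent tries whose combined success reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


if __name__ == "__main__":
    # Pass rates quoted in the post for the easiest third of FC tasks.
    for pass_rate in (0.41, 0.74):
        print(
            f"pass rate {pass_rate:.0%}: "
            f"2 tries -> {success_within(pass_rate, 2):.1%}, "
            f"6 tries -> {success_within(pass_rate, 6):.1%}, "
            f"tries for 95% -> {attempts_needed(pass_rate)}"
        )
```

Under this model a 41% pass rate needs six attempts to clear 95%, while a 74% pass rate reaches about 93% in two attempts and clears 95% on the third, so the post's "2 vs 6" reads as a rounded version of the same arithmetic.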



Original: https://x.com/swyx
