BuildSpeak: daily builder digest
🐦 X · Post · Aaron Levie @levie · June 11, 2026 · 429 words · about 2 min


Lots of evidence of huge jumps in capability for Fable across coding (and related) tasks. It's also a major jump in accuracy and success in complex knowledge work tasks. In our Box AI Complex Work Eval, we tested the model against Opus 4.8 and saw huge boosts across almost every industry.

For our eval, we give the Box AI Agent, using Fable, a set of hard real-world knowledge work problems that deal with enterprise documents, then score how the agent performs the tasks.

The main differentiators for Fable vs Opus 4.8 are that it doesn't take shortcuts on complex reasoning, it gets multi-step calculations right, and it's significantly more consistent across runs.

We saw the biggest leaps in Media & Entertainment (78% vs 61%), Technology (81% vs 73%), Financial Services (89% vs 83%), and Healthcare (66% vs 60%).

Here are some specific examples:

* Legal M&A due diligence: On a task reviewing NDA terms against a semiconductor company's contracting policy, Fable correctly identified that a joint-ownership clause violates exclusivity requirements while a liability cap is permitted under a Super Cap exception. Fable scored 100% vs Opus's 78%.
* Healthcare: On a clinical radiology error audit across 12 reports, Fable precisely categorized each error by severity grade and correctly concluded no Grade 3 errors existed. Opus prematurely escalated a case to "major error requiring immediate departmental review" when the evidence didn't support it. Fable scored 63% vs Opus's 41%.
* Media & Entertainment: On a genre profitability projection task, Fable correctly recognized that a 20% Argentine tax deduction was already embedded in the source spreadsheet figures and didn't double-apply it. Opus applied it again on top, a compounding error across 4 genre calculations that took its score negative on the task vs Fable's 74%.
* Retail analytics: On a task analyzing high-growth product articles against an investment benchmark, Fable correctly computed each article's growth rate individually and identified that only 2 of 5 exceeded the threshold. Opus confused "high growth relative to average" with "above the benchmark", scoring 61% vs Fable's 94%.
* Financial Services: On a 5-year debt facility projection, Fable correctly applied interest to opening balances and used the right capex figure. Opus applied interest to the total facility amount and computed tax from the wrong base: two compounding errors. Fable scored 83% vs Opus's 62%.
* Technology: On a SaaS feature valuation requiring computation of a Feature Value Index across multiple regions, Fable applied the formula correctly and got exact values for the markets. Opus got the arithmetic wrong on multiple criteria. Fable scored 100% vs Opus's 74%.

Overall, a huge step change in complex analysis, work that requires analytical reasoning, and deep domain understanding. Fable will be available shortly in the Box AI Studio for customers to build agents with.
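
Two of the failure modes in the examples above are plain arithmetic, so a rough sketch may help make them concrete. The Python below is an illustration only: every function name and every figure (facility size, drawn balance, interest rate, repayment, gross amount) is hypothetical and not taken from the Box eval. It contrasts charging interest on each year's opening balance with charging it on the total facility, and applying a 20% deduction once with applying it a second time to figures that are already net.

```python
from __future__ import annotations


def net_after_deduction(gross: float, rate: float) -> float:
    """Apply a tax deduction once to a gross figure."""
    return gross * (1.0 - rate)


def interest_on_opening_balances(
    opening_balance: float, rate: float, annual_repayment: float, years: int
) -> list[float]:
    """Yearly interest charged on each year's opening balance."""
    interest = []
    balance = opening_balance
    for _ in range(years):
        interest.append(balance * rate)
        # Assumed repayment schedule: a flat amount per year, floored at zero.
        balance = max(balance - annual_repayment, 0.0)
    return interest


def interest_on_total_facility(facility: float, rate: float, years: int) -> list[float]:
    """The mistaken variant: interest on the full facility amount every year."""
    return [facility * rate] * years


if __name__ == "__main__":
    # Hypothetical inputs: 1000 facility, 600 drawn, 5% rate, 100 repaid per year.
    correct = interest_on_opening_balances(600.0, 0.05, 100.0, 5)
    wrong = interest_on_total_facility(1000.0, 0.05, 5)
    print("interest on opening balances:", round(sum(correct), 2))
    print("interest on total facility:  ", round(sum(wrong), 2))

    # Hypothetical spreadsheet value that is already net of a 20% deduction.
    already_net = net_after_deduction(100.0, 0.20)
    double_applied = net_after_deduction(already_net, 0.20)
    print("net figure as given:         ", round(already_net, 2))
    print("after deducting a second time:", round(double_applied, 2))
```

With these made-up inputs the mistaken interest base overstates five-year interest by a factor of 2.5, and the repeated deduction understates each figure by a further 20%, which is how a single wrong assumption can propagate through every downstream line of a projection.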
Original post: https://x.com/levie