About time the leading database company born in Singapore actually had Singapore investors take it seriously!
Finally! The first eval ship from Cog!!!!!!!!!! 👼🏼

To contextualize: @METR_Evals cap out at ~16 hours. Cog has private enterprise evals up to 100 hrs, and is confident enough to put a financial guarantee on it 🤯

METR dataset: ML eng, GPU kernels, cybersecurity
> "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth".

rlog of 0.83

Cog dataset: real-life Java/TypeScript/Python/C# feature dev, bugfixes, migrations
> "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers."

rlog of 0.74 on the held-out set

This is pioneering real-world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. Huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last-mile data collection!!