Genuinely impressive release by Google today (remember when they were behind?)

Gemini 3.5 Flash perf:
* Building on prior strengths (83.6% on MMMU-Pro for multimodal)
* Big jump on agentic coding (76.2% on Terminal-Bench, and 56.5% on Toolathon for real-world tasks)
* Progress on expert tasks (57.9% on Finance Agent 2... we are cooked)
* Leading scores across SWE-Bench, OSWorld etc. (also, elegant to bold the top scores in the chart below even when it's not Google leading)

Ofc, just benchmarks, and also not cheap (~$9/M output), but Google is cookin'... we are all so spoiled to have the 3 labs compete
Breaking: Anthropic attains sainthood, officially anointed by AI Jesus