i don't think anyone is correctly doing the math around how SpaceX, the NeoCloud+NeoLab, is currently going to market? SpaceX has already recouped about HALF its investment in Cursor, in compute deals. The other half is paid for if Composer 3 does well. No other company is simultaneously a leading model lab + neocloud (at least where GPUs are concerned). it's a crazy effective combo iff you've adequately planned out gpu supply for both cases: inhouse training 1) goes very well 2) doesn't go very well
我觉得现在还没人真正把 SpaceX,也就是这个 NeoCloud+NeoLab 的市场打法,相关的账算明白吧?SpaceX 光靠 compute(算力)交易,已经收回了它在 Cursor 上大约一半的投资。另一半如果 Composer 3 表现不错,也就能回本了。没有其他公司能同时既是领先的 model lab(模型实验室),又是 neocloud(至少在 GPUs 这件事上如此)。如果你已经充分规划好了 GPU 供应,那么这会是个疯狂高效的组合——无论 inhouse training(内部训练)1)进展非常顺利,还是 2)进展没那么顺利。
btw i've been shopping around for insurers for the New Media Lab we are setting up (basically the creative playground housing swyx inc) and yeah the NPS of Corgi is insanely high. my real estate broker: "just go with corgi they are covering every single one of my clients rn". breaking through with ~100% greenfield market share like this is unheard of in the insurance industry
顺便说一句,我最近一直在给我们正在筹建的 New Media Lab 物色保险公司(基本上就是容纳 swyx inc 的创意游乐场),然后,嗯,Corgi 的 NPS(净推荐值)高得离谱。我的房地产经纪人说:“直接选 corgi 吧,他们现在给我的每一位客户都在承保。” 像这样以接近 100% 的 greenfield market share(全新市场份额)实现突破,在保险行业简直闻所未闻。
@QuinnyPig i think this is where i challenge @willccbb to his first poaster session. or maybe @willdepue. or @WilliamBryk. idk all the wills?
@QuinnyPig 我觉得这就是我该向 @willccbb 发起他第一次 poaster session 挑战的时候了。或者也许是 @willdepue。或者 @WilliamBryk。也不知道,所有叫 Will 的人?
@aiDotEngineer @brendanhunting @TedLasso @USMNT @philipkiely this photo looks fucking chatgpt generated
@aiDotEngineer @brendanhunting @TedLasso @USMNT @philipkiely 这张照片看起来他妈的像是 ChatGPT 生成的
btw this is what happens on July 4 if team usa wins this game Wednesday after next
顺便说一句,如果 Team USA 赢下下下个星期三的这场比赛,那么 7 月 4 日就会发生这个
@aiDotEngineer @brendanhunting @TedLasso @USMNT @philipkiely this was the key thing to figure out. btw @GeminiApp is a VERY good sports handicapper (thanks @OfficialLoganK ). need to draw from a lot of sources to do this
@aiDotEngineer @brendanhunting @TedLasso @USMNT @philipkiely 顺便说一句,这才是需要搞清楚的关键一点:@GeminiApp 是个非常厉害的体育 handicapper(谢谢 @OfficialLoganK)。要做到这一点,需要从很多来源取材
10 years ago, you will be asked by @bendhalpern and @jessleenyc to write your first blog on @thepracticaldev. it is very important that you answer. *now @MLHacks, who are producing the first ever physical daily newspaper at @aidotengineer WF
10 年前,@bendhalpern 和 @jessleenyc 会请你在 @thepracticaldev 上写你的第一篇 blog,这一点非常重要,你一定要回应。*现在 @MLHacks 正在 @aidotengineer WF 制作有史以来第一份实体的日报纸
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr @zoink @Polymarket @aiDotEngineer /goaaaaaaaaaaal
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr @zoink @Polymarket @aiDotEngineer /goaaaaaaaaaaal
+55% in one day. i should start a fund (dm if you would actually help me run one, i have no idea how to run one)
一天就涨了 +55%。我该开始做个 fund 了(如果你真的愿意帮我一起运营,dm 我;我完全不知道 fund 该怎么运作)
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr whoa i had no idea who i was talking to lmao
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr 哇,我之前完全不知道我是在跟谁说话,笑死了
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr best review from a cancer survivor x tech realist
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr 来自一位癌症幸存者兼科技现实派的最佳评价
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr another SUPER fun highlight of my evening was telling @zoink how we are using @polymarket prediction markets to gauge the implied value of our july 1 @aiDotEngineer world cup suite being a team USA game
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr 我今晚另一个超级有趣的高光时刻,是告诉 @zoink 我们如何使用 @polymarket prediction markets(预测市场)来衡量我们 7 月 1 日 @aiDotEngineer world cup suite 作为 team USA 比赛的隐含价值
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr paper
@midjourney @Scobleizer @bryan_johnson @DavidSHolz @iScienceLuvr 论文
@WorkOS @mattpocockuk @zackproser also doing great. AIE audience LOVES the workos content, they are doing something right
@WorkOS @mattpocockuk @zackproser 也做得很棒。AIE 的受众非常喜欢 WorkOS 的内容,他们肯定是做对了什么
@TomasReimers @cursor_ai *github competitor fml
@TomasReimers @cursor_ai *GitHub 竞争对手,fml
havent seen many people outside anthropic ultracode yet. this thing is scarily good at burning tokens but you need to set up your repo to parallelize properly to make use of the fanout that i think subagents are best at. basically the idea is "subroutines but intelligent". when you understand just how much knowledge work is just yakshaves after yakshaves that require some judgment and intelligence, you start to appreciate that dynamic workflows are not just for coding tasks...
还没看到很多 Anthropic 之外的人在用 ultracode。这个东西在烧 token(令牌)方面强得吓人,但你得把自己的 repo 配置好、正确并行化,才能利用我认为 subagents 最擅长的那种 fanout(扇出)能力。基本思路就是“subroutines(子程序),但有智能”。当你理解了有多少知识型工作其实只是一个接一个的 yak shave(为次要前置问题反复折腾),而这些事又确实需要一定判断力和智能时,你就会开始意识到,dynamic workflows(动态工作流)并不只是适用于编码任务……
Satya on loops as IP: > This is the first time we can create a real cognitive loop between people and digital systems. That is a mind-bender, because it changes how we even conceptualize work inside an enterprise. > This means the real opportunity is not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound. You can offload a task, or even a job, but you can never offload your learning > In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country. One where every organization can own the learning loop that encodes its institutional knowledge, compounding its human and token capital.
Satya 谈把 loops(循环)作为 IP(知识产权/核心资产):> 这是我们第一次能够在人和数字系统之间创造一个真正的 cognitive loop(认知循环)。这很颠覆认知,因为它改变了我们甚至如何去概念化企业内部的工作。> 这意味着,真正的机会不在于挑出最好的 model(模型),而在于在模型之上建立一个 learning loop(学习循环),让 human capital(人力资本)和 token capital(token 资本)产生复利。你可以外包一个任务,甚至一份工作,但你永远无法外包自己的学习。> 在我看来,我们的优先事项必须是构建一个 frontier ecosystem(前沿生态系统),而不只是一个 frontier model(前沿模型),这样价值才能广泛流向每一家公司、每一个行业和每一个国家。在这样的生态里,每个组织都能拥有那个编码其 institutional knowledge(机构知识)的 learning loop,并让其 human capital 和 token capital 持续复利增长。
Last chance to fill out the annual AI Engineering Survey this weekend and win great Vercel + Notion + AIE tix! link below. we had @devinai analyze the registered attendee list and output a live chart of the people coming to the conference. it ended up being the single best data driven storytelling i've ever seen on what kind of community we are gathering in two weeks. survey link here! no lurking, fill it out pls
这是你在这个周末填写年度 AI Engineering Survey 并赢取超棒的 Vercel + Notion + AIE tix 的最后机会!链接见下。我们让 @devinai 分析了已注册参会者名单,并输出了一张关于将来参加 conference 的人群的实时图表。结果证明,这成了我所见过最出色的一次 data-driven storytelling(数据驱动叙事):它展现了两周后我们正在汇聚成一个什么样的 community。survey 链接在这里!别潜水了,快去填 pls
how your email finds me (if youre waiting for a decision or reply pls dont take it personally im just in peak crunch mode for aie)
这就是你的邮件通常会在什么状态下被我看到(如果你在等一个决定或回复,pls 不要往心里去,我只是正处在 aie 的高压赶工期)
## The Future Codebase After the PR dies, after the Code Review dies, i am seriously wondering if Git needs to die next. roughly 20-40% of code spend is just managing and updating merge conflicts. necessary evil? or legacy "horseless carriage"? cargo culting the past? we don't do line by line merge conflicts when we collaborate with human colleagues - instead we chat, suggest edits, do side comments, and an owner ships it. btw we also don't do CI/CD even when collaborating on documents with serious legal/financial implications. maybe the future codebase looks more like a Notion or Linear database than .git objects. It will be less efficient, but more scalable. exactly the Salty Lesson.
## 未来的代码库 在 PR 消亡之后,在 Code Review 消亡之后,我很认真地在想,接下来是不是连 Git 也该消亡。大约 20–40% 的代码投入,只是花在管理和更新 merge conflict(合并冲突)上。这是必要之恶?还是一种遗留的“horseless carriage(无马马车)”?是在对过去进行 cargo cult(货物崇拜)式模仿吗?我们和人类同事协作时,并不会按行处理 merge conflict——相反,我们会聊天、提出修改建议、写侧边评论,然后由某个 owner(负责人)来发布。顺便说一句,即使是在协作处理具有重大法律/财务影响的文档时,我们也不会用 CI/CD。也许未来的代码库,看起来会更像 Notion 或 Linear 的数据库,而不是 .git objects。它的效率会更低,但可扩展性会更强。完全就是 The Salty Lesson。
neat thing about developer exception engineering is: the happy paths are all happy in their own way. the unhappy paths are ~universally the same.
developer exception engineering 的一个妙处在于:happy path(正常路径)各有各的顺法;而 unhappy path(异常路径)却几乎到处都一样。
## On Loopcraft One might argue the entire game of the next century is to be able to stack loops as effectively as possible. In the early days of each phase, it will be valuable to know when to go **DOWN** a loop when things go wrong (for reliability)… but it will probably be more valuable to know how to go **UP** a loop as models improve (for leverage). If you don’t figure out how to do this, don’t be salty when you lose to those that do.
## 关于 Loopcraft 有人可能会说,下个世纪的整个游戏,就是要尽可能高效地堆叠 loops(循环)。在每个阶段的早期,当事情出错时,知道什么时候沿着 loop **向下**走会很有价值(为了 reliability,可靠性)……但随着 models(模型)变得更强,知道如何沿着 loop **向上**走,大概会更有价值(为了 leverage,杠杆效应)。如果你搞不清楚该怎么做,那么当你输给那些会做的人时,也别酸。
the #1 thing that is driving me to build my own vibecoding platform rn is that none of them - and i love vercel, cloudflare, netlify etc - none of them really close the loop for you in terms of setting you on the right path with errors and pinging you when shit fails (shit always fails). there's way too much "webmaster" infra to set up for every single project and i just want to do it once and for all. instead i'm being asked to npx posthog wizard here and npx arize skills there and it all just needs to be swallowed up into One Thing.
眼下,驱动我去自己做一个 vibecoding 平台的头号原因是:现有这些平台——而且我真的很喜欢 vercel、cloudflare、netlify 等——没有一个能真正帮你把 loop(闭环)补上,也就是说,在报错时把你引到正确路径上,并且在东西挂掉时提醒你(东西总是会挂)。每一个项目都要搭太多“webmaster”式的 infra(基础设施),而我只想一劳永逸地把这事做一次;结果现在却还要我这里 npx posthog wizard、那里 npx arize skills,这一切都应该被吞并进 One Thing。
congrats to our friends @ona_hq on joining @openai! see their talk here for alpha on what’s next for Codex 👀
恭喜我们的朋友 @ona_hq 加入 @openai!想知道 Codex 接下来会怎样,可以看看他们这场 talk,里面有一些 alpha(一手消息)👀
btw insane amounts of alpha in telling claude code to "review my code for issues" on Fable rn while it is not pay per use. be prepared to be in abject horror that you shipped anything to prod without a Fable Check™ first
顺便说一句,现在在 Fable 上让 claude code “review my code for issues”(检查我的代码是否有问题)简直有海量 alpha(超额价值 / 有用信息),尤其是在它目前还不是按次付费的时候;你最好做好心理准备:一想到自己以前没先做一次 Fable Check™ 就把东西发到 prod(生产环境),可能会陷入彻底的惊恐。
for those keeping track at home it was 34 days between signing this deal and launching Mythos-class model GA to the world. building on @nvidia stack means you can just do things™.
给那些还在持续关注进展的人一个时间点:从签下这笔 deal(交易 / 合作)到把 Mythos-class model 正式以 GA(General Availability,正式可用)形式发布给全世界,中间只隔了 34 天。基于 @nvidia 的 stack(技术栈)来构建,意味着你真的可以“直接把事情做成”™。
more charts of other tiers where its less stark including the vibe shift chart from here
这里还有更多其他 tier(层级 / 档位)的图表,那里的差异没这么夸张,其中也包括这里这张关于 vibe shift(整体风向 / 氛围转变)的图表。
@METR_Evals previously on @cognition_labs
@METR_Evals,此前在 @cognition_labs
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents 1000+ hours of maintainer validated software engineering work most frontier models cannot yet solve, much less solve with high quality. Cog had IOI Gold medalists and top code maintainers Look At The Data. FrontierCode includes 3000+ rubrics covering code quality and anticheat reward hacking plaguing other benchmarks. FC Diamond is so hard that Opus 4.8 scores 13.8%. Three eras of AI coding : Three eras of benchmarks. 2021 • Autocomplete : HumanEval. 2023 • Passing Tests: SWEBench, TerminalBench. 2026 • Maintainable Code: FrontierCode. to me the most beautiful chart: when I requested a special historical run on all extant old models, the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025 - Opus almost doubled from a 41% pass rate to 74% in 4 months. This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, eg @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents" without fearing too much that things go off the rails. My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027.... The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
它终于发布了!!!@METR_Evals 发现,超过一半的 SWEBench 结果都是无法合并的 slop(低质产出)。FrontierCode 代表了 1000+ 小时经 maintainer(维护者)验证的软件工程工作,而大多数 frontier models(前沿模型)目前还无法解决这些工作,更不用说高质量地解决了。Cog 拥有 IOI Gold medalists 和顶级代码维护者——Look At The Data——FrontierCode 包含 3000+ 条 rubrics(评分细则),覆盖代码质量以及困扰其他 benchmarks(基准测试)的 anticheat reward hacking(反作弊奖励破解/刷分)问题。FC Diamond 难到连 Opus 4.8 的得分都只有 13.8%。AI coding 的三个时代:benchmarks(基准)的三个时代。2021 • Autocomplete:HumanEval。2023 • Passing Tests:SWEBench、TerminalBench。2026 • Maintainable Code:FrontierCode。对我来说,最漂亮的一张图是:当我要求对所有现存旧模型做一次特别的历史回跑时,数据显示 FC 任务中最简单的三分之一(在 FC Extended 中)在 2025 年末被迅速且突然地攻克了——Opus 在 4 个月内几乎翻倍,从 41% 的 pass rate(通过率)升到 74%。这描述了很多人——从 @dhh 到 @karpathy——都指出的那种“2025 年 12 月到底发生了什么”的 vibe shift(氛围转变):区别在于,你是在 2 次 rerolls(重试)内拿到 95% 成功率,还是要 6 次;这最终让 agentic coding(agent 驱动编程)向上进入下一层抽象成为可行之事,例如 @GeoffreyHuntley 的 ralph loops、@bcherny 的 /goals,或 @steipete 所说的“让你的 agents 接收提示并循环执行的 loops”,而不必太担心事情彻底失控。我的猜测是:随着 AI 从这里开始继续加速,FrontierCode 的每个 tier(层级)都会依次饱和,希望大约是每年一个。我已经让团队开始准备 FrontierCode 2027 了……旧日的高山将被摧毁,它们的碎石会变成 regolith(月壤/风化层);而从那 regolith 之中,下一片模型森林将会生长。生命循环。
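the "2 rerolls vs 6" claim can be sanity-checked with a few lines of arithmetic. this is only a sketch: it assumes every attempt is independent and succeeds at the quoted pass rate (my simplification, not something the FrontierCode data states), and the function names are made up for illustration.

```python
import math


def success_within(pass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries passes."""
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent tries that reaches `target` success.

    Assumes 0 < pass_rate < 1 and 0 < target < 1.
    """
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


if __name__ == "__main__":
    for rate in (0.41, 0.74):
        print(
            f"pass rate {rate:.0%}: "
            f"{attempts_needed(rate)} attempts for 95%, "
            f"{success_within(rate, 2):.1%} success within 2 attempts"
        )
```

under this toy model a 41% pass rate needs 6 attempts to clear 95%, while 74% gets to about 93% in 2 attempts and clears 95% on the third, so the quoted "2 vs 6" is roughly right as a rule of thumb.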
one popular theory is that research paper alpha* and lab publishing ~died when researchers realized that instead of fighting with marketing depts they could simply walk out the door and get >$100m for their legally protected tacit knowledge gained. california non-noncompetes have a bigger impact on knowledge spreading than github, arxiv, and huggingface combined. *btw this is a motivator for me to set up @aidotengineer as a product-centric industry conference to complement the paper-centric research conferences
一个很流行的理论是:当研究人员意识到,与其和 marketing depts(市场部门)周旋,不如直接走出公司大门,凭借他们依法受保护的 tacit knowledge(隐性知识)拿到超过 1 亿美元时,research paper alpha* 和 lab publishing(实验室论文发表)就大致“死掉”了。California 的 non-noncompetes(禁止竞业限制)对知识传播的影响,比 GitHub、arXiv 和 Hugging Face 加起来还要大。*顺便说一句,这也是我建立 @aidotengineer、把它做成一个以产品为中心的行业 conference(会议),用来补充那些以论文为中心的研究 conference(会议)的动机之一。
a smarter alternative to "always use plan mode": always frame your task as a question, so that the model is invited to push back and rate the quality of the idea/suggest alternatives, rather than blindly execute what you SAID to do (which is often not precisely what you MEANT). literally just appending "?" to the end of your prompt often does it
相比“总是使用 plan mode”的更聪明替代方案是:始终把你的任务表述成一个问题,这样模型就会被引导去提出异议、评估这个想法的质量并给出替代方案,而不是盲目照着你“说”要做的事去执行(而那往往并不精确等于你“本来想表达”的意思);很多时候,真的只要在你的 prompt 末尾加上一个“?”就行了
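if you wanted to bake the "?" trick into a prompt pipeline, a tiny helper is enough. this is a hypothetical sketch (the name `as_question` is mine, not any tool's API): it trims trailing periods or exclamation marks and appends a question mark unless one is already there.

```python
def as_question(task: str) -> str:
    """Turn an instruction into a question by making it end with '?'.

    Hypothetical helper for illustration: trailing whitespace, periods and
    exclamation marks are dropped before the '?' is appended.
    """
    cleaned = task.strip().rstrip(".!")
    return cleaned if cleaned.endswith("?") else cleaned + "?"


if __name__ == "__main__":
    print(as_question("move the session cache into redis"))
    print(as_question("should we shard this table?"))
```

the point is only the framing: the task text itself is untouched, so the model still sees exactly what you asked for, just phrased as something it is allowed to disagree with.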
i love being (for now) bdfl for aie because i can do cheeky shit like the AGI pills we did in london and also this
我很喜欢自己(至少现在)作为 aie 的 bdfl,因为这样我就能搞一些我们在 london 做过的 AGI pills 之类的皮操作,还有这个也是
@aiDotEngineer lmao designer vincent back at it again with the frontier capability tests
@aiDotEngineer 哈哈哈,designer vincent 又回来整 frontier capability tests 了
about time that the leading database company born in Singapore actually had Singapore investors take it seriously!
这家诞生于 Singapore 的头部数据库公司,也差不多该让 Singapore 的投资者认真对待它了!
Finally! the first eval ship from cog!!!!!!!!!! 👼🏼 To contextualize: @METR_Evals cap out at ~16 hours. Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯 METR dataset: ML eng, GPU kernels, cybersecurity > "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". rlog of 0.83 Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations > "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." rlog of 0.74 on held out set this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!
终于!来自 cog 的第一份 eval(评测)发布了!!!!!!!!👼🏼 给一点背景:@METR_Evals 的上限大约是 16 小时。Cog 有私有的 enterprise(企业)eval,最长可到 100 小时,而且他们甚至有信心为此附上 financial guarantee(财务担保)🤯 METR 数据集:ML eng、GPU kernels、cybersecurity > “METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth”. rlog 为 0.83。Cog 数据集:真实世界中的 java/typescript/python/c# 功能开发、bug 修复、迁移 > “We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers.” 在留出测试集(held out set)上的 rlog 为 0.74。这是开创性的真实世界 eval 工作,也是更大范围前沿代码 eval 发布的第 1 部分,我非常期待把它详细写出来。非常感谢 @annarmitchell 和 @ryanbai1412 牵头完成了这些并不光鲜、但至关重要的最后一公里数据收集工作!!
@HamiltonMusical the most viral thing i have ever done and its a bootleg hamilton sitzprobe not anything ai related
@HamiltonMusical 是我做过传播最火的东西,而且那还是一段盗录的 Hamilton sitzprobe,根本不是什么和 AI 有关的内容
@jgreze will speak on this at gathering all the top agent labs. lfg
@jgreze 会在 gathering 上谈这个,届时所有顶尖的 agent(智能体)实验室都会到场。lfg
@saranormous codex is agi man oneshotted this, no notes
@saranormous,codex 简直是 AGI,兄弟一发就把这事搞定了,没啥可说的
probably the best reward function for reasoning efficiency i've seen
这可能是我见过用于推理效率的最好的 reward function(奖励函数)
@soumithchintala @pewdiepie @opencode
@soumithchintala @pewdiepie @opencode
just a small zoom out on the vibe shift: in Feb 2025 @soumithchintala was talking about his dream of personal, local, private agents, most people didn't believe him. it's June 2026 and @pewdiepie has just released his vibecoded @opencode wrapper that is a complete personal AI productivity suite including email, docs, and calendar. top of HN, easily >1m views, >10k stars in a day. if your Knowledge Work Agents startup can't beat pewdiepie you might as well pack up and go home at this point, his is the benchmark for what you can DIY.
对这种氛围转变(vibe shift)稍微拉远一点看:在 2025 年 2 月,@soumithchintala 还在谈论他关于个人、本地、私有 agent 的梦想,当时大多数人并不相信他。现在是 2026 年 6 月,@pewdiepie 刚刚发布了他用 vibecoded 方式做出来的 @opencode wrapper,已经是一整套完整的个人 AI 生产力套件,包含 email、docs 和 calendar。登上 HN 榜首,轻松超过 100 万浏览,一天内超过 1 万 stars。要是你的 Knowledge Work Agents 创业公司连 pewdiepie 都打不过,那这时候基本可以收摊回家了——他做出来的东西,就是现在你自己动手(DIY)能达到什么水平的 benchmark(基准)。
every evals/analytics startup is going through a onetime generational upgrade into a continual learning platform in 2026 many will fail but as always the tasteful ones win
到了 2026 年,每一家做 evals/analytics 的创业公司,都在经历一次一代人只会发生一次的升级:从原来的形态转向 continual learning platform(持续学习平台);很多都会失败,但一如既往,真正有品位的那些会赢。
last 4 days to submit talks: this is our first year featuring PREPRINT poster sessions for research papers as well - as @bclavie pointed out we need a separate process for this but you can submit here for now!
距离提交演讲 proposal 还剩最后 4 天:这是我们第一年也为 research papers 设立 PREPRINT 海报 session——正如 @bclavie 指出的那样,我们确实需要为此单独设置一个流程,但你现在仍然可以先在这里提交!
btw we did a bake off of Exa vs competitors and it took all of 1.5 hrs for the team to unanimously converge on exa lol. so proud to see my former landlords crush it - time travel back to last year and listen to a pre pmf @WilliamBryk to understand how to spot companies on a generational tear
顺便说一句,我们把 Exa 和竞品做了一轮 bake-off(对比测试),整个团队只花了 1.5 小时就一致认定 Exa 胜出,lol。很自豪看到我以前的房东们大杀四方——把时间倒回去年,去听听 pre pmf(达到 product-market fit 之前)的 @WilliamBryk,你就会明白该如何识别那种正在经历代际级爆发的公司
very belated but in retrospect i think @sama's mythical "build a business that gets better when models get better" is basically what I called Agent Labs here. seeing a very direct correlation with model performance and agent lab revenue, discontinuity in Q4 2025 (clip from @patrickc's stripe sessions)
虽然现在说已经晚了很多,但事后看,我觉得 @sama 那句近乎神话般的话——“build a business that gets better when models get better”——基本上就是我在这里所说的 Agent Labs。我看到 model(模型)性能和 agent lab 营收之间存在非常直接的相关性,并且在 2025 年 Q4 出现了不连续跃迁(摘自 @patrickc 的 stripe sessions)
there's 4 parts to this AI SDLC 1. have ~50 tests in place, with instructions to add more, including "make a memory that whenever you do browser e2e tests, use computer vision to visually spot check design and ux issues as well on mobile/desktop/ipad/ultrawide resolutions" 2. "/plan break up & edit hot paths so you isolate files for easier editing and reading. add proper logging and error boundaries/handling while you do it. what else should we refactor for maintainability/performance/ai editing?" 3. (with plan) "you can break backward compatibility. first map out all the remaining work, then proceed on this next slice, do not stop until all work is done, periodically stop to commit, deploy and test but do not stop until all work is done" 4. [periodically spot check deployed functionality and /steer bugs as it goes along]
这个 AI SDLC(软件开发生命周期)有 4 个部分:1. 先准备好大约 50 个测试,并附上继续添加更多测试的说明,包括“建立一条 memory(记忆):每当你做 browser e2e tests(浏览器端到端测试)时,也要用 computer vision(计算机视觉)从视觉上抽查 design 和 ux 问题,并覆盖 mobile/desktop/ipad/ultrawide 分辨率”;2. “用 /plan 拆分并编辑 hot paths(关键路径),这样你就能隔离文件,便于编辑和阅读。在这个过程中补上合适的 logging(日志记录)以及 error boundaries/handling(错误边界/错误处理)。另外,为了 maintainability/performance/ai editing(可维护性/性能/AI 编辑),我们还应该重构什么?”;3. (基于 plan)“你可以破坏 backward compatibility(向后兼容性)。先把所有剩余工作都梳理出来,然后继续推进下一块,不要停,直到所有工作都完成;期间可以定期停下来 commit、deploy 和 test,但在所有工作完成之前不要停”;4. [定期抽查已部署的功能,并在推进过程中用 /steer 处理 bugs]
this seems quite doable in the space of a single 2-3 hour workshop — any brave soul want to try to livecode this for people as a learning exercise?
这件事看起来相当可行,完全可以放在一场 2–3 小时的 workshop(工作坊)里完成——有没有勇敢的人愿意把它作为学习练习,现场 livecode(直播编码)给大家看?
@gabrielchua the agentic excel thing is basically what u get when u expand the side panel to be the main thing
@gabrielchua,所谓那个 agentic excel 的东西,基本上就是把侧边栏扩展成主要界面后你会得到的样子
some of us doing kaya toast breakfast here at 11am if u are still around
我们这里有些人打算上午 11 点去吃 kaya toast 早餐,如果你还在附近的话
@Gavriel_Cohen i have to say his social media team is better than mine wtf. first pull on youtube.
@Gavriel_Cohen 我得说,他的社交媒体团队比我的强,wtf。第一次在 youtube 上发力。
gotta say Codex is completely unrecognizable from 3 months ago. guys went extreme founder mode on this thing @gabrielchua was demoing this and i was like “you guys have agentic excel on mac”
得说一句,Codex 跟 3 个月前比已经完全认不出来了。你们这帮人真是对这东西开启了极致 founder mode(创始人模式)。@gabrielchua 当时在演示这个,我心里就在想:“你们这是把 agentic excel(具备 agent 能力的 Excel)做到了 Mac 上啊。”
@Gavriel_Cohen @thsottiaux head of AI Govtech at Singapore estimates 1.3 billion agents in the country in the next 2 years and is building a national MCP gateway @dsp_
@Gavriel_Cohen @thsottiaux 新加坡 AI Govtech(政府科技中的 AI)负责人估计,未来 2 年这个国家里会有 13 亿个 agents(智能体),并且正在打造一个国家级的 MCP gateway。@dsp_
@Gavriel_Cohen and @thsottiaux casually dropping some hints on the Codex roadmap in his keynote!
@Gavriel_Cohen 和 @thsottiaux 在他的 keynote(主题演讲)里,很随意地放出了一些关于 Codex roadmap(路线图)的暗示!
Apparently at @AIEMiami geoff complained about @SAPConcur being dead software and a SAP guy was in the audience and invited him to SAP to advise on how to do AI transformation for 6800 employees TLDR he made fun of SAP, and SAP… concurred
显然,在 @AIEMiami 上,geoff 抱怨说 @SAPConcur 是死掉的软件,而现场观众里刚好有个 SAP 的人,于是邀请他去 SAP,为 6800 名员工提供关于如何做 AI 转型的建议。TLDR(太长不看版):他拿 SAP 开了个玩笑,而 SAP…… concurred(也“同意”了;双关 SAPConcur)。
Blogs die when they come from "the ____ team" instead of named individuals. With great ownership comes great accountability
当 Blog 不是出自有名有姓的个人,而是来自“某某 team”时,它们就开始走向死亡。越有 ownership(主人翁意识/责任归属),就越有 accountability(问责)。
I believe the kids call this "@thinkymachines just brutally framemogged gdm and oai". basically everyone's definition of "realtime" just got a massive fricking upgrade
我觉得现在小孩会把这叫作“@thinkymachines 直接把 gdm 和 oai 在叙事框架上狠狠干翻了”。基本上,所有人对“realtime(实时)”的定义刚刚都被大幅他妈地升级了一遍
on build vs buy saas cc @levie for corrections
关于自建(build)还是购买(buy)SaaS,抄送 @levie 以便指正
@VivianBala we will finally show the world how it is done.
@VivianBala,我们终于要向全世界展示这件事该怎么做了。
OK I'VE BEEN SO EXCITED i could barely keep this a secret all week and it's finally official MY HOME COUNTRY'S MINISTER OF FOREIGN AFFAIRS (equiv to Secretary of State) IS A HUGE NANOCLAW FAN (check @VivianBala, that's really him, not an intern) AND WILL BE KEYNOTING @AIDOTENGINEER SINGAPORE (with NanoClaw creator @Gavriel_Cohen right after) NEXT WEEK Usecases like his are what I have been hoping to promote with the international AIE partnerships and @agrimsingh and @SherryYanJiang crushed it with this one. governments waking up to AI and joining @aiDotEngineer: UK: Chief AI Officer Singapore: Cabinet Minister who's next??
好吧,我真的兴奋坏了,整整一周几乎都憋不住这个秘密,现在终于正式官宣了——我祖国的 Minister of Foreign Affairs(相当于 Secretary of State)竟然是个超级铁杆的 NANOCLAW 粉丝(去看 @VivianBala,那确实是他本人,不是什么实习生),而且他下周还将担任 @AIDOTENGINEER SINGAPORE 的 keynote speaker(随后紧接着就是 NanoClaw 的创造者 @Gavriel_Cohen)。像他这样的 use case(使用案例),正是我一直希望通过国际 AIE 合作伙伴关系来推动的;而 @agrimsingh 和 @SherryYanJiang 这次真的把这件事做得太漂亮了。各国政府正在觉醒,开始拥抱 AI,并加入 @aiDotEngineer:UK:Chief AI Officer;Singapore:Cabinet Minister——下一个会是谁??
wondering if @embirico has numbers on what % of codex users use this mode and how much it has gone up over the last month. it's a decent proxy for alignment/agent adoption
想知道 @embirico 那边有没有数据:codex 用户中有多少百分比在使用这个 mode(模式),以及过去一个月这个比例增长了多少;这可以作为 alignment(对齐)/agent 采用情况的一个还不错的 proxy(代理指标)。
@nikitabier @business some good sourcing. seems potentially state level.
@nikitabier @business 消息来源不错。看起来可能涉及州政府层面。
@nikitabier @business bloomberg being suddenly interested in your take on developer experience and ai coding tools is the new "sexy singles in your area"
@nikitabier @business Bloomberg 突然对你关于 developer experience 和 AI coding tools 的看法感兴趣,简直就是新版“你所在地区的火辣单身人士”广告。
docusign !!? fuck docusign with a sharp stick
docusign !!? 去他妈的 docusign,拿根尖棍子狠狠干它
i was going to send him my loom showing him gustos bugs and loom loomed on me
我本来要把我的 loom 发给他,给他看 gusto 的 bugs,结果 loom 反倒把我给“loom”了
be aware of this kind of phishing. i was almost tricked. cc @nikitabier @business
小心这种 phishing(网络钓鱼)。我差点就上当了。cc @nikitabier @business
OAI: 850B valuation, ~30B ARR now. Ant: ~900B valuation, ~44B* ARR now. *revenue recognized differently; per Denise Dresser it's probably closer to 8-10B lower if using OAI methodology. chart reconstructed from wsj by me
OAI 现在的估值是 850B,当前 ARR(年度经常性收入)约为 30B;Ant 现在的估值约为 900B,当前 ARR 约为 44B*。*收入确认方式不同;根据 Denise Dresser 的说法,如果使用 OAI 的方法,可能实际上要低 8–10B 左右。图表由我根据 wsj 重建。
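a quick sketch of the valuation-to-ARR multiples those figures imply. it takes the rounded numbers above at face value and reads "8-10B lower" as Ant ARR of roughly 34-36B under OAI-style revenue recognition (my reading, not a reported figure). the helper name is made up for illustration.

```python
def arr_multiple(valuation_b: float, arr_b: float) -> float:
    """Valuation divided by annual recurring revenue, both in $B."""
    return valuation_b / arr_b


if __name__ == "__main__":
    oai = arr_multiple(850, 30)            # ~28.3x
    ant = arr_multiple(900, 44)            # ~20.5x as reported
    # Assumed adjustment: subtract 8-10B from Ant's ARR to mimic OAI's method.
    ant_adj_low = arr_multiple(900, 44 - 8)    # 25.0x
    ant_adj_high = arr_multiple(900, 44 - 10)  # ~26.5x

    print(f"OAI: {oai:.1f}x ARR")
    print(f"Ant (as reported): {ant:.1f}x ARR")
    print(f"Ant (OAI-style ARR): {ant_adj_low:.1f}x to {ant_adj_high:.1f}x ARR")
```

on these rough inputs the gap between the two multiples narrows from about 8 turns to about 2-3 once the revenue recognition difference is applied.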
see the talk version, out now thanks to @steveruizok
可以看演讲版本,现已发布,感谢 @steveruizok
this one is doing v well btw, if you want the popular vote filter on the firehose of all the things. @patrickdebois was one of the track keynotes i gave a "blank check" to, based on his sincere support since our very earliest days + when in europe we must feature the DevOps guy. he didnt disappoint!
顺便说一句,这个表现非常好;如果你想在所有内容的 firehose(信息洪流)里用“popular vote”来做筛选的话。@patrickdebois 是我给了“blank check(全权信任)”的分会场 keynote 讲者之一,原因是他从我们最早期开始就给予了真诚支持;而且人在欧洲,我们就必须安排那位 DevOps 大神出场。他没有让人失望!
ok @deepfates @mada299 it took me 3 years to do my second short story but i did one
好吧,@deepfates @mada299,我花了 3 年时间才完成我的第二篇短篇故事,但我还是做出来了一篇。
Much respect to @tokengobbler who shut down Vibe-kanban live onstage at AIE Europe - still with 30,000 MAU, and still living on as an open source project. "Everyone who is making money is doing 2 things: selling to enterprise, and reselling tokens. We were doing neither." surprisingly not the first company to shutter at AIE but there's a lot to learn from the process and the software engineering retrospective from 2021-2025 will stick in my mind!
向 @tokengobbler 致以极大敬意——他在 AIE Europe 的台上现场关闭了 Vibe-kanban;它当时仍有 30,000 MAU(月活跃用户),并且仍以开源项目的形式延续着生命。“所有在赚钱的人都在做两件事:卖给 enterprise(企业客户),以及转售 token(代币)。而我们两件都没做。” 令人意外的是,这竟然不是第一家在 AIE 关闭的公司,但这个过程以及这份关于 2021-2025 年 software engineering(软件工程)的复盘,确实有很多值得学习的地方,也会一直留在我的脑海里。
i said on @jacobeffron's pod recently that "coding agents breaking containment" is the breakout theme of the year. i meant it - this is the year all knowledge workers, not just coders, get AGI-pilled. for the AIE EU closing note I gave a short talk on how we use agents to run @aidotengineer as a Tiny Team that now serves ~1m unique developers a month for free all around the world, for everything from CMS to renting lobster inflatables. yes I use @openclaw personally and as a team we use @cognition's Devin and @townai, but this isn't about any one agent; it's about all of them, and how you are probably not trying hard enough to use them for daily knowledge work. i hope this gives you agent productivity ideas for you and your team.
我最近在 @jacobeffron 的 pod 上说过,“coding agents(编程 agent)突破隔离边界”是今年最重要的爆发主题。我是认真的:今年将是所有 knowledge workers(知识工作者),而不只是 coders(程序员),都被 AGI-pilled 的一年。在 AIE EU 的 closing note(闭幕致辞)上,我做了一个简短分享,讲我们如何使用 agents(agent)把 @aidotengineer 作为一个 Tiny Team 来运营,而它现在每月为全球约 100 万 unique developers(独立开发者/唯一开发者用户)免费提供服务,应用场景从 CMS 到租赁 lobster inflatables 都有。是的,我个人会用 @openclaw,我们团队也会用 @cognition 的 Devin 和 @townai,但这并不是关于某一个 agent;而是关于所有 agent,以及你很可能还没有足够努力地把它们用于日常 knowledge work(知识工作)。我希望这能为你和你的团队带来一些关于 agent productivity(agent 生产力)的想法。
> be me > "the internet is polluted by ai slop, we need low-background tokens" > "wouldnt it be cool if we could time travel and see what our ancestors 100 years ago would say to us" > all the existing vintage models are like <4B > we need a chat tuned 13B vintage model > assemble avengers of ML incl the GPT-1/2 guy > need vintage tokens > train new vintage OCR model for old books, newspapers, periodicals, scientific journals, patents, and case law > need vintage RLHF but cant use chat > synthesize RLHF pairs from historical texts with regular structure eg etiquette manuals, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and poetry and fable collections, shove it into ChatML > train it > future knowledge still got in somehow > dammit.jpg > train new SOTA document-level n-gram-based anachronism classifier > meticulously curate hundreds of billions of pre-1931 tokens (public domain) > train it > ok! it checks out vs our FineWeb baseline! > release it > it's the most confidently racist model ever released by humankind > mfw
> 设想我是我自己 > “互联网已经被 ai slop(AI 垃圾内容)污染了,我们需要 low-background tokens(低背景噪声 token)” > “要是我们能穿越时间,看看 100 年前的祖先会对我们说什么,不是很酷吗” > 现有的所有 vintage models(复古模型)基本都小于 4B > 我们需要一个经过 chat tuned(聊天调优)的 13B vintage model > 集结 ML 的 avengers,包括那个 GPT-1/2 guy > 需要 vintage tokens > 为 old books、newspapers、periodicals、scientific journals、patents 和 case law 训练新的 vintage OCR model > 需要 vintage RLHF,但又不能用 chat > 从具有规则结构的 historical texts 中合成 RLHF pairs,比如 etiquette manuals、letter-writing manuals、cookbooks、dictionaries、encyclopedias,以及 poetry 和 fable collections,然后一股脑塞进 ChatML > 训练它 > 结果还是不知怎么混进了 future knowledge(未来知识) > dammit.jpg > 再训练一个新的、SOTA 的 document-level、基于 n-gram 的 anachronism classifier(时代错置分类器) > 精心整理出数千亿个 1931 年前的 token(public domain,公版) > 训练它 > 好!跟我们的 FineWeb baseline 对比,确实过关了! > 发布它 > 结果它成了人类有史以来发布过的最自信的 racist model(种族主义模型) > mfw
i havent done the work to compare it to peers but i'm just excited that we have a base model and honestly for all the people that complained about the death of the completions API (@deepfates ? or deepfates adjacent) not enough people are experimenting with weird usages and finetunes of the base models we DO get
我还没做足够的工作把它和同类模型比较,但我只是单纯很兴奋,因为我们现在有了一个 base model;老实说,那些曾经抱怨 completions API 死掉的人(@deepfates?或者跟 deepfates 一路的人)里,去实验我们现有这些 base models 的各种奇怪用法和 finetune(微调)的人,实在还不够多
ok we have a tiktok account now with some BTS
好的,我们现在有一个 TikTok 账号了,里面有一些 BTS 内容。
btw we are cooking something with @hhua_ (not final yet but keep calendar open after ICML in Seoul)
顺便说一句,我们正在和 @hhua_ 一起筹备点东西(还没最终敲定,但 ICML in Seoul 之后请把日程空出来)
wow another engineer on the “code is not cheap” train
哇,又有一位 engineer 加入了“code 并不便宜”这趟列车
fun to think about what the pm thinks vs what the engineer thinks in this scenario
想想在这种场景里 pm 会怎么想、engineer 又会怎么想,还挺有意思的
the Codex x @skybysoftware acquisition may have been one of the best @openai deals made in the last year. I've been waiting for "real" computer use since @romainhuet demoed the ChatGPT App with 4o Vision at AIEWF 2024... and only now it's really, actually rolling out in a usable fashion.
Codex x @skybysoftware 的收购,可能是过去一年里 @openai 做过的最好的交易之一。自从 @romainhuet 在 AIEWF 2024 用 4o Vision 演示 ChatGPT App 以来,我就一直在等“真正的” computer use(计算机使用)……直到现在,它才终于以一种真正可用的方式开始推出。
and @dexhorthy is quoting Z/L continuum in AIE Miami!! idea catching on @altryne
而且 @dexhorthy 还在 AIE Miami 引用了 Z/L continuum!!这个想法正在传播开来 @altryne