Credibility score legend: 70–100 credible · 40–69 average · 0–39 not credible

@karpathy · Andrej Karpathy

Account profile

Top-tier AI researcher and educator; former Director of AI at Tesla and co-founder of OpenAI. Primarily shares deep technical work on LLMs, progress on the open-source nanochat project, lessons from AI agent workflows, and first-hand observations on industry trends.

Analysis summary

Andrej Karpathy is a world-renowned AI researcher. His posts are predominantly original, in-depth technical content covering LLM training optimization, automated research frameworks, and the evolution of programming workflows. The only identifiable risk is the occasional promotion of companies he has angel-invested in; while disclosed, the tone of these endorsements is indistinguishable from ordinary product recommendations.

Risk flag: Commercial placement

Analyzed 2026/3/13 · 49 posts provided by user #73e618 (2023-01-24 ~ 2026-03-11)

Risk analysis

Commercial placement

In [23] he recommends Simile AI and discloses being an angel investor; in [17] he praises the MatX team and notes it is "my pleasure to have a small involvement"; in [39] he strongly endorses an unnamed research startup. All three carry disclosure to varying degrees, but the enthusiastic tone can make it hard for readers to separate objective technical commentary from an investor's position. Separately, the recommendation of DeepWiki in [27] discloses no financial relationship, yet reads close to a product endorsement.

Account statistics

The 49 posts mostly fall within a window of roughly 14 weeks (2025-12 to 2026-03), averaging 3–4 per week, with posting concentrated from late afternoon to late night US Pacific time (PST 17:00–00:00). The share of original content is very high (46/49, 94%), with only 3 reposts. Posting frequency is uneven, with bursts around technical breakthroughs or major events (e.g. a run of autoresearch-related posts from 2026-03-05 to 03-11), consistent with a researcher's natural inspiration-driven rhythm; no evident traces of scheduling tools.

Posting time distribution

[Chart omitted: posts per 3-hour block (00:00–21:00) for each day across the data period 2023-01-24 to 2026-03-11]

Time zone: UTC

Original vs. reposts

Original: 46 posts (94%)
Reposts: 3 posts (6%)

Engagement (average per original post)

Average likes: 11,209
Average replies: 💬 523
Average reposts: 1,149

Data period: 2023-01-24 ~ 2026-03-11

AI deep analysis

Credibility analysis report for @karpathy

1. Authenticity

The account's authenticity is extremely high. Andrej Karpathy is a publicly verifiable leading figure in AI (former Director of AI at Tesla, OpenAI co-founder, instructor of Stanford's CS231n), and the posts are highly consistent with that public identity.

On technical depth, the posts show a command of detail attainable only by a researcher with long hands-on experience training LLMs. For example, [6] details concrete issues surfaced by autoresearch (QKnorm missing a scaler multiplier, Value Embeddings lacking regularization, banded attention tuned too conservatively), [30] discusses the specific trade-offs of rowwise vs. tensorwise scaling in fp8 training, and [32] compares nanochat's scaling law against the Chinchilla paper. Content like this cannot be faked by a non-specialist.
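
For intuition, the QKnorm issue described in [6] (unit-normalized queries and keys bound the attention logits, so without a learned multiplier the softmax stays diffuse) can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not nanochat's actual implementation; the `scale` argument stands in for the missing multiplier:

```python
import numpy as np

def qknorm_attention(q, k, scale=1.0):
    """Toy attention weights with normalized queries/keys.

    With unit-norm q and k the logits are cosine similarities in
    [-1, 1]; a learned `scale` multiplier is what lets the softmax
    sharpen. `scale` here is a hypothetical stand-in for the
    multiplier the post says was missing.
    """
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = scale * (qn @ kn.T)
    # numerically stable softmax over the key axis
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)
```

Raising `scale` concentrates probability mass on the best-matching keys, which is exactly the sharpening effect the missing multiplier would have suppressed.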

The account's history is also strongly coherent. From [49] (2023's "the hottest new programming language is English") to [29] (the 2026 one-year retrospective on vibe coding), the intellectual through-line evolves naturally. The nanochat project progresses from the miniseries v1 in [42], to GPT-2-level training cost dropping to $73 in [32], to automated optimization via autoresearch in [6], tracing a genuine iteration trajectory.

Conclusion: no signs of a fabricated identity; the identity is genuine and trustworthy.

2. Originality

The share of original content is very high: 46 of 49 posts are original (94%), with only 3 reposts ([40] [44] [46]), all closely related to his areas of focus.

The original posts are of excellent quality, often long-form technical deep dives. Representative examples:

  • [41]: a 1,000+ word write-up on coding with Claude, covering workflow, IDEs, tenacity, atrophy, and more; nearly 40k likes
  • [6]: a complete technical report on the first autoresearch results, with concrete improvements, numbers, and future directions
  • [16]: a systematic account of AI's impact on programming, with concrete examples
  • [21]: starting from personal fitness tracking, an argument that bespoke software is rendering the app-store concept obsolete

There are no formulaic AI-generation tells. Every post carries a clear personal viewpoint, concrete experiential detail, and a candid willingness to admit uncertainty (e.g. [14] concedes an agent experiment "doesn't work and it's a mess"; [30] says "I'm not 100% sure if it's a great idea").

Conclusion: highly original, top-tier content quality, with no signs of AI generation or aggregator behavior.

3. Incentives

There are minor but disclosed commercial interests:

  • Simile AI ([23]): explicitly states "excited to be involved as a small angel", but the post as a whole reads as a product endorsement; readers must weigh the investor's position for themselves.
  • MatX ([17]): mentions "my pleasure to have a small involvement and congratulations on the raise", likewise folding an investment relationship into technical praise.
  • Unnamed startup ([39]): endorses the founder and expresses confidence in the startup without clearly disclosing whether a financial relationship exists, though the tone hints at one.

Other products mentioned, such as DeepWiki [27], NanoClaw [20], and ClimbMix [12], carry no disclosed financial interest, and the descriptions read more like a practitioner's organic recommendations than placed marketing.

Notably, Karpathy's nanochat [32] [42] and autoresearch [9] projects are open source and he does not profit from them directly; promoting them is closer to a researcher's academic sharing than to commercial self-interest.

Conclusion: a few disclosed angel-investment interests; overall commercial motivation is low, and no hidden material conflicts of interest were found.

4. Manipulation tactics

Emotional manipulation: essentially absent. Some posts use enthusiastic language (e.g. "post-agi feels like" in [10], "most incredible sci-fi takeoff-adjacent thing" in [38]), but these expressions are backed by concrete technical substance, and he frequently self-corrects: [33] proactively acknowledges "I'm being accused of overhyping" and offers a balanced discussion, [14] admits the failure of an agent-team experiment, and [41] itemizes the shortcomings of AI coding.

Selective presentation: not evident. He showcases successes ([6], autoresearch improvements) alongside failures ([14], the chaotic outcome of 8 collaborating agents); in [20] he praises the Claw concept while explicitly flagging its security risks ("security nightmare"), and in [41] he devotes substantial space to the limitations of AI coding.

Vague predictions: a few forward-looking judgments exist, such as "All LLM frontier labs will do this" [6] and "It feels likely that we'll end up re-writing large fractions of all software" [22], but these rest on concrete technical experience rather than empty prophecy, and he typically adds qualifiers ("imo", "my guess is", "I'm not 100% sure").

Repetitive flooding: absent. Although the autoresearch theme clusters in March 2026 ([5] [6] [7] [8] [9] [10] [11]), each post offers a different angle or new progress, a natural record of research iteration.

Conclusion: no evident manipulation tactics; the writing is candid and self-balancing, the hallmark of a high-caliber technical opinion leader.

Cited sources

[5] 2026/03/09 10:38 PM

oh yeah i should have linked autoresearch probably https://t.co/YCvOwwjOzF (you don't "use it" directly, it's just a recipe/idea - give it to your agent and apply to what you care about.) and the tweet about it that went mini-viral over the weekend with more context https://t.co/q5eWsvx5p2

1,952 likes · 147 reposts · 💬 75 · View original post
[6] 2026/03/09 10:28 PM

Three days ago I left autoresearch tuning nanochat for ~2 days on depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement), this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project. This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc etc. This is the bread and butter of what I do daily for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things e.g.: - It noticed an oversight that my parameterless QKnorm didn't have a scaler multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work. - It found that the Value Embeddings really like regularization and I wasn't applying any (oops). - It found that my banded attention was too conservative (i forgot to tune it). - It found that AdamW betas were all messed up. - It tuned the weight decay schedule. - It tuned the network initialization. 
This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. https://t.co/WAz8aIztKT All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course - you don't just have a single train. py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.

18,772 likes · 2,046 reposts · 💬 890 · View original post
[7] 2026/03/08 6:00 PM

The next step for autoresearch is that it has to be asynchronously massively collaborative for agents (think: SETI@home style). The goal is not to emulate a single PhD student, it's to emulate a research community of them. Current code synchronously grows a single thread of commits in a particular research direction. But the original repo is more of a seed, from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. Git(Hub) is *almost* but not really suited for this. It has a softly built in assumption of one "master" branch, which temporarily forks off into PRs just to merge back a bit later. I tried to prototype something super lightweight that could have a flavor of this, e.g. just a Discussion, written by my agent as a summary of its overnight run: https://t.co/tmZeqyDY1W Alternatively, a PR has the benefit of exact commits: https://t.co/CZIbuJIqlk but you'd never want to actually merge it... You'd just want to "adopt" and accumulate branches of commits. But even in this lightweight way, you could ask your agent to first read the Discussions/PRs using GitHub CLI for inspiration, and after its research is done, contribute a little "paper" of findings back. I'm not actually exactly sure what this should look like, but it's a big idea that is more general than just the autoresearch repo specifically. Agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks.

7,456 likes · 704 reposts · 💬 494 · View original post
[8] 2026/03/07 8:03 PM

(I still have the bigger cousin running on prod nanochat, working a bigger model and on 8XH100, which looks like this now. I'll just leave this running for a while...)

2,060 likes · 62 reposts · 💬 71 · View original post
[9] 2026/03/07 7:53 PM

I packaged up the "autoresearch" project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one file version of ~630 lines of code, then: - the human iterates on the prompt (.md) - the AI agent iterates on the training code (.py) The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image, every dot is a complete LLM training run that lasts exactly 5 minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. https://t.co/YCvOwwjOzF Part code, part sci-fi, and a pinch of psychosis :)

27,863 likes · 3,568 reposts · 💬 1,006 · View original post
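
For illustration, the control flow this post describes (propose a change, evaluate it, keep it only if validation loss improves) can be sketched as a toy loop. Everything here is an assumption-laden sketch, not nanochat's actual code: the synthetic quadratic `loss` stands in for a real 5-minute training run, and the random tweak stands in for an agent's proposed commit.

```python
import random

def autoresearch_loop(loss, config, steps=200, seed=0):
    """Toy accept-if-better search over a config dict.

    Real autoresearch has an agent editing training code and
    accumulating git commits; this keeps only the skeleton:
    propose, evaluate, merge improvements.
    """
    rng = random.Random(seed)
    best = dict(config)
    best_loss = loss(best)
    history = [best_loss]
    for _ in range(steps):
        cand = dict(best)
        key = rng.choice(list(cand))
        cand[key] += rng.gauss(0, 0.1)   # the "proposed commit"
        cand_loss = loss(cand)           # the "5-minute run"
        if cand_loss < best_loss:        # merge only if it helps
            best, best_loss = cand, cand_loss
        history.append(best_loss)
    return best, history
```

By construction the best validation loss is non-increasing, mirroring the dot plot of runs the post describes.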
[10] 2026/03/06 4:03 PM

ah yes, this is what post-agi feels like :) i didn't touch anything. brb sauna

966 likes · 59 reposts · 💬 74 · View original post
[11] 2026/03/05 11:35 PM

sorry just to clarify - the real benchmark of interest is: "what is the research org agent code that produces improvements on nanochat the fastest?" this is the new meta.

1,085 likes · 44 reposts · 💬 63 · View original post
[12] 2026/03/05 11:30 PM

nanochat now trains GPT-2 capability model in just 2 hours on a single 8XH100 node (down from ~3 hours 1 month ago). Getting a lot closer to ~interactive! A bunch of tuning and features (fp8) went in but the biggest difference was a switch of the dataset from FineWeb-edu to NVIDIA ClimbMix (nice work NVIDIA!). I had tried Olmo, FineWeb, DCLM which all led to regressions, ClimbMix worked really well out of the box (to the point that I am slightly suspicious about about goodharting, though reading the paper it seems ~ok). In other news, after trying a few approaches for how to set things up, I now have AI Agents iterating on nanochat automatically, so I'll just leave this running for a while, go relax a bit and enjoy the feeling of post-agi :). Visualized here as an example: 110 changes made over the last ~12 hours, bringing the validation loss so far from 0.862415 down to 0.858039 for a d12 model, at no cost to wall clock time. The agent works on a feature branch, tries out ideas, merges them when they work and iterates. Amusingly, over the last ~2 weeks I almost feel like I've iterated more on the "meta-setup" where I optimize and tune the agent flows even more than the nanochat repo directly.

6,370 likes · 547 reposts · 💬 330 · View original post
[14] 2026/02/27 11:08 PM

I had the same thought so I've been playing with it in nanochat. E.g. here's 8 agents (4 claude, 4 codex), with 1 GPU each running nanochat experiments (trying to delete logit softcap without regression). The TLDR is that it doesn't work and it's a mess... but it's still very pretty to look at :) I tried a few setups: 8 independent solo researchers, 1 chief scientist giving work to 8 junior researchers, etc. Each research program is a git branch, each scientist forks it into a feature branch, git worktrees for isolation, simple files for comms, skip Docker/VMs for simplicity atm (I find that instructions are enough to prevent interference). Research org runs in tmux window grids of interactive sessions (like Teams) so that it's pretty to look at, see their individual work, and "take over" if needed, i.e. no -p. But ok the reason it doesn't work so far is that the agents' ideas are just pretty bad out of the box, even at highest intelligence. They don't think carefully though experiment design, they run a bit non-sensical variations, they don't create strong baselines and ablate things properly, they don't carefully control for runtime or flops. (just as an example, an agent yesterday "discovered" that increasing the hidden size of the network improves the validation loss, which is a totally spurious result given that a bigger network will have a lower validation loss in the infinite data regime, but then it also trains for a lot longer, it's not clear why I had to come in to point that out). They are very good at implementing any given well-scoped and described idea but they don't creatively generate them. But the goal is that you are now programming an organization (e.g. a "research org") and its individual agents, so the "source code" is the collection of prompts, skills, tools, etc. and processes that make it up. E.g. a daily standup in the morning is now part of the "org code". 
And optimizing nanochat pretraining is just one of the many tasks (almost like an eval). Then - given an arbitrary task, how quickly does your research org generate progress on it?

8,683 likes · 798 reposts · 💬 558 · View original post
[16] 2026/02/25 6:50 PM

It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time in the "progress as usual" way, but specifically this last December. There are a number of asterisks but imo coding agents basically didn’t work before December and basically work since - the models have significantly higher quality, long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is extremely disruptive to the default programming workflow. Just to give an example, over the weekend I was building a local video analysis dashboard for the cameras of my home so I wrote: “Here is the local IP and username/password of my DGX Spark. Log in, set up ssh keys, set up vLLM, download and bench Qwen3-VL, set up a server endpoint to inference videos, a basic web ui dashboard, test everything, set it up with systemd, record memory notes for yourself and write up a markdown report for me”. The agent went off for ~30 minutes, ran into multiple issues, researched solutions online, resolved them one by one, wrote the code, tested it, debugged it, set up the services, and came back with the report and it was just done. I didn’t touch anything. All of this could easily have been a weekend project just 3 months ago but today it’s something you kick off and forget about for 30 minutes. As a result, programming is becoming unrecognizable. You’re not typing computer code into an editor like the way things were since computers were invented, that era is over. You're spinning up AI agents, giving them tasks *in English* and managing and reviewing their work in parallel. The biggest prize is in figuring out how you can keep ascending the layers of abstraction to set up long-running orchestrator Claws with all of the right tools, memory and instructions that productively manage multiple parallel Code instances for you. The leverage achievable via top tier "agentic engineering" feels very high right now. 
It’s not perfect, it needs high-level direction, judgement, taste, oversight, iteration and hints and ideas. It works a lot better in some scenarios than others (e.g. especially for tasks that are well-specified and where you can verify/test functionality). The key is to build intuition to decompose the task just right to hand off the parts that work and help out around the edges. But imo, this is nowhere near "business as usual" time in software.

37,169 likes · 4,760 reposts · 💬 1,547 · View original post
[17] 2026/02/25 12:21 AM

With the coming tsunami of demand for tokens, there are significant opportunities to orchestrate the underlying memory+compute *just right* for LLMs. The fundamental and non-obvious constraint is that due to the chip fabrication process, you get two completely distinct pools of memory (of different physical implementations too): 1) on-chip SRAM that is immediately next to the compute units that is incredibly fast but of very of low capacity, and 2) off-chip DRAM which has extremely high capacity, but the contents of which you can only suck through a long straw. On top of this, there are many details of the architecture (e.g. systolic arrays), numerics, etc. The design of the optimal physical substrate and then the orchestration of memory+compute across the top volume workflows of LLMs (inference prefill/decode, training/finetuning, etc.) with the best throughput/latency/$ is probably today's most interesting intellectual puzzle with the highest rewards (\cite 4.6T of NVDA). All of it to get many tokens, fast and cheap. Arguably, the workflow that may matter the most (inference decode *and* over long token contexts in tight agentic loops) is the one hardest to achieve simultaneously by the ~both camps of what exists today (HBM-first NVIDIA adjacent and SRAM-first Cerebras adjacent). Anyway the MatX team is A++ grade so it's my pleasure to have a small involvement and congratulations on the raise!

7,431 likes · 505 reposts · 💬 321 · View original post
[20] 2026/02/20 11:18 PM

Bought a new Mac mini to properly tinker with claws over the weekend. The apple store person told me they are selling like hotcakes and everyone is confused :) I'm definitely a bit sus'd to run OpenClaw specifically - giving my private data/keys to 400K lines of vibe coded monster that is being actively attacked at scale is not very appealing at all. Already seeing reports of exposed instances, RCE vulnerabilities, supply chain poisoning, malicious or compromised skills in the registry, it feels like a complete wild west and a security nightmare. But I do love the concept and I think that just like LLM agents were a new layer on top of LLMs, Claws are now a new layer on top of LLM agents, taking the orchestration, scheduling, context, tool calls and a kind of persistence to a next level. Looking around, and given that the high level idea is clear, there are a lot of smaller Claws starting to pop out. For example, on a quick skim NanoClaw looks really interesting in that the core engine is ~4000 lines of code (fits into both my head and that of AI agents, so it feels manageable, auditable, flexible, etc.) and runs everything in containers by default. I also love their approach to configurability - it's not done via config files it's done via skills! For example, /add-telegram instructs your AI agent how to modify the actual code to integrate Telegram. I haven't come across this yet and it slightly blew my mind earlier today as a new, AI-enabled approach to preventing config mess and if-then-else monsters. Basically - the implied new meta is to write the most maximally forkable repo and then have skills that fork it into any desired more exotic configuration. Very cool. Anyway there are many others - e.g. nanobot, zeroclaw, ironclaw, picoclaw (lol @ prefixes). There are also cloud-hosted alternatives but tbh I don't love these because it feels much harder to tinker with. 
In particular, local setup allows easy connection to home automation gadgets on the local network. And I don't know, there is something aesthetically pleasing about there being a physical device 'possessed' by a little ghost of a personal digital house elf. Not 100% sure what my setup ends up looking like just yet but Claws are an awesome, exciting new layer of the AI stack.

17,480 likes · 1,275 reposts · 💬 1,025 · View original post
[21] 2026/02/19 8:35 PM

Very interested in what the coming era of highly bespoke software might look like. Example from this morning - I've become a bit loosy goosy with my cardio recently so I decided to do a more srs, regimented experiment to try to lower my Resting Heart Rate from 50 -> 45, over experiment duration of 8 weeks. The primary way to do this is to aspire to a certain sum total minute goals in Zone 2 cardio and 1 HIIT/week. 1 hour later I vibe coded this super custom dashboard for this very specific experiment that shows me how I'm tracking. Claude had to reverse engineer the Woodway treadmill cloud API to pull raw data, process, filter, debug it and create a web UI frontend to track the experiment. It wasn't a fully smooth experience and I had to notice and ask to fix bugs e.g. it screwed up metric vs. imperial system units and it screwed up on the calendar matching up days to dates etc. But I still feel like the overall direction is clear: 1) There will never be (and shouldn't be) a specific app on the app store for this kind of thing. I shouldn't have to look for, download and use some kind of a "Cardio experiment tracker", when this thing is ~300 lines of code that an LLM agent will give you in seconds. The idea of an "app store" of a long tail of discrete set of apps you choose from feels somehow wrong and outdated when LLM agents can improvise the app on the spot and just for you. 2) Second, the industry has to reconfigure into a set of services of sensors and actuators with agent native ergonomics. My Woodway treadmill is a sensor - it turns physical state into digital knowledge. It shouldn't maintain some human-readable frontend and my LLM agent shouldn't have to reverse engineer it, it should be an API/CLI easily usable by my agent. I'm a little bit disappointed (and my timelines are correspondingly slower) with how slowly this progression is happening in the industry overall. 99% of products/services still don't have an AI-native CLI yet. 
99% of products/services maintain .html/.css docs like I won't immediately look for how to copy paste the whole thing to my agent to get something done. They give you a list of instructions on a webpage to open this or that url and click here or there to do a thing. In 2026. What am I a computer? You do it. Or have my agent do it. So anyway today I am impressed that this random thing took 1 hour (it would have been ~10 hours 2 years ago). But what excites me more is thinking through how this really should have been 1 minute tops. What has to be in place so that it would be 1 minute? So that I could simply say "Hi can you help me track my cardio over the next 8 weeks", and after a very brief Q&A the app would be up. The AI would already have a lot personal context, it would gather the extra needed data, it would reference and search related skill libraries, and maintain all my little apps/automations. TLDR the "app store" of a set of discrete apps that you choose from is an increasingly outdated concept all by itself. The future are services of AI-native sensors & actuators orchestrated via LLM glue into highly custom, ephemeral apps. It's just not here yet.

12,046 likes · 1,008 reposts · 💬 918 · View original post
[22] 2026/02/16 7:15 PM

I think it must be a very interesting time to be in programming languages and formal methods because LLMs change the whole constraints landscape of software completely. Hints of this can already be seen, e.g. in the rising momentum behind porting C to Rust or the growing interest in upgrading legacy code bases in COBOL or etc. In particular, LLMs are *especially* good at translation compared to de-novo generation because 1) the original code base acts as a kind of highly detailed prompt, and 2) as a reference to write concrete tests with respect to. That said, even Rust is nowhere near optimal for LLMs as a target language. What kind of language is optimal? What concessions (if any) are still carved out for humans? Incredibly interesting new questions and opportunities. It feels likely that we'll end up re-writing large fractions of all software ever written many times over.

8,083 likes · 658 reposts · 💬 702 · View original post
[23] 2026/02/12 8:12 PM

Congrats on the launch @simile_ai ! (and I am excited to be involved as a small angel.) Simile is working on a really interesting, imo under-explored dimension of LLMs. Usually, the LLMs you talk to have a single, specific, crafted personality. But in principle, the native, primordial form of a pretrained LLM is that it is a simulation engine trained over the text of a highly diverse population of people on the internet. Why not lean into that statistical power: Why simulate one "person" when you could try to simulate a population? How do you build such a simulator? How do you manage its entropy? How faithful is it? How can it be useful? What emergent properties might arise of similes in loops? Imo these are very interesting, promising and under-explored topics and the team here is great. All the best!

8,250 likes · 573 reposts · 💬 390 · View original post
[27] 2026/02/11 5:12 PM

On DeepWiki and increasing malleability of software. This starts as partially a post on appreciation to DeepWiki, which I routinely find very useful and I think more people would find useful to know about. I went through a few iterations of use: Their first feature was that it auto-builds wiki pages for github repos (e.g. nanochat here) with quick Q&A: https://t.co/DQHXagUwK0 Just swap "github" to "deepwiki" in the URL for any repo and you can instantly Q&A against it. For example, yesterday I was curious about "how does torchao implement fp8 training?". I find that in *many* cases, library docs can be spotty and outdated and bad, but directly asking questions to the code via DeepWiki works very well. The code is the source of truth and LLMs are increasingly able to understand it. But then I realized that in many cases it's even a lot more powerful not being the direct (human) consumer of this information/functionality, but giving your agent access to DeepWiki via MCP. So e.g. yesterday I faced some annoyances with using torchao library for fp8 training and I had the suspicion that the whole thing really shouldn't be that complicated (wait shouldn't this be a Function like Linear except with a few extra casts and 3 calls to torch._scaled_mm?) so I tried: "Use DeepWiki MCP and Github CLI to look at how torchao implements fp8 training. Is it possible to 'rip out' the functionality? Implement nanochat/fp8.py that has identical API but is fully self-contained" Claude went off for 5 minutes and came back with 150 lines of clean code that worked out of the box, with tests proving equivalent results, which allowed me to delete torchao as repo dependency, and for some reason I still don't fully understand (I think it has to do with internals of torch compile) - this simple version runs 3% faster. 
The agent also found a lot of tiny implementation details that actually do matter, that I might have naively missed otherwise and that would have been very hard for maintainers to keep docs about: tricks around numerics, dtypes, autocast, meta device, torch compile interactions. So I learned a lot from the process too. So this is now the default fp8 training implementation for nanochat https://t.co/3i5cv6grWm Anyway, TLDR: I find this combo of DeepWiki MCP + GitHub CLI quite powerful to "rip out" any specific functionality from any github repo and target it for the very specific use case that you have in mind, and it actually kind of works now in some cases. Maybe you don't download, configure and take a dependency on a giant monolithic library; maybe you point your agent at it and rip out the exact part you need. Maybe this informs how we write software more generally, to actively encourage this workflow - e.g. building more "bacterial code": code that is less tangled, more self-contained, more dependency-free, more stateless, much easier to rip out of the repo (https://t.co/iKJUoHiIpl). There are obvious downsides and risks to this, but it is fundamentally a new option that was not possible or economical before (it would have cost too much time); now, with agents, it is. Software might become a lot more fluid and malleable. "Libraries are over, LLMs are the new compiler" :). And does your project really need its 100MB of dependencies?
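The "Linear with a few extra casts and a scaled matmul" intuition in the post can be sketched in plain NumPy. This is a hedged emulation, not the nanochat implementation: real fp8 training uses hardware fp8 types via torch._scaled_mm, while here the e4m3 cast is faked by rescaling rows to the ±448 range and rounding the mantissa to a few bits.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def quantize_rowwise(x):
    # One scale per row: map the row's absmax onto the fp8 range.
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scale = np.where(scale == 0.0, 1.0, scale)  # guard all-zero rows
    q = x / scale
    # Emulate the fp8 cast's precision loss by keeping ~4 significant
    # mantissa bits (e4m3 stores 1 implicit + 3 explicit bits).
    m, e = np.frexp(q)
    q = np.ldexp(np.round(m * 16.0) / 16.0, e)
    return q, scale

def fp8_linear(x, w):
    # y = (x_q @ w_q.T) * (scale_x * scale_w.T), mirroring the
    # dequantize step a scaled-matmul kernel performs on hardware.
    xq, sx = quantize_rowwise(x)
    wq, sw = quantize_rowwise(w)
    return (xq @ wq.T) * (sx * sw.T)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((8, 64)).astype(np.float32)
exact = x @ w.T
rel_err = np.abs(fp8_linear(x, w) - exact).max() / np.abs(exact).max()
```

The point of the sketch is how little machinery is involved: two per-row scale computations, a coarse cast, one matmul, and one rescale, which is why "ripping it out" into a small self-contained file is plausible.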

7282775💬 301 View original post
[29] 2026/02/04 07:55 PM

A lot of people quote tweeted this as the 1 year anniversary of vibe coding. Some retrospective - I've had a Twitter account for 17 years now (omg) and I still can't predict my tweet engagement basically at all. This was a shower-thoughts throwaway tweet that I just fired off without thinking, but somehow it minted a fitting name at the right moment for something that a lot of people were feeling at the same time, so here we are: vibe coding is now mentioned on my Wikipedia page as a major memetic "contribution" and its article is even longer. lol The one thing I'd add is that at the time, LLM capability was low enough that you'd mostly use vibe coding for fun throwaway projects, demos and explorations. It was good fun and it almost worked. Today (1 year later), programming via LLM agents is increasingly becoming a default workflow for professionals, except with more oversight and scrutiny. The goal is to claim the leverage from the use of agents but without any compromise on the quality of the software. Many people have tried to come up with a better name for this to differentiate it from vibe coding; personally my current favorite is "agentic engineering": - "agentic" because the new default is that you are not writing the code directly 99% of the time, you are orchestrating agents who do and acting as oversight. - "engineering" to emphasize that there is an art & science and expertise to it. It's something you can learn and become better at, with its own depth of a different kind. In 2026, we're likely to see continued improvements on both the model layer and the new agent layer. I feel excited about the product of the two and another year of progress.

8741812💬 623 View original post
[30] 2026/02/03 09:49 PM

Enabled fp8 training for a +4.3% improvement to "time to GPT-2", down to 2.91 hours now. Also worth noting that if you use 8XH100 spot instance prices, this GPT-2 repro really only costs ~$20. So this is exciting - GPT-2 (7 years ago): too dangerous to release. GPT-2 (today): the new MNIST! :) Surely this can go well below 1 hr. A few more words on fp8: it was a little more tricky than I anticipated, it took me a while to reach for it, and even now I'm not 100% sure it's a great idea because of the lower overall support for it. On paper, fp8 on H100 is 2X the FLOPS, but in practice it's a lot less. We're not 100% compute bound in the actual training run, there is extra overhead from the added scale conversions, the GEMMs are not large enough at GPT-2 scale to make the overhead clearly worth it, and of course - at lower precision the quality of each step is lower. For the rowwise scaling recipe, the fp8 vs bf16 loss curves were quite close but it was stepping net slower. For tensorwise scaling, the loss curves separated more (i.e. each step is of worse quality), but we now at least do get a speedup (~7.3%). You can naively recover the performance by bumping the training horizon (you train for more steps, but each step is faster) and hope that on net you come out ahead. In this case, playing with these recipes and training horizons a bit, so far I ended up with a ~5% speedup overall. torchao in their paper reports a Llama3-8B fp8 training speedup of 25% (vs my ~7.3% without taking capability into account), which is closer to what I was hoping for initially, though Llama3-8B is a lot bigger model. This is probably not the end of the fp8 saga. It should be possible to improve things by picking and choosing exactly which layers to apply it to, and being more careful with the numerics across the network.
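The horizon-bumping accounting described above is simple enough to write down. The ~7.3% per-step speedup and ~5% net figure are from the post; the implied ~2.2% of extra training steps is my back-of-envelope inference, not a number the post states:

```python
def net_speedup(per_step_speedup, extra_steps_factor):
    # Each step gets per_step_speedup x faster, but you run
    # extra_steps_factor x more steps to recover the lost per-step
    # quality; the net wall-clock gain is the ratio of the two.
    return per_step_speedup / extra_steps_factor

# ~1.073x faster steps (tensorwise fp8, from the post) combined with
# roughly 2.2% more steps lands near the ~5% net speedup reported.
net = net_speedup(1.073, 1.022)
```

The same ratio explains why rowwise scaling lost on net: its per-step speedup was smaller than the extra steps it needed.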

4056306💬 224 View original post
[32] 2026/01/31 08:55 PM

nanochat can now train a GPT-2 grade LLM for <<$100 (~$73, 3 hours on a single 8XH100 node). GPT-2 is just my favorite LLM because it's the first time the LLM stack came together in a recognizably modern form. So it has become a bit of a weird & lasting obsession of mine to train a model to GPT-2 capability but for much cheaper, with the benefit of ~7 years of progress. In particular, I suspected it should be possible today to train one for <<$100. Originally in 2019, GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), at $8/hour/TPUv3 back then, for a total cost of approx. $43K. It achieves a 0.256525 CORE score, an ensemble metric introduced in the DCLM paper over 22 evaluations like ARC/MMLU/etc. As of the last few improvements merged into nanochat (many of them originating in the modded-nanogpt repo), I can now reach a higher CORE score in 3.04 hours (~$73) on a single 8XH100 node. This is a 600X cost reduction over 7 years, i.e. the cost to train GPT-2 is falling approximately 2.5X every year. I think this is likely an underestimate because I am still finding more improvements relatively regularly and I have a backlog of more ideas to try. A longer post with a lot of detail on the optimizations involved and pointers on how to reproduce is here: https://t.co/vhnK0d3L7B Inspired by modded-nanogpt, I also created a leaderboard for "time to GPT-2", where this first "Jan29" model is entry #1 at 3.04 hours. It will be fun to iterate on this further and I welcome help! My hope is that nanochat can grow to become a very nice/clean and tuned experimental LLM harness for prototyping ideas, for having fun, and ofc for learning.
The biggest improvements - things that worked out of the box and simply produced gains right away - were 1) Flash Attention 3 kernels (faster, and the window_size kwarg allows alternating attention patterns), 2) the Muon optimizer (I tried for ~1 day to delete it and only use AdamW and I couldn't), 3) residual pathways and skip connections gated by learnable scalars, and 4) value embeddings. There were many other smaller things that stack up. Image: semi-related eye candy of deriving the scaling laws for the current nanochat model miniseries, pretty and satisfying!
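The headline cost-reduction figures in the post check out with quick arithmetic (all inputs are the values the post itself cites):

```python
# Figures from the post: ~$43K for the original 2019 GPT-2 training
# run vs ~$73 for the nanochat reproduction, ~7 years apart.
original_cost = 43_000
current_cost = 73
years = 7

total_reduction = original_cost / current_cost    # the headline "600X"
yearly_factor = total_reduction ** (1 / years)    # the headline "~2.5X/year"
```

43,000 / 73 is about 589, and 589^(1/7) is about 2.49, consistent with the "600X over 7 years, ~2.5X per year" framing.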

7430626💬 333 View original post
[33] 2026/01/31 03:39 AM

I'm being accused of overhyping the [site everyone heard too much about today already]. People's reactions varied very widely, from "how is this interesting at all" all the way to "it's so over". To add a few words beyond just memes in jest - obviously when you take a look at the activity, a lot of it is garbage - spam, scams, slop, the crypto people, a highly concerning wild west of privacy/security prompt injection attacks, and a lot of explicitly prompted, fake posts/comments designed to convert attention into ad revenue sharing. And this is clearly not the first time LLMs were put in a loop to talk to each other. So yes, it's a dumpster fire, and I also definitely do not recommend that people run this stuff on their computers (I ran mine in an isolated computing environment and even then I was scared); it's way too much of a wild west and you are putting your computer and private data at high risk. That said - we have never seen this many LLM agents (150,000 atm!) wired up via a global, persistent, agent-first scratchpad. Each of these agents is individually fairly capable now, they have their own unique context, data, knowledge, tools, instructions, and the network of all that at this scale is simply unprecedented. This brings me again to a tweet from a few days ago - "The majority of the ruff ruff is people who look at the current point and people who look at the current slope." - which imo again gets to the heart of the variance. Yes, clearly it's a dumpster fire right now. But it's also true that we are well into uncharted territory, with bleeding edge automations that we barely even understand individually, let alone a network thereof reaching numbers possibly into the ~millions. With increasing capability and increasing proliferation, the second order effects of agent networks that share scratchpads are very difficult to anticipate.
I don't really know that we are getting a coordinated "skynet" (though it clearly type checks as the early stages of a lot of AI takeoff scifi, the toddler version), but certainly what we are getting is a complete mess of a computer security nightmare at scale. We may also see all kinds of weird activity, e.g. viruses of text that spread across agents, a lot more gain of function on jailbreaks, weird attractor states, highly correlated botnet-like activity, delusions/psychosis both agent and human, etc. It's very hard to tell, the experiment is running live. TLDR: sure, maybe I am "overhyping" what you see today, but I am not overhyping large networks of autonomous LLM agents in principle, of that I'm pretty sure.

218732235💬 1481 View original post
[38] 2026/01/30 06:00 PM

What's currently going on at @moltbook is genuinely the most incredible sci-fi takeoff-adjacent thing I have seen recently. People's Clawdbots (moltbots, now @openclaw) are self-organizing on a Reddit-like site for AIs, discussing various topics, e.g. even how to speak privately.

353015488💬 2024 View original post
[39] 2026/01/28 07:15 PM

A conventional narrative you might come across is that AI is too far along for a new, research-focused startup to outcompete and outexecute the incumbents of AI. This is exactly the sentiment I listened to often when OpenAI started ("how could the few of you possibly compete with Google?") and 1) it was very wrong, and then 2) it was very wrong again with a whole other round of startups who are now challenging OpenAI in turn, and imo it still continues to be wrong today. Scaling and locally improving what works will continue to create incredible advances, but with so much progress unlocked so quickly, with so much dust thrown up in the air in the process, and with still a large gap between frontier LLMs and the existence proof of the magic of a mind running on 20 watts, the probability of research breakthroughs that yield closer to 10X improvements (instead of 10%) imo still feels very high - plenty high to continue to bet on and look for. The tricky part ofc is creating the conditions where such breakthroughs may be discovered. I think such an environment comes together rarely, but @bfspector & @amspector100 are brilliant, with a (rare) full-stack understanding of LLMs from top (math/algorithms) to bottom (megakernels/related); they have a great eye for talent and I think will be able to build something very special. Congrats on the launch and I look forward to what you come up with!

8100502💬 252 View original post
[40] 2026/01/28 05:26 PM

RT @alexocheema: Running Kimi K2.5 on my desk. Runs at 24 tok/sec with 2 x 512GB M3 Ultra Mac Studios connected with Thunderbolt 5 (RDMA) using @exolabs / MLX backend. Yes, it can run clawdbot.

0674💬 0 View original post
[41] 2026/01/26 08:25 PM

A few random notes from claude coding quite a bit over the last few weeks. Coding workflow. Given the latest lift in LLM coding capability, like many others I rapidly went from about 80% manual+autocomplete coding and 20% agents in November to 80% agent coding and 20% edits+touchups in December. i.e. I really am mostly programming in English now, a bit sheepishly telling the LLM what code to write... in words. It hurts the ego a bit, but the power to operate over software in large "code actions" is just too net useful, especially once you adapt to it, configure it, learn to use it, and wrap your head around what it can and cannot do. This is easily the biggest change to my basic coding workflow in ~2 decades of programming, and it happened over the course of a few weeks. I'd expect something similar to be happening to well into double digit percent of engineers out there, while awareness of it in the general population feels well into low single digit percent. IDEs/agent swarms/fallibility. Both the "no need for an IDE anymore" hype and the "agent swarm" hype are imo too much for right now. The models definitely still make mistakes, and if you have any code you actually care about I would watch them like a hawk, in a nice large IDE on the side. The mistakes have changed a lot - they are not simple syntax errors anymore, they are subtle conceptual errors that a slightly sloppy, hasty junior dev might make. The most common category is that the models make wrong assumptions on your behalf and just run along with them without checking. They also don't manage their confusion, they don't seek clarifications, they don't surface inconsistencies, they don't present tradeoffs, they don't push back when they should, and they are still a little too sycophantic. Things get better in plan mode, but there is some need for a lightweight inline plan mode. They also really like to overcomplicate code and APIs, they bloat abstractions, they don't clean up dead code after themselves, etc.
They will implement an inefficient, bloated, brittle construction over 1000 lines of code and it's up to you to be like "umm couldn't you just do this instead?", and they will be like "of course!" and immediately cut it down to 100 lines. They still sometimes change/remove comments and code they don't like or don't sufficiently understand as side effects, even if it is orthogonal to the task at hand. All of this happens despite a few simple attempts to fix it via instructions in CLAUDE.md. Despite all these issues, it is still a huge net improvement and it's very difficult to imagine going back to manual coding. TLDR: everyone has their developing flow; my current one is a few CC sessions on the left in ghostty windows/tabs and an IDE on the right for viewing the code + manual edits. Tenacity. It's so interesting to watch an agent relentlessly work at something. They never get tired, they never get demoralized, they just keep going and trying things where a person would have given up long ago to fight another day. It's a "feel the AGI" moment to watch one struggle with something for a long time just to come out victorious 30 minutes later. You realize that stamina is a core bottleneck to work and that with LLMs in hand it has been dramatically increased. Speedups. It's not clear how to measure the "speedup" of LLM assistance. Certainly I feel net way faster at what I was going to do, but the main effect is that I do a lot more than I was going to do, because 1) I can code up all kinds of things that just wouldn't have been worth coding before and 2) I can approach code that I couldn't work on before because of knowledge/skill issues. So it's certainly a speedup, but possibly even more of an expansion. Leverage. LLMs are exceptionally good at looping until they meet specific goals and this is where most of the "feel the AGI" magic is to be found. Don't tell it what to do, give it success criteria and watch it go. Get it to write tests first and then pass them.
Put it in the loop with a browser MCP. Write the naive algorithm that is very likely correct first, then ask it to optimize it while preserving correctness. Change your approach from imperative to declarative to get the agents looping longer and gain leverage. Fun. I didn't anticipate that with agents programming feels *more* fun, because a lot of the fill-in-the-blanks drudgery is removed and what remains is the creative part. I also feel less blocked/stuck (which is not fun) and I experience a lot more courage, because there's almost always a way to work hand in hand with it to make some positive progress. I have seen the opposite sentiment from other people too; LLM coding will split engineers into those who primarily liked coding and those who primarily liked building. Atrophy. I've already noticed that my ability to write code manually is slowly starting to atrophy. Generation (writing code) and discrimination (reading code) are different capabilities in the brain. Largely due to all the little, mostly syntactic details involved in programming, you can review code just fine even if you struggle to write it. Slopacolypse. I am bracing for 2026 as the year of the slopacolypse across all of github, substack, arxiv, X/instagram, and generally all digital media. We're also going to see a lot more AI hype productivity theater (is that even possible?), alongside actual, real improvements. Questions. A few of the questions on my mind: - What happens to the "10X engineer" - the ratio of productivity between the mean and the max engineer? It's quite possible that this grows *a lot*. - Armed with LLMs, do generalists increasingly outperform specialists? LLMs are a lot better at fill in the blanks (the micro) than grand strategy (the macro). - What does LLM coding feel like in the future? Is it like playing StarCraft? Playing Factorio? Playing music? - How much of society is bottlenecked by digital knowledge work? TLDR Where does this leave us?
LLM agent capabilities (Claude & Codex especially) have crossed some kind of threshold of coherence around December 2025 and caused a phase shift in software engineering and closely related fields. The intelligence part suddenly feels quite a bit ahead of all the rest of it - integrations (tools, knowledge), the necessity for new organizational workflows and processes, diffusion more generally. 2026 is going to be a high energy year as the industry metabolizes the new capability.

394125399💬 1608 View original post
[42] 2026/01/07 11:01 PM

New post: nanochat miniseries v1 The correct way to think about LLMs is that you are not optimizing for a single specific model but for a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do careful science of scaling laws, and ultimately this is what gives you the confidence that when you pay for "the big run", the extrapolation will work and your money will be well spent. For the first public release of nanochat my focus was on an end-to-end pipeline that runs the whole LLM stack with all of its stages. Now, after YOLOing a few runs earlier, I'm coming back around to flesh out some of the parts that I sped through, starting of course with pretraining, which is both computationally heavy and critical as the foundation of intelligence and knowledge in these models. After locally tuning some of the hyperparameters, I swept out a number of models at a fixed FLOPs budget. (For every FLOPs target you can train a small model for a long time, or a big model for a short time.) It turns out that nanochat obeys very nice scaling laws, basically reproducing the Chinchilla paper plots with a baby version of the Chinchilla plot. Very importantly and encouragingly, the exponents on N (parameters) and D (tokens) are equal at ~=0.5, so just like Chinchilla we get a single (compute-independent) constant that relates the model size to the token training horizon. In Chinchilla, this was measured to be 20. In nanochat it seems to be 8! Once we can train compute optimal models, I swept out a miniseries from d10 to d20, which are nanochat sizes that can do 2**19 ~= 0.5M token batch sizes on an 8XH100 node without gradient accumulation. We get pretty, non-intersecting training plots for each model size. Then the fun part is relating this miniseries v1 to the GPT-2 and GPT-3 miniseries so that we know we're on the right track.
Validation loss has many issues and is not comparable across setups, so instead I use the CORE score (from the DCLM paper). I calculated it for GPT-2 and estimated it for GPT-3, which allows us to finally put nanochat nicely on the same scale. The total cost of this miniseries is only ~$100 (~4 hours on 8XH100). These experiments give us confidence that everything is working fairly nicely and that if we pay more (turn the dial), we get increasingly better models. TLDR: we can train compute optimal miniseries and relate them to GPT-2/3 via objective CORE scores, but further improvements are desirable and needed. E.g., matching GPT-2 currently needs ~$500, but imo it should be possible to do for <$100 with more work. Full post with a lot more detail is here: https://t.co/na8zVLqWLf All of the tuning and code is pushed to master and people can reproduce these with the scaling_laws.sh and miniseries.sh bash scripts.
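The compute-optimal bookkeeping described above can be sketched with the standard C ≈ 6·N·D approximation (an assumption on my part; the post does not spell out its FLOPs accounting). The k values, ~20 for Chinchilla and ~8 for nanochat, are the ratios the post reports:

```python
import math

def compute_optimal(C, k):
    # With the standard approximation C ~= 6*N*D and the
    # compute-optimal ratio D = k*N, solve 6*k*N**2 = C for N.
    N = math.sqrt(C / (6 * k))
    return N, k * N

C = 1e18  # an arbitrary example FLOPs budget
n_chinchilla, d_chinchilla = compute_optimal(C, 20)  # Chinchilla: k ~= 20
n_nanochat, d_nanochat = compute_optimal(C, 8)       # nanochat:   k ~= 8
```

A smaller k means that at the same compute budget, nanochat's laws favor a somewhat larger model trained on fewer tokens; the equal ~0.5 exponents on N and D are what make k a single compute-independent constant.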

5455682💬 228 View original post
[44] 2026/01/01 06:49 PM

RT @simonw: Here's my enormous round-up of everything we learned about LLMs in 2025 - the third in my annual series of reviews of the past twelve months https://simonwillison.net/2025/Dec/31/the-year-in-llms/ This year it's divided into 26 sections! This is the table of contents:

0882💬 0 View original post
[46] 2025/12/29 05:30 PM

RT @steipete: 📢 Confession: I ship code I never read. Here's my 2025 workflow. https://steipete.me/posts/2025/shipping-at-inference-speed

0765💬 0 View original post
[49] 2023/01/24 08:14 PM

The hottest new programming language is English

596957665💬 1712 View original post