How are the world's most advanced models, GPT-5, Claude, and Gemini, actually trained and served at scale? In this deep dive, we move to the blackboard to quantify the ML infrastructure that makes AI progress possible. Drawing on the expertise of Reiner Pope (formerly of Google TPU architecture), we analyze the dimensionless hardware constants (approx. 300 for most GPUs) that dictate optimal batch sizes and sparsity ratios; a back-of-the-envelope version of that number is sketched below.

Key topics covered in this episode:
- The 20ms Rule: Why memory capacity and bandwidth force a specific schedule on GPU operations.
- The Scaling of Sparsity: How DeepSeek's mixture of experts (MoE) uses "finer-grained" experts to beat the compute bottleneck.
- Physical Constraints: Why the "Memory Wall" is often a literal problem of cable density and bend radius inside a rack.
- Training vs. Inference: Why models are now being "over-trained" to as much as 100x the Chinchilla-optimal token count to save on massive inference costs later.
- The Future of Context: Why we are currently stuck at 200k-token context lengths and what it will take to reach the 100-million-token employee.

Follow us on X/Twitter: @neuralintelorg
Stay updated at: neuralintel.org
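
As a quick illustration of the dimensionless hardware constant and the over-training arithmetic mentioned above, here is a minimal back-of-the-envelope sketch. The H100 spec figures and the 7B-parameter, 20-tokens-per-parameter Chinchilla rule of thumb are assumptions chosen for illustration; the episode does not tie its numbers to a specific chip or model size.

```python
# Back-of-the-envelope sketch of the "dimensionless hardware constant":
# peak compute divided by memory bandwidth. The chip numbers below are the
# publicly listed H100 SXM specs, used here as an assumption.

PEAK_BF16_FLOPS = 989e12   # dense BF16 FLOP/s
HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth, bytes/s

# FLOPs the chip can execute in the time it takes to stream one byte from HBM.
hardware_constant = PEAK_BF16_FLOPS / HBM_BYTES_PER_S
print(f"FLOPs per byte: {hardware_constant:.0f}")  # ~295, i.e. "approx. 300"

# For a BF16 weight matmul at batch size B: each parameter is 2 bytes and
# contributes 2*B FLOPs, so arithmetic intensity is B FLOPs per byte.
# The operation stays memory-bound until B reaches the hardware constant,
# which is why this ratio dictates the optimal batch size.
min_batch_to_saturate = round(hardware_constant)
print(f"batch size needed to be compute-bound: ~{min_batch_to_saturate}")

# Over-training arithmetic from the "Training vs. Inference" bullet, assuming
# the common ~20 tokens-per-parameter Chinchilla rule of thumb and a
# hypothetical 7B-parameter model:
params = 7e9
chinchilla_tokens = 20 * params               # ~140B tokens, compute-optimal
overtrained_tokens = 100 * chinchilla_tokens  # ~14T tokens at "100x Chinchilla"
print(f"Chinchilla-optimal: {chinchilla_tokens:.0e} tokens; "
      f"100x over-trained: {overtrained_tokens:.0e} tokens")
```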