How are the world's most advanced models, GPT-5, Claude, and Gemini, actually trained and served at scale? In this deep dive, we move to the blackboard to quantify the ML infrastructure that makes AI progress possible. Drawing on the expertise of Reiner Pope (formerly of Google TPU architecture), we analyze the dimensionless hardware constants (approx. 300 for most GPUs) that dictate optimal batch sizes and sparsity ratios; a back-of-the-envelope version of that number is sketched below.

Key topics covered in this episode:
- The 20ms Rule: Why memory capacity and bandwidth force a specific schedule on GPU operations.
- The Scaling of Sparsity: How DeepSeek's mixture of experts (MoE) uses "finer-grained" experts to beat the compute bottleneck.
- Physical Constraints: Why the "Memory Wall" is often a literal problem of cable density and bend radius inside a rack.
- Training vs. Inference: Why models are now being "over-trained" to as much as 100x the Chinchilla-optimal token count to save on massive inference costs later.
- The Future of Context: Why we are currently stuck at 200k-token context lengths and what it will take to reach the 100-million-token employee.

Follow us on X/Twitter: @neuralintelorg
Stay updated at: neuralintel.org
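
As a quick illustration of the dimensionless hardware constant and the over-training arithmetic mentioned above, here is a minimal back-of-the-envelope sketch. The H100 spec figures and the 7B-parameter, 20-tokens-per-parameter Chinchilla rule of thumb are assumptions chosen for illustration; the episode does not tie its numbers to a specific chip or model size.

```python
# Back-of-the-envelope sketch of the "dimensionless hardware constant":
# peak compute divided by memory bandwidth. The chip numbers below are the
# publicly listed H100 SXM specs, used here as an assumption.

PEAK_BF16_FLOPS = 989e12   # dense BF16 FLOP/s
HBM_BYTES_PER_S = 3.35e12  # HBM3 bandwidth, bytes/s

# FLOPs the chip can execute in the time it takes to stream one byte from HBM.
hardware_constant = PEAK_BF16_FLOPS / HBM_BYTES_PER_S
print(f"FLOPs per byte: {hardware_constant:.0f}")  # ~295, i.e. "approx. 300"

# For a BF16 weight matmul at batch size B: each parameter is 2 bytes and
# contributes 2*B FLOPs, so arithmetic intensity is B FLOPs per byte.
# The operation stays memory-bound until B reaches the hardware constant,
# which is why this ratio dictates the optimal batch size.
min_batch_to_saturate = round(hardware_constant)
print(f"batch size needed to be compute-bound: ~{min_batch_to_saturate}")

# Over-training arithmetic from the "Training vs. Inference" bullet, assuming
# the common ~20 tokens-per-parameter Chinchilla rule of thumb and a
# hypothetical 7B-parameter model:
params = 7e9
chinchilla_tokens = 20 * params               # ~140B tokens, compute-optimal
overtrained_tokens = 100 * chinchilla_tokens  # ~14T tokens at "100x Chinchilla"
print(f"Chinchilla-optimal: {chinchilla_tokens:.0e} tokens; "
      f"100x over-trained: {overtrained_tokens:.0e} tokens")
```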