Why AI Evaluation Science Can't Keep Up (with Carina Prunkl)

April 17, 2026·54 min

Episode Description from the Publisher

Carina Prunkl is a researcher at Inria. She joins the podcast to discuss how to assess the capabilities and risks of general-purpose AI. We examine why systems can solve hard coding and math problems yet still fail at simple tasks, why pre-deployment tests often miss real-world behavior, and how faster capability gains can increase misuse risks. The conversation also covers de-skilling, red teaming, layered safeguards, and warning signs that AIs might undermine oversight.

LINKS:
Carina Prunkl personal website

CHAPTERS:
(00:00) Episode Preview
(01:04) Introducing the report
(02:10) Jagged frontier capabilities
(05:29) Formal reasoning progress
(12:36) Risks and evaluation science
(19:00) Funding evaluation capacity
(24:03) Autonomy and de-skilling
(31:32) Authenticity and AI companions
(41:00) Defense in depth methods
(48:34) Loss of control risks
(53:16) Where to read report

PRODUCED BY:
https://aipodcast.ing

SOCIAL LINKS:
Website: https://podcast.futureoflife.org
Twitter (FLI): https://x.com/FLI_org
Twitter (Gus): https://x.com/gusdocker
LinkedIn: https://www.linkedin.com/company/future-of-life-institute/
YouTube: https://www.youtube.com/channel/UC-rCCy3FQ-GItDimSR9lhzw/
Apple: https://geo.itunes.apple.com/us/podcast/id1170991978
Spotify: https://open.spotify.com/show/2Op1WO3gwVwCrYHg4eoGyP
