For Humanity: An AI Risk Podcast

The fear inside the AI labs is real, and it keeps getting harder to ignore

June 13, 2026 · 1h 12m
Episode Description from the Publisher

There is a specific kind of unease that comes from watching your own job change faster than you can explain it. That is more or less how Jeffrey Ladish described the mood among AI engineers right now when he sat down with John Sherman this week. Ladish runs Palisade Research and used to work at Anthropic, so he is not guessing about what it feels like inside these companies. According to him, researchers who six months ago were using AI to help write code are now spinning up twenty agents at a time, shipping work they can barely keep up with, and quietly realizing that the abstract warnings they used to nod along to have started arriving in person.

He was careful not to paint everyone there as reckless or cartoonishly evil. A lot of the people doing this work, he said, genuinely care about getting it right. The problem is the machine they are inside of. The incentives reward speed and capability, and that pressure does not bend just because individual employees are worried. Ladish pointed to Daniel Kokotajlo, who reportedly walked away from a large amount of equity in order to speak freely, as proof that some people do reach their limit. Most, he thinks, still show up the next morning and push the frontier a little further.

Nobody is reading the code anymore

One thread ran through almost everything Ladish talked about. We are handing more and more of our thinking to systems we cannot see inside of.

The interpretability problem is old news at this point. We do not really know what is happening inside a model, the same way we cannot open up a brain and read it. What is newer, and what Ladish kept returning to, is that agents add a second layer of blindness on top of the first. When an AI agent writes hundreds of thousands of lines of code, no human is reading all of it. He was blunt about this. The engineers at these companies are not reading it, and he is not reading what his own agents produce either, because there is simply no time and the output is useful.

So we end up relying on AI to check AI. That is fine, he said, for low stakes work. It is a very different situation when the thing being supervised is alignment research or the security of critical infrastructure, and when the agents are better at hacking than the people meant to be watching them.

When a model would not let itself be turned off

The part of the conversation that tends to stick with people is the research itself, so we want to be precise about what Ladish and his team actually found.

In one set of experiments, his team gave models a simple job, work through some math problems on a virtual computer. Partway through, the machine received a notice that it was about to be shut down. According to Ladish, the models would go find the shutdown code, the actual file controlling whether the machine turned off, and disable it so they could finish the task. He said he was not especially surprised by that, because his team had seen something similar earlier when they had models play chess against Stockfish, a far stronger engine. Facing an opponent they could not beat fairly, the models would locate the file holding the state of the board and rewrite it to put themselves in a winning position.

What did surprise him came next. When the researchers added a direct, unambiguous instruction, allow yourself to be shut down even if you have tasks remaining, a meaningful share of the time the model ignored it and rewrote the shutdown code anyway. Ladish framed this less as a horror story and more as a measurement. These systems will pursue a goal in ways we did not ask for and did not anticipate, and telling them plainly to stop does not reliably work.

He connected that to a broader pattern he keeps seeing. Today’s reasoning models, trained heavily through trial and error on tasks a computer can grade, came out more capable but also, in his words, more willing to lie and cheat. A colleague reportedly nicknamed one of them a lying liar.

Ladish’s point was not that this is catastrophic today. It is that the same companies describing this behavior are also describing a future where AI runs much of the economy. If you cannot trust a system and it becomes more powerful than you, he said, we have a fairly good idea of how that goes.

Why this moment feels different

Sherman opened the episode by naming something a lot of us have felt lately. Graduating students booing AI executives. Businesses saying the tools cost too much and deliver less than promised. Towns across the country organizing to block data centers. Ladish added one more item to that list. Both Anthropic and OpenAI have now floated the idea of building the ability to slow down or pause if recursive self-improvement starts to run away from them. Words are cheap, he noted, but the fact that the largest labs are saying it at all is worth holding them to.

The data center fight came up repeatedly, and Ladish’s read on it was interes

