Daily Paper Cast Podcast Summary — Free Daily Recap

Latest Episodes

The most recent episodes — sign up to get AI-powered summaries of each one.

Yesterday · 20 min
TurboVLA: Real-Time Vision-Language-Action Model at 32 Hz on an RTX 4090 with <1 GB VRAM
🤗 Upvotes: 122 | cs.CV, cs.RO
Authors: Hengyi Xie, Chenfei Yao, Xianjin Wu, Xuanyang Xi, Yiping Tang, Di Xu, Yingying Zhu, Dingkang Liang, Xiang Bai, Han Ding
Arxiv: http://arxiv.org/abs/2607.27205v1
Abstract: Vision-language-action (VLA) models commonly adopt an LLM-centric V → L → A pathway, where visual observations are projected into the representation space of a large language model before being decoded into robot actions. Although effective, this design incurs substantial computation and memory overhead at every policy invocation. In this work, we introduce TurboVLA, a new VLA paradigm that reformulates the conventional V → L → A pathway as a direct V + L → A mapping. Instead of using a large language model as the central interface between perception and action, TurboVLA independently encodes visual observations and language instructions, directly exchanges information between them through lightweight bidirectional vision-language interaction, and predicts continuous action chunks with a compact decoder. This simple design constructs task-conditioned representations directly from visual and linguistic features, significantly reducing the computational and memory costs of VLA inference. On LIBERO, TurboVLA achieves 97.7% average success with only 0.2B parameters, 31.2 ms inference latency, and 0.9 GB inference VRAM on a consumer-grade RTX 4090, matching or outperforming substantially larger VLA policies. These results establish TurboVLA as a simple and effective alternative to the prevailing LLM-centric VLA paradigm, offering a new perspective on how vision, language, and action can be connected for efficient robotic manipulation. Code is available at https://github.com/H-EmbodVis/TurboVLA.
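The 32 Hz figure in the title lines up with the 31.2 ms latency quoted in the abstract. A quick check, assuming one policy invocation per control step and no other per-step overhead (the function name is ours, not the paper's):

```python
def control_rate_hz(latency_ms: float) -> float:
    """Upper bound on policy calls per second when calls run back to back."""
    return 1000.0 / latency_ms


rate = control_rate_hz(31.2)
print(f"{rate:.1f} Hz")  # prints "32.1 Hz"
```

Any extra time spent on sensing or actuation per step would lower the achievable rate below this bound.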
Yesterday · 20 min
CoRT: Counterfactual Replay for Token-Level Rubric-Guided Policy Optimization
🤗 Upvotes: 78 | cs.AI
Authors: Bo-Wen Zhang, Junwei He, Wen Wang, Song-Lin Lv, Wentao Ma, Rongyi Lin, Shuhan Zhong, Lan-Zhe Guo
Arxiv: http://arxiv.org/abs/2607.25659v1
Abstract: Rubric-based reinforcement learning enriches language model training by evaluating model outputs against explicit criteria. Yet in GRPO-style pipelines, these structured judgments are reduced to a scalar response-level reward and converted into a response-level advantage, which is broadcast uniformly to all generated tokens. This leaves no explicit mechanism for allocating credit within a response, even when different criteria are grounded in different spans, formatting decisions, or semantic choices. We propose CoRT, a token-level credit weighting method for rubric-conditioned GRPO. Instead of training an auxiliary token scoring model, CoRT uses counterfactual replay to rescore the same sampled response under the original rubric-conditioned prompt and a matched criteria-free prompt. The resulting tokenwise log-likelihood contrasts serve as a proxy for dependence on the rubric context. CoRT maps these contrasts to bounded, response-normalized weights and uses them to redistribute the signed GRPO advantage across tokens, without introducing an auxiliary scorer or changing the response-level reward. Experiments across instruction-tuned models and reward granularities show that CoRT improves over matched response-level GRPO in the vast majority of comparisons, with an average gain of 4.4 percentage points. The method remains competitive with learned token-level credit baselines while avoiding a separate relevance-learning stage. These results suggest that policy-internal counterfactual likelihood contrasts provide an effective training signal for within-response credit allocation while retaining the simplicity and stability of GRPO.
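The redistribution step described in the abstract can be illustrated with a small sketch. The abstract does not give the exact mapping from contrasts to weights, so the sigmoid squashing, the `temperature` parameter, and the mean-one normalization below are our assumptions, not the authors' method:

```python
import math


def token_weights(logp_rubric, logp_plain, temperature=1.0):
    """Turn per-token log-likelihood contrasts into weights with mean 1.

    logp_rubric: token log-probs of the response under the rubric prompt.
    logp_plain:  token log-probs of the same response under a criteria-free prompt.
    """
    contrasts = [a - b for a, b in zip(logp_rubric, logp_plain)]
    # Assumption: a sigmoid keeps every raw weight inside (0, 1).
    raw = [1.0 / (1.0 + math.exp(-c / temperature)) for c in contrasts]
    mean = sum(raw) / len(raw)
    # Assumption: normalizing to mean 1 keeps the average token advantage
    # equal to the response-level advantage.
    return [r / mean for r in raw]


def token_advantages(response_advantage, weights):
    """Spread one signed response-level advantage across tokens."""
    return [response_advantage * w for w in weights]


weights = token_weights([-1.0, -0.5, -2.0], [-1.0, -2.0, -2.0])
advantages = token_advantages(-2.0, weights)
```

In this toy input only the second token becomes more likely under the rubric prompt, so it receives the largest share of the (negative) advantage while the sign is preserved for every token.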
Yesterday · 19 min
HumanCLAW: Can Vision-Language Models Act Through a Body?
🤗 Upvotes: 67 | cs.CV, cs.RO
Authors: Siyao Li, Jiawei Gu, Shuai Liu, Kairui Hu, Zekun Li, Linjie Li, Chengcheng Tang, Po-Chen Wu, Ivan Shugurov, Lingni Ma, Michael Zollhoefer, Sizhe An, Abhay Mittal, Amy Zhao, Ranjay Krishna, Manling Li, Ziwei Liu, Chuan Guo
Arxiv: http://arxiv.org/abs/2607.27180v1
Abstract: Evaluating whether a vision-language model (VLM) can act through a physical body is challenging. The outcome of an action couples the VLM's decision with motor control. When a task fails, it is hard to tell whether the VLM made a bad choice or the motor controller simply failed to execute it, e.g., losing balance and falling. In this work, we introduce HumanCLAW, an evaluation framework that decouples action decision-making from low-level execution. At every step, a harnessed, off-the-shelf VLM issues an atomic skill command, and the command is translated into a sub-second chunk of continuous full-body motion with real physical consequences, including gravity and collisions. The body can therefore act freely in the physical world, while execution-side disturbances, balance and motor errors, are factored out. What remains measurable is the model's action intelligence: its moment-to-moment choice of what the body should execute next. Based on this framework, we build HumanCLAW-Bench: 1,218 long-horizon, egocentric find-navigate-interact episodes across 41 indoor scenes. We test nine state-of-the-art VLMs and find that none solves the benchmark; the best model reaches only a 16.8% success rate. Recognizing the target is not the bottleneck. What current VLMs lack is embodied self-awareness: they lose track of their own body, failing to tell where it is, whether it has reached the goal, or whether it has hit an obstacle.
Yesterday · 19 min
DecoEvo: Score-Decoupled Co-Evolution of Solver and Rubric-Generator Skills in Text Space
🤗 Upvotes: 57 | cs.AI
Authors: Jiangwang Chen, Zixin Song, Junlin Liu, Shuaiyu Zhou, Haiyan Wu, Haihan Shi, Chenxi Zhou, Hanqing Li, Xiao Yang, Da Zhu, Guanjun Jiang, Hai Wan, Xibin Zhao
Arxiv: http://arxiv.org/abs/2607.25675v1
Abstract: Text-space optimization adapts large language models (LLMs) by editing external natural-language artifacts rather than model weights, so the optimized artifacts remain inspectable and the model can be treated as a black box. However, most existing text-space methods keep evaluation fixed. On open-ended tasks, this can become a bottleneck: once the solver improves on the criteria a rubric measures, omitted dimensions remain invisible to the optimization signal. Simply evolving the rubric is also unreliable when updates are selected by the current solver's score, because apparent progress can come from making the rubric easier to satisfy. We introduce DecoEvo (Decoupled Co-Evolution), which co-evolves a solver skill and a rubric-generator skill under decoupled objectives without using gold rubrics during optimization. The solver skill is updated using criterion-level feedback, while the rubric-generator skill is revised through complementary audits of requirement coverage and response discrimination that are independent of aggregate solver score. This separation focuses generator updates on newly exposed solver weaknesses, reducing repeated emphasis on criteria the solver already satisfies. Under each benchmark's official evaluation, DecoEvo outperforms all compared methods across five benchmarks and three LLM backbones, yielding 2.8% to 5.0% relative gains over SkillOpt in the five-benchmark average.
Yesterday · 21 min
CLBench-V: Evaluating Multimodal Context Learning from Grounding to Knowledge Acquisition
🤗 Upvotes: 41 | cs.CV, cs.AI, cs.CL, cs.LG
Authors: Lai Wei, Chengqi Li, Jiapeng Li, Ruina Hu, Yue Wang, Weiran Huang
Arxiv: http://arxiv.org/abs/2607.25294v1
Abstract: Real-world tasks often require models to learn from task-specific context rather than relying only on pre-trained knowledge. While recent work has highlighted this capability as context learning, existing evaluations mainly focus on textual contexts. In many practical settings, however, the context to be learned from is multimodal: scientific findings are conveyed through figures and tables, financial indicators are scattered across converted reports, and spatial decisions depend on maps, scenes, or web pages. We introduce CLBench-V, a benchmark for multimodal context learning that addresses the difficulty of localizing where context use breaks down by organizing tasks around three dimensions: context grounding, new information application, and new knowledge learning. CLBench-V combines converted public benchmarks with newly constructed datasets spanning domains such as science, finance, long-document understanding, spatial reasoning, and web-based visual question answering. To reduce the cost of constructing domain-specific context-learning tasks, we further use automated construction and filtering procedures for our newly built datasets. Across 3,443 instances and six recent multimodal models, the best overall score is only 0.2847, indicating that multimodal context learning remains far from saturated. Moreover, InternVL3.5-30B-A3B performs best on context grounding and new knowledge learning, while Qwen3.5-Plus performs best on new information application. We further analyze judge reliability, context length, image count, and representative failure cases. Code is available at https://github.com/IamLihua/CLBench-V.
Yesterday · 20 min
CAST: Game Solvers as Turn-Level Teachers for LLM Agents
🤗 Upvotes: 31 | cs.CL, cs.AI
Authors: Yu Wang, Yi-Kai Zhang, Wentao Shi, Ziang Ye, Yuchun Miao, Yueqing Sun, Qi Gu, Xunliang Cai, Lan-Zhe Guo, Han-Jia Ye, Fuli Feng
Arxiv: http://arxiv.org/abs/2607.25308v1
Abstract: Training large language models (LLMs) to act in long-horizon games is a promising step toward generalist decision-making, yet reinforcement learning with verifiable rewards (RLVR) relies on sparse final rewards that reveal little about which decisions determine success. Denser process signals could supply this missing turn-level credit, but existing sources are hard to keep both cheap and accurate. We observe that changes in a game solver's state value reveal whether an action advances the state toward success. Building on this insight, we propose CAST (Credit Assignment from Solver Teachers), which converts these value changes into solver advantages and injects them into RLVR as turn-level signals. We further show that, under a soft-optimal solver assumption, maximizing the solver advantage is equivalent to on-policy distillation from the solver, requiring only scalar values rather than teacher logits. Across Sokoban, Minesweeper, and Rush Hour, CAST outperforms all trained baselines on every game under both in-domain and unseen-difficulty evaluation and achieves the highest average zero-shot performance on ALFWorld and WebShop. Our code is available at https://github.com/Wloner0809/CAST.
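The core idea, reading turn-level credit off changes in a solver's state value, can be sketched in a few lines. This is a minimal illustration only: the additive combination with the final reward and the `beta` coefficient are hypothetical choices of ours, since the abstract does not say how the signals are injected into RLVR.

```python
def solver_advantages(state_values):
    """Per-turn value change: V(next state) - V(current state).

    state_values holds the solver's value before each turn, plus one
    final entry for the state reached after the last turn.
    """
    return [nxt - cur for cur, nxt in zip(state_values, state_values[1:])]


def turn_level_signals(final_reward, state_values, beta=0.5):
    """Hypothetical mix: sparse final reward plus a scaled solver advantage."""
    return [final_reward + beta * adv for adv in solver_advantages(state_values)]


# Solver values for a three-turn episode that ends in success.
values = [0.2, 0.5, 0.4, 1.0]
advantages = solver_advantages(values)
signals = turn_level_signals(1.0, values)
```

Here the second turn moves the state away from success (the value drops from 0.5 to 0.4), so it gets a negative advantage even though the episode as a whole is rewarded.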
2 days ago · 19 min
HiFi-UMI: Learning Deployable Manipulation Policies from High-Fidelity UMI Data Alone
🤗 Upvotes: 137 | cs.RO, cs.CV, cs.LG
Authors: Simple AI: Yuteng Wei, Jinming Ma, Jiawei Wang, Weitao Zhou, Yushen Zuo, Ke Rui, Minglei Li, Jinhao Zhang, Zhikang Pan, Xiang Wang, Haoran Jia, Huan Du, Zicheng Zeng, Jun Ma, Guiyu Qin, Di Zhang, Xiaofei Li
Arxiv: http://arxiv.org/abs/2607.25895v1
Abstract: Learning deployable manipulation policies is bottlenecked by the scarcity of data that is both high-fidelity and scalable. Real-robot teleoperation is accurate but costly to scale; robot-free UMI capture scales readily, and current practice uses the resulting data mainly for pre-training, adding a small real-robot "anchor" at post-training. We ask whether raising the fidelity of robot-free UMI data, rather than shrinking the real-robot fraction, can remove that anchor. We present HiFi-UMI, a portable UMI data-production system co-designed for trajectory accuracy, inter-gripper relative pose, synchronization, and field of view: head-mounted offline stereo-inertial SLAM, native rather than reconstructed relative pose, a shared microsecond GPIO trigger, and two wide-angle cameras per hand covering ~200 degrees. It reaches 3 mm workspace-local end-effector accuracy without external tracking infrastructure. Using this corpus, we demonstrate zero-robot post-training: a policy post-trained solely on HiFi-UMI demonstrations deploys directly on a real robot and matches in-domain teleoperation across three backbones spanning the vision-language-action and world-action-model families, with success-rate differences of -2.5, +3.1, and -0.6 percentage points on StarVLA-QwenPI, OpenPI-pi_0.5, and LingBot-VA; the strongest policy reaches 85% on a precision insertion task, even though the teleoperation baseline is collected in the evaluation scene and no HiFi-UMI trajectory is. Pre-training on 4,000 hours from the same corpus lowers action error on ten unseen tasks by 41% and, on StarVLA-QwenPI, raises real-robot success by a further 18.1 percentage points. We open-source HiFi-UMI-2K, 2,000 hours of microsecond-synchronized, ultra-wide-FoV demonstrations, each automatically reconstructed and validated through simulation replay, as a large-scale, high-fidelity resource for the robot-learning community.
2 days ago · 19 min
A New Role for Relevance: Guiding Corpus Interaction in Agentic Search
🤗 Upvotes: 85 | cs.CL
Authors: Jiangnan Li, Yuqing Li, Mo Yu, Jinchao Zhang, Jie Zhou
Arxiv: http://arxiv.org/abs/2607.24223v1
Abstract: Relevance is a query-dependent estimate of whether a document or excerpt contains useful evidence. Existing retrieval agents use relevance to select top-k content, but document relevance alone cannot localize, compose, or verify the evidence required by complex questions. Direct Corpus Interaction (DCI) enables such fine-grained operations through grep-style exploration, but its relevance-agnostic search can expose useful clues late and delay convergence. Recent advances use relevance to narrow the corpus into a working space for interaction. Once interaction begins, however, relevance still does not directly guide which documents grep searches first or distinguish informative excerpts from a broad set of matches to let LLMs see them first. We introduce the Relevance-Aware RipGrep Search Agent (RARG), which turns relevance into an execution prior for corpus interaction. RARG provides coarse-to-fine relevance guidance: it orders documents for sequential 'ripgrep' traversal to expose globally relevant clues earlier, initializes promising entry points with query-relevant paragraphs, and reranks grep matches to surface informative excerpts that document-level ranking may otherwise obscure. Across challenging browse question answering and reasoning-intensive retrieval, RARG improves the accuracy-efficiency frontier over retrieval-based and direct-interaction agents. These results demonstrate that relevance-aware interaction enables faster and more reliable search convergence.
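Two of the three guidance steps in the abstract, ordering documents before traversal and reranking grep matches, can be sketched with the standard library alone. Everything here is illustrative: `overlap_score` is a toy stand-in for a real relevance model, Python's `re` module stands in for ripgrep, and the entry-point initialization step is omitted.

```python
import re


def overlap_score(query, text):
    """Toy relevance: fraction of query terms that appear in the text."""
    terms = set(query.lower().split())
    words = set(re.findall(r"\w+", text.lower()))
    return len(terms & words) / len(terms) if terms else 0.0


def relevance_guided_grep(query, pattern, corpus):
    """corpus maps a document name to its text.

    Returns the traversal order and the reranked (name, line) matches.
    """
    # Step 1: visit the most relevant documents first.
    ordered = sorted(corpus, key=lambda n: overlap_score(query, corpus[n]), reverse=True)
    regex = re.compile(pattern, re.IGNORECASE)
    matches = [
        (name, line)
        for name in ordered
        for line in corpus[name].splitlines()
        if regex.search(line)
    ]
    # Step 2: rerank matched excerpts by their own relevance to the query.
    matches.sort(key=lambda m: overlap_score(query, m[1]), reverse=True)
    return ordered, matches


corpus = {
    "a.txt": "The cat sat.\nNothing here.",
    "b.txt": "France borders Spain.\nParis is the capital of France.",
}
order, hits = relevance_guided_grep("capital of France", "france", corpus)
```

With this toy corpus, `b.txt` is traversed first, and the line about Paris is surfaced ahead of the earlier but less informative match in the same file.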

Get Daily Paper Cast summaries in your inbox

Free AI-powered daily recaps. Key takeaways, quotes, and mentions — in a 5-minute read.

Get Free Summaries →

Free forever for up to 3 podcasts. No credit card required.

You Might Also Like

Listeners also like.

Latent Space: The AI Engineer Podcast

Covers advances in AI engineering, including foundation models, code generation, and AI agents, through interviews with researchers and developers.

Everyday AI Podcast – An AI and ChatGPT Podcast

Practical AI and ChatGPT tips for professionals to improve productivity and grow their careers.

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Interviews with AI developers and researchers exploring the transformative impact of artificial intelligence on society and technology.

OpenAI Podcast

Conversations with OpenAI researchers and builders exploring how frontier AI models are developed and used in practice.

This Day in AI Podcast

Two friends discuss artificial intelligence, sharing casual insights, personal experiments, and humorous experiences with AI tools and technology.

The AI Daily Brief: Artificial Intelligence News and Analysis

A daily analysis of artificial intelligence news, exploring its creative potential, industry impacts, and ethical challenges.

NVIDIA AI Podcast

Explores how artificial intelligence and emerging technologies are driving innovation across science, sustainability, and industry.

The AI XR Podcast.

Industry insiders interview top founders and executives on AI, spatial computing, VR/AR, and synthetic media.

The Anthropic AI Daily Brief

Breaks down Anthropic's latest AI advancements and their real-world implications in clear, accessible language.

How I AI

A practical guide to using AI tools in work and life, featuring guests who share specific, actionable techniques and workflows.

跨国串门儿计划

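Uses AI to translate English-language podcasts into Chinese while preserving the original voices, so Chinese-speaking listeners can follow foreign-language content.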

AI and I

Interviews with professionals who use AI tools in their work, exploring how AI affects creativity, thinking, and daily life through live demonstrations.

About Daily Paper Cast

We update every weekday to discuss the highest-voted papers from Hugging Face Daily Papers. Both the podcast scripts and audio are generated by AI. Creators: Jingwen Liang (3D ML) and Gengyu Wang (LLM ML).

By Jingwen Liang, Gengyu Wang

Science, Technology

Customized Recaps

AI-powered recaps with compact key takeaways, quotes, and insights.

Straight to Your Inbox

Get key takeaways from Daily Paper Cast in a 5-minute read.

Save Hours Every Week

Stay current on your favorite podcasts without falling behind.

Frequently Asked Questions

What is Podzilla's Daily Paper Cast daily summary?

It's a free AI-powered email that summarizes new episodes of Daily Paper Cast as soon as they're published. You get the key takeaways, notable quotes, and links & mentions — all in a quick read.

How does the Daily Paper Cast podcast summary work?

When a new episode drops, our AI transcribes and analyzes it, then generates a personalized summary tailored to your interests and profession. It's delivered to your inbox every morning.

Is this an official Daily Paper Cast product?

No. Podzilla is an independent service that summarizes publicly available podcast content. We're not affiliated with or endorsed by Jingwen Liang or Gengyu Wang.

Can I get summaries of other podcasts too?

Absolutely! The free plan covers up to 3 podcasts. Upgrade to Pro for 15, or Premium for 50. Browse our full catalog at /podcasts.

How often does Daily Paper Cast release new episodes?

Daily Paper Cast publishes every weekday. Our AI generates a summary within hours of each new episode.

What topics does Daily Paper Cast cover?

Daily Paper Cast covers topics including Science and Technology. Our AI identifies the specific themes in each episode and highlights what matters most to you.

Start getting Daily Paper Cast summaries tomorrow morning.

Free forever for up to 3 podcasts. No credit card required.

Get Free Summaries →
