Episode #548: The Pixel Path: From Perception to Action, and the Future of Intelligent Robots with Nizar

May 25, 2026·56 min

Episode Description from the Publisher

Stewart Alsop interviews Nizar, CEO of Pixel Robotics, on the Crazy Wisdom Podcast to explore the intersection of AI, robotics, and perception. The conversation covers a wide range of technical topics, including how transformers enable multimodal representation across text, images, and voice, the role of world models in predicting physical interactions, the advantages of diffusion models over traditional LLMs for certain applications, and the challenges of achieving real-time processing for robotics applications. Nizar explains Pixel Robotics' work on creating accurate 3D meshes from smartphone cameras for companies like L'Oréal, moving away from specialized sensors to make the technology more accessible through sophisticated algorithms, and discusses the future of robotics as closing the perception-action loop to enable robots to perform real tasks beyond simple demonstrations. To find out more, visit Pixel Robotics' website.

Timestamps

00:00 Stewart welcomes Nizar, CEO of Pixel Robotics, discussing what a pixel is as the smallest visual unit on screens, composed of red, green, and blue colors
05:00 Discussion of perception systems and how logarithmic laws help compress signals in both human and artificial systems, exploring normalization layers and sigmoid functions in deep learning
10:00 Exploring how transformers unified different data modalities, including text, voice, and images, creating common representations through methods like contrastive learning
15:00 Nizar explains transformers as brute force learning systems with room for improvement through focused attention mechanisms and knowledge graphs rather than processing everything
20:00 Conversation about loss functions, local minima versus global minima, and how mixture of experts uses specialized small models instead of one massive generalist network
25:00 Discussion of deterministic versus probabilistic systems and how explicitly defined task graphs often outperform orchestrator-based approaches in AI systems
30:00 Exploring world models as predictive physics-based systems that learn environmental flows and transformations, complementing rather than replacing language models
35:00 Nizar discusses real-time processing challenges for robotics, requiring millisecond responses with small memory footprints, using vision transformers for faster experimentation
40:00 Pixel's work creating 3D meshes from smartphone cameras for companies like L'Oréal, moving away from specialized sensors toward accessible software-based solutions
45:00 Explanation of different 3D representations, including voxels, point clouds, and meshes, with meshes being optimal for manipulation and rendering in applications
50:00 Future direction involves closing perception-action loops in robotics, moving beyond dancing toy robots toward practical multimodal systems that perform real tasks
55:00 Pixel's goal is democratizing high-quality 3D scanning through smartphones, making mesh creation accessible to unlock applications in gaming, cinema, and virtual showrooms

Key Insights

1. Pixel Robotics derives its name from combining perception and action in robotics, where the pixel represents the digital perception component and robotics represents the physical action component. The pixel serves as a metaphor for how robots must quantize and digitize continuous analog information from the real world into discrete units that computer systems can process, similar to how pixels are the fundamental building blocks of images on a screen. This quantization process is essential because numerical systems cannot work with truly continuous data and must convert reality into tractable digital representations that algorithms can manipulate.

2. The transformer architecture has created a fundamental unification in how different types of data can be represented and processed across multiple modalities. Before transformers, researchers working on natural language processing, computer vision, and audio analysis used completely different approaches and methodologies. The breakthrough of transformers was establishing a common representational framework that could handle text, images, voice, and other data types using similar underlying mechanisms. This unification is what enabled the development of truly multimodal AI systems and represents one of the most significant advances beyond just the language modeling capabilities that initially gained public attention.

3. Current transformer-based systems represent a brute force approach to learning that will likely be superseded or enhanced by more efficient algorithms. Despite claims that we have exhausted internet text data for training, significant improvements continue to emerge every few months through algorithmic innovations rather than simply adding more data. Future developments will likely involve more specialize
