The AI Concepts Podcast
The AI Concepts Podcast is my attempt to turn the complex world of artificial intelligence into bite-sized, easy-to-digest episodes. Imagine a space where you can pick any AI topic and immediately grasp it, like flipping through an Audio Lexicon - but even better! Using vivid analogies and storytelling, I guide you through intricate ideas, helping you create mental images that stick. Whether you’re a tech enthusiast, business leader, technologist or just curious, my episodes bridge the gap between cutting-edge AI and everyday understanding. Dive in and let your imagination bring these concepts to life!
Episodes
Wednesday Apr 08, 2026
Module 5: In-Context Learning, Zero-Shot, and Few-Shot Prompting
This episode explores in-context learning, the idea that you can dramatically change how a model behaves just by showing it examples inside the prompt, without changing a single weight. It walks through zero-shot, one-shot, and few-shot prompting, when each one tends to work best, and why examples shape not just the answer but also the format, tone, and structure of the response. It also gets into some of the more surprising research around this, including how models can still perform well even when example labels are wrong, why example order can materially affect accuracy, and why one strong example can sometimes outperform several mediocre ones. The episode closes by framing few-shot prompting as one of the most practical and powerful skills in prompt engineering, while also pointing to the limits of prompting when a task becomes too complex.
Wednesday Apr 08, 2026
Module 5: Prompt Engineering - How Decoding and Sampling Work
This episode explores the hidden layer between your prompt and the model’s response: decoding and sampling. We look at how the model moves from a field of possible next tokens to the one it actually chooses, why the same prompt can produce different outputs, and how that variation is shaped rather than random. We walk through the core strategies you will hear over and over in prompt engineering, from greedy decoding to temperature, top-k, and top-p, and the tradeoff each one creates between precision, consistency, creativity, and control. We also touch on why these settings matter differently depending on the task, and why newer reasoning models do not always play by the same rules.
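As a rough illustration of the strategies named in this episode, here is a minimal sketch of greedy decoding, temperature, top-k, and top-p over a toy list of logits. The function names (`softmax`, `filter_candidates`, `sample_next_token`) and the convention that a temperature of 0 means greedy are assumptions made for this example, not taken from the episode or from any particular library.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_candidates(probs, top_k=None, top_p=None):
    """Return {token_index: renormalized_prob} after top-k and top-p filtering."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for idx, p in ranked:
            kept.append((idx, p))
            cumulative += p
            if cumulative >= top_p:  # smallest set whose mass reaches top_p
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {idx: p / total for idx, p in ranked}


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Pick one token index from the logits using the given settings."""
    if temperature == 0:
        # Assumed convention for this sketch: temperature 0 means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    candidates = filter_candidates(softmax(logits, temperature), top_k, top_p)
    indices = list(candidates)
    weights = [candidates[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

Calling `sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0)` always returns the highest-scoring index, while a higher temperature with `top_p=0.8` lets the second-ranked token through some of the time. That is the "shaped rather than random" variation the episode describes.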
Wednesday Apr 08, 2026
Do Business Leaders Really Need to Understand the Mechanics of AI?
In this episode, I explore a question I heard recently that sounds simple, but matters more than it seems.
Wednesday Feb 25, 2026
Module 4: Quantization - Shrinking Models Without Breaking Them
This episode tackles the lever that turns powerful LLMs into something you can actually run: quantization. We explore what it means to store model weights with fewer bits, why that can cut memory in half at 8-bit and down to roughly a quarter at 4-bit (relative to 16-bit weights), and the real tradeoff between compression and capability as rounding error accumulates across billions of parameters. We break down why large models survive this better than small ones, why 8-bit is often nearly lossless, why 4-bit can still be shockingly strong, and why going below that can make models fall apart. We compare the three practical paths you will see in the wild: GPTQ (layer-wise compression with error compensation), AWQ (protecting the most important weights), and GGUF (the local-friendly format that makes CPU and GPU splitting possible).
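To make the memory arithmetic concrete, here is a small sketch under stated assumptions: a hypothetical 7-billion-parameter model, a 16-bit baseline, and a plain round-to-nearest scheme with a single scale for the whole tensor. This is deliberately simpler than GPTQ, AWQ, or GGUF, none of which it implements, and it counts weight storage only (no KV cache, activations, or per-group scale overhead). All function names are made up for the example.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def quantize_roundtrip(weights, bits):
    """Symmetric round-to-nearest quantization, then dequantization.

    Assumes at least one nonzero weight so the scale is not zero.
    """
    levels = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def max_abs_error(weights, bits):
    """Largest rounding error introduced by the round trip."""
    restored = quantize_roundtrip(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, restored))


if __name__ == "__main__":
    params = 7e9  # hypothetical model size
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(params, bits):.1f} GB")

    toy_weights = [0.9, -0.31, 0.05, 0.47, -0.88]  # illustrative values only
    for bits in (8, 4):
        print(f"{bits}-bit max rounding error: {max_abs_error(toy_weights, bits):.4f}")
```

With fewer bits there are fewer representable levels, so the worst-case rounding error per weight grows. Practical methods exist largely to manage that error better than this naive scheme does.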
Tuesday Feb 24, 2026
Module 4: Optimization - The GPU Memory Bottleneck
This episode addresses the real bottleneck after you build an LLM: fitting it into hardware that can actually run it. We explore why GPU memory is the scarce resource, how weights, KV cache, and activations compete for that space, and what that means in practice when prompts get long or concurrency spikes. We compare data center GPUs (high-bandwidth HBM) with local machines like the Mac Studio (huge unified memory but slower bandwidth) to show the core tradeoff between capacity and speed. By the end, you will understand how to choose hardware based on your goal, and why the next lever is quantization to shrink models enough to fit, with a closing reflection on perspective when something big feels like it will not fit.
Friday Feb 20, 2026
Module 3: Reinforcement Learning from Human Feedback
This episode addresses how Reinforcement Learning from Human Feedback (RLHF) adds the final layer of alignment after supervised fine-tuning, shifting the training signal from “right vs wrong” to “better vs worse.” We explore how preference rankings create a reward signal (reward models plus PPO) and the newer shortcut (DPO) that learns preferences directly, then connect RLHF to safety through the Helpful, Honest, Harmless goal. We also unpack the “alignment tax,” the trade-off between being safe and being genuinely useful, and close by setting up the next module on running models at scale, starting with GPU memory limits, plus a personal reflection on starting later without being behind.
Friday Feb 20, 2026
Module 3: Supervised Fine-Tuning
This episode addresses how we turn a raw base model into something that behaves like a real assistant using Supervised Fine-Tuning (SFT). We explore instruction and response training data, why SFT makes behaviors consistent beyond prompting, and the practical engineering choices that keep fine-tuning efficient and safe, including low learning rates and LoRA-style adapters. By the end, you will understand what SFT solves, and why the next layer (RLHF) is needed to add human preference and nuance.
Monday Jan 26, 2026
Module 3: Context Windows & Attention Complexity
This episode addresses the physical and mathematical limits of a model’s "short-term memory." We explore the context window and the engineering trade-offs required to process long documents. You will learn about the quadratic cost of attention, where doubling the input length quadruples the computational work, and why this creates a massive bottleneck for long-form reasoning. We also introduce tricks like Flash Attention that allow us to push these limits further. By the end, you will understand why context is the most expensive real estate in the generative stack.
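The "doubling quadruples" claim follows from standard self-attention comparing every token with every other token. Here is a minimal sketch that counts those pairwise scores for a single attention head in a single layer; the function names and the sequence lengths are picked only for illustration.

```python
def attention_score_entries(seq_len):
    """Number of query-key scores for one head in one layer: one per token pair."""
    return seq_len * seq_len


def growth_factor(short_len, long_len):
    """How much the pairwise work grows when the input goes from short to long."""
    return attention_score_entries(long_len) / attention_score_entries(short_len)


if __name__ == "__main__":
    for n in (1_024, 2_048, 4_096):
        print(f"{n:>5} tokens -> {attention_score_entries(n):>12,} scores")
    print("2x the tokens ->", growth_factor(1_024, 2_048), "x the scores")
```

This counts only the score matrix. A real model multiplies it by the number of heads and layers, but the quadratic shape of the curve stays the same.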
Sunday Jan 25, 2026
Module 3: The Lifecycle of an LLM: Pre-Training
This episode explores the foundational stage of creating an LLM, known as the pre-training phase. We break down the Trillion Token Diet by explaining how models move from random weights to sophisticated world models through the simple objective of next-token prediction. You will learn about the Chinchilla scaling laws, which describe the mathematical relationship between model size and training data volume. We also discuss why the industry shifted from building bigger brains to better-fed ones. By the end, you will understand the transition from raw statistical probability to parametric memory.
Tuesday Jan 06, 2026
Module 2: The MLP Layer - Where Transformers Store Knowledge
Shay explains where a transformer actually stores knowledge: not in attention, but in the MLP (feed-forward) layer. The episode frames the transformer block as a two-step loop: attention moves information between tokens, then the MLP transforms each token’s representation independently to inject learned knowledge.




