The AI Concepts Podcast

The AI Concepts Podcast is my attempt to turn the complex world of artificial intelligence into bite-sized, easy-to-digest episodes. Imagine a space where you can pick any AI topic and immediately grasp it, like flipping through an audio lexicon, but even better! Using vivid analogies and storytelling, I guide you through intricate ideas, helping you create mental images that stick. Whether you’re a tech enthusiast, business leader, technologist, or just curious, my episodes bridge the gap between cutting-edge AI and everyday understanding. Dive in and let your imagination bring these concepts to life!

Episodes

Feb 24, 2026

Module 4: Optimization - The GPU Memory Bottleneck

13 min

This episode addresses the real bottleneck after you build an LLM: fitting it into hardware that can actually run it. We explore why GPU memory is the scarce resource, how weights, KV cache, and activations compete for that space, and what that means in practice when prompts get long or concurrency spikes. We compare data center GPUs (high-bandwidth HBM) with local machines like the Mac Studio (huge unified memory but lower bandwidth) to show the core tradeoff between capacity and speed. By the end, you will understand how to choose hardware based on your goal, and why the next lever is quantization to shrink models enough to fit. The episode closes with a reflection on perspective when something big feels like it will not fit.
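
To make the competition for memory concrete, here is a rough back-of-the-envelope sketch. It is not taken from the episode: the function names and the model dimensions in the usage note are hypothetical, and the estimate ignores activations and framework overhead.

```python
GIB = 2**30  # bytes per gibibyte


def weights_gib(n_params: int, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights."""
    return n_params * bytes_per_param / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int,
                 bytes_per_value: int = 2) -> float:
    """Memory for cached keys and values across all layers.

    The leading 2 accounts for storing one key and one value vector
    per token, per KV head, per layer. Note that the result grows
    linearly with both prompt length and the number of concurrent requests.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / GIB
```

For a hypothetical 7-billion-parameter model stored at 2 bytes per weight, `weights_gib(7_000_000_000, 2)` is about 13 GiB. With 32 layers, 32 KV heads of dimension 128, and a 4,096-token context, `kv_cache_gib(32, 32, 128, 4096, 1)` adds 2 GiB per request, so eight concurrent requests would need 16 GiB of cache, more than the weights themselves.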

Feb 20, 2026

Module 3: Reinforcement Learning from Human Feedback

9 min

This episode addresses how Reinforcement Learning from Human Feedback (RLHF) adds the final layer of alignment after supervised fine-tuning, shifting the training signal from “right vs wrong” to “better vs worse.” We explore how preference rankings create a reward signal (reward models plus PPO) and the newer shortcut (DPO) that learns preferences directly, then connect RLHF to safety through the Helpful, Honest, Harmless goal. We also unpack the “alignment tax,” the trade-off between being safe and being genuinely useful, and close by setting up the next module on running models at scale, starting with GPU memory limits, plus a personal reflection on starting later without being behind.

Feb 20, 2026

Module 3: Supervised Fine Tuning

8 min

This episode addresses how we turn a raw base model into something that behaves like a real assistant using Supervised Fine-Tuning (SFT). We explore instruction and response training data, why SFT makes behaviors consistent beyond prompting, and the practical engineering choices that keep fine-tuning efficient and safe, including low learning rates and LoRA-style adapters. By the end, you will understand what SFT solves, and why the next layer (RLHF) is needed to add human preference and nuance.

Jan 26, 2026

Module 3: Context Windows & Attention Complexity

10 min

This episode addresses the physical and mathematical limits of a model’s "short-term memory." We explore the context window and the engineering trade-offs required to process long documents. You will learn about the quadratic cost of attention, where doubling the input length quadruples the computational work, and why this creates a massive bottleneck for long-form reasoning. We also introduce architectural tricks like FlashAttention that allow us to push these limits further. By the end, you will understand why context is the most expensive real estate in the generative stack.
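
The quadratic claim can be checked with a few lines of arithmetic. This is a minimal sketch, not the episode's own material: the function name is hypothetical, and it counts only the query-key score computation for a single attention head.

```python
def attention_score_flops(seq_len: int, head_dim: int) -> int:
    """Approximate FLOPs to compute the Q @ K^T score matrix for one head.

    Every token is compared with every other token (seq_len * seq_len
    dot products), and each dot product costs about 2 * head_dim
    operations (one multiply and one add per dimension).
    """
    return 2 * seq_len * seq_len * head_dim


short = attention_score_flops(seq_len=1024, head_dim=64)
long = attention_score_flops(seq_len=2048, head_dim=64)
ratio = long / short  # doubling the input length gives a ratio of 4.0
```

The ratio does not depend on the head dimension, which is why context length, rather than model width, dominates this part of the cost.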

Jan 25, 2026

Module 3: The Lifecycle of an LLM: Pre-Training

10 min

This episode explores the foundational stage of creating an LLM, known as the pre-training phase. We break down the Trillion Token Diet by explaining how models move from random weights to sophisticated world models through the simple objective of next-token prediction. You will learn about the Chinchilla scaling laws, which describe the mathematical relationship between model size and training data volume. We also discuss why the industry shifted from building bigger brains to better-fed ones. By the end, you will understand the transition from raw statistical probability to parametric memory.
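
As a rough illustration of the size-versus-data relationship, the sketch below uses two commonly cited approximations that are assumptions here rather than claims from the episode: about 20 training tokens per parameter for compute-optimal training, and about 6 FLOPs per parameter per token for training compute. The function names are hypothetical.

```python
TOKENS_PER_PARAM = 20        # assumed rule of thumb, not an exact law
FLOPS_PER_PARAM_TOKEN = 6    # assumed approximation for training compute


def compute_optimal_tokens(n_params: int) -> int:
    """Roughly how many training tokens a model of this size 'wants'."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate total training compute for a dense transformer."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


params = 70_000_000_000                    # a 70-billion-parameter model
tokens = compute_optimal_tokens(params)    # 1.4 trillion tokens
flops = training_flops(params, tokens)
```

Under these assumptions, a model with fewer parameters trained on more tokens can use the same compute budget as a larger, underfed one, which is the "better fed" idea in a nutshell.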

Jan 6, 2026

Module 2: The MLP Layer - Where Transformers Store Knowledge

7 min

Shay explains where a transformer actually stores knowledge: not in attention, but in the MLP (feed-forward) layer. The episode frames the transformer block as a two-step loop: attention moves information between tokens, then the MLP transforms each token’s representation independently to inject learned knowledge.

Jan 5, 2026

Module 2: The Encoder (BERT) vs. The Decoder (GPT)

8 min

Shay breaks down the encoder vs decoder split in transformers: encoders (BERT) read the full text with bidirectional attention to understand meaning, while decoders (GPT) generate text one token at a time using causal attention.
She ties the architecture to training (masked-word prediction vs next-token prediction), explains why decoder-only models dominate today (they can both interpret prompts and generate efficiently with KV caching), and previews the next episode on the MLP layer, where most learned knowledge lives.
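
The difference between bidirectional and causal attention comes down to which positions each token is allowed to look at. Here is a minimal sketch, with a hypothetical helper name, that builds that visibility pattern as a grid of booleans.

```python
def attention_mask(n_tokens: int, causal: bool) -> list[list[bool]]:
    """Return mask[i][j] == True when token i may attend to token j.

    Bidirectional (encoder-style): every token sees every token.
    Causal (decoder-style): token i sees only positions j <= i,
    so it can never peek at tokens that have not been generated yet.
    """
    return [
        [(j <= i) if causal else True for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


encoder_mask = attention_mask(3, causal=False)
decoder_mask = attention_mask(3, causal=True)
# decoder_mask is lower-triangular:
# [[True, False, False],
#  [True, True,  False],
#  [True, True,  True]]
```

Because a causal token's view of earlier tokens never changes as new tokens are appended, the keys and values for earlier positions can be stored and reused, which is what makes KV caching possible.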

Jan 5, 2026

Module 2: Multi Head Attention & Positional Encodings

9 min

Shay explains multi-head attention and positional encodings: how transformers run multiple parallel attention 'heads' that specialize, why we concatenate their outputs, and how positional encodings reintroduce word order into parallel processing.
The episode uses clear analogies (lawyer, engineer, accountant), highlights GPU efficiency, and previews the next episode on encoder vs decoder architectures.

Jan 3, 2026

Module 2: Inside the Transformer - The Math That Makes Attention Work

11 min

In this episode, Shay walks through the transformer's attention mechanism in plain terms: how token embeddings are projected into queries, keys, and values; how dot products measure similarity; why scaling and softmax produce stable weights; and how weighted sums create context-enriched token vectors.
The episode previews multi-head attention (multiple perspectives in parallel) and ends with a short encouragement to take a small step toward your goals.
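
The steps described above (project, compare, scale, softmax, weighted sum) can be sketched in plain Python. This is an illustrative single-head version under simplifying assumptions: no masking, no batching, and helper names (`matmul`, `softmax`, `attention`) chosen for this example only.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention."""
    # 1. Project token embeddings into queries, keys, and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # 2. Dot products measure query-key similarity; dividing by
    #    sqrt(d_k) keeps the scores in a range where softmax is stable.
    scores = [[sum(qi * ki for qi, ki in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in k] for q_row in q]
    # 3. Softmax converts each row of scores into attention weights.
    weights = [softmax(row) for row in scores]
    # 4. A weighted sum of values gives each token its enriched vector.
    return matmul(weights, v), weights


# Three tokens with 2-dimensional embeddings; identity projections
# are used only to keep the example easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to 1, and `output` has one context-enriched vector per input token, with the same dimension as the value vectors.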

Jan 3, 2026

Module 2: Attention Is All You Need (The Concept)

11 min

Shay breaks down the 2017 paper "Attention Is All You Need" and introduces the transformer: a non-recurrent architecture that uses self-attention to process entire sequences in parallel.
The episode explains positional encoding, how self-attention creates context-aware token representations, the three key advantages over RNNs (parallelization, global receptive field, and precise signal mixing), the quadratic computational trade-off, and teases a follow-up episode that will dive into the math behind attention.