ML Systems Mastery

Learn to find where GPU compute goes to die.

10 weeks. Real data. No slides. No theory.
You will run NCCL benchmarks on GCP GPU instances — scripts, profiling data, and access provided — recover MFU, debug AllReduce latency, and build a public benchmark portfolio on GitHub.

10
Weeks
90min
Weekly live sessions
7→40%
MFU recovered — real result
$300K
Waste identified per cluster
Real result from a 128× A100 cluster:
MFU: 7% → 40%  ·  AllReduce latency: 4.2ms → 0.8ms  ·  $300K+ waste recovered
The cluster was "running normally" the whole time.

Most GPU engineers are optimizing the wrong thing. They tune hyperparameters. They upgrade hardware. They never look at what's actually killing cluster efficiency.

⚡ INVISIBLE
AllReduce consuming 55% of your training step — not compute. Communication overhead. Silent.
⚡ INVISIBLE
Your MFU is 7% — not 40%. The difference is $300K+ in wasted H100 time annually.
⚡ INVISIBLE
One MTU misconfiguration = 7× latency spike. Running for months. Nobody noticed.
⚡ INVISIBLE
Your NCCL ring is fighting your physical topology — every single training run. Silently degrading.
These aren't framework issues. They're systems problems.
And most engineers are never trained to see them.
7→40%
MFU recovery on a 128× A100 cluster after NCCL ring topology fix
4.2→0.8ms
AllReduce latency reduction after MTU misconfiguration fix
$300K+
Annual GPU waste identified and recovered in one 5-day diagnostic
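MFU is just arithmetic: the model FLOPs your job actually sustains, divided by the cluster's peak. Here is that calculation as a minimal Python sketch; the model size, token throughput, and per-GPU peak are illustrative assumptions, not figures from the diagnostic above.

```python
# Back-of-the-envelope MFU check for a transformer training run.
# Peak and throughput numbers are illustrative assumptions, not
# measurements from the case study on this page.

A100_PEAK_BF16_TFLOPS = 312          # dense BF16 peak per A100
NUM_GPUS = 128

def mfu(model_params: float, tokens_per_sec: float) -> float:
    """Model FLOPs Utilization: achieved model FLOPs/s over aggregate peak.

    Uses the common ~6 * params FLOPs-per-token estimate for the
    forward + backward pass of a dense transformer.
    """
    achieved_tflops = 6 * model_params * tokens_per_sec / 1e12
    peak_tflops = A100_PEAK_BF16_TFLOPS * NUM_GPUS
    return achieved_tflops / peak_tflops

# Hypothetical 13B-parameter model, cluster-wide token throughput measured
# before and after the communication fixes.
print(f"MFU ≈ {mfu(13e9, 36_000):.1%}")   # ≈ 7%: the step is mostly not compute
print(f"MFU ≈ {mfu(13e9, 205_000):.1%}")  # ≈ 40%: healthy for this scale
```

Everything below 100% MFU is GPU time you paid for and did not use; the gap between 7% and 40% is where the dollar figure above comes from.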
Week 1
GPU Cluster Fundamentals
MFU, step-time breakdown, where compute actually goes. Why your dashboard lies.
Week 2
NCCL Deep Dive
AllReduce algorithms, ring vs tree topology, InfiniBand vs RoCE. What breaks at scale.
Week 3
Profiling Toolkit
nsys, nccl-tests, DCGM — finding real bottlenecks. Hands-on with real cluster data (a minimal sketch of this kind of benchmark follows the curriculum).
Week 4
Communication Optimization
SHARP, GPUDirect RDMA, buffer tuning. Squeezing every MB/s out of your interconnect.
Week 5
Memory Efficiency
Activation checkpointing, gradient accumulation, batch sizing for maximum throughput (see the accumulation sketch after the curriculum).
Week 6
Parallelism Strategies
DP vs TP vs PP — real trade-offs at scale. When to use what and why.
Week 7
Multi-Node Scaling
What breaks from 8 → 128 GPUs. The failure modes nobody talks about.
Week 8
Real Cluster Case Study
7% → 40% MFU — full diagnosis walkthrough. Every decision, every tool, every fix.
Week 9
Cost Optimization
Identifying $300K+ GPU waste, building the business case, presenting to leadership.
Week 10
Capstone — Live Diagnosis
Bring your own cluster problem. Live diagnosis session. Public benchmark portfolio on GitHub.
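Weeks 2, 3, and 10 are built around exactly this kind of measurement. Below is a minimal AllReduce latency probe using torch.distributed with the NCCL backend: an illustrative sketch, not the profiling scripts provided in the cohort. It assumes CUDA GPUs and a torchrun launch.

```python
# Minimal AllReduce latency probe (torch.distributed + NCCL).
# Illustrative sketch only; not the cohort's provided scripts.
# Launch: torchrun --nproc_per_node=8 allreduce_probe.py
import os
import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    rank = dist.get_rank()

    # Sweep message sizes from 1 MB to 256 MB of fp16 gradients.
    for size_mb in (1, 4, 16, 64, 256):
        numel = size_mb * 1024 * 1024 // 2            # fp16 = 2 bytes/element
        buf = torch.randn(numel, dtype=torch.float16, device="cuda")

        for _ in range(5):                            # warm-up
            dist.all_reduce(buf)
        torch.cuda.synchronize()

        iters = 20
        start = time.perf_counter()
        for _ in range(iters):
            dist.all_reduce(buf)
        torch.cuda.synchronize()                      # wait for NCCL kernels
        avg_ms = (time.perf_counter() - start) / iters * 1e3

        if rank == 0:
            print(f"{size_mb:4d} MB  all_reduce  avg {avg_ms:7.3f} ms")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run it with NCCL_DEBUG=INFO and the log also shows the rings and transports NCCL selected, which is where ring-vs-topology mismatches first become visible.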
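For Week 5, the cheapest memory lever is usually gradient accumulation: hit a large effective batch without blowing past GPU memory. A minimal sketch follows; model, optimizer, loader, and loss_fn are placeholders for your own training setup.

```python
# Gradient accumulation: large effective batch, small memory footprint.
# Illustrative sketch; `model`, `optimizer`, `loader`, `loss_fn` are placeholders.
ACCUM_STEPS = 8   # effective batch = micro-batch size * ACCUM_STEPS

def train_epoch(model, optimizer, loader, loss_fn):
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (x, y) in enumerate(loader):
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        loss = loss_fn(model(x), y) / ACCUM_STEPS   # scale so grads average out
        loss.backward()                             # gradients accumulate in .grad
        if (step + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
```

Activation checkpointing (torch.utils.checkpoint) is the complementary lever: recompute activations in the backward pass instead of storing them, trading extra FLOPs for the headroom to raise the micro-batch. Under DDP, the non-final micro-steps usually sit inside model.no_sync() so gradients are only AllReduced once per effective batch.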
90
Minutes / session
Live
Google Meet · recorded
Real
Clusters + profiling tools
GitHub
Public benchmark portfolio
ML Engineers who want to move into AI Infrastructure — the highest-leverage, least-crowded engineering vertical right now
GPU Engineers who are tired of guessing why training is slow and want to diagnose clusters systematically
Infrastructure Engineers scaling from 8 → 128 GPUs and hitting walls nobody can explain
Senior Engineers building AI infrastructure in the Gulf region and India — Saudi Arabia, UAE, India — where GPU spend is under scrutiny
This cohort is NOT for: Beginners who haven't trained a model. People looking for theory. Anyone expecting slides.
Founding Cohort
Cohort 1
Starts Saturday May 10 · 10AM PST / 10:30PM IST
$1,499
/ seat · one-time
  • 10-week live curriculum
  • Founding cohort pricing — locked in
  • NCCL benchmarks on GCP GPU instances — access provided
  • GitHub benchmark portfolio
  • Direct access to Sankar
⚠ 1 seat remaining
DM "GPU" on LinkedIn →
Not sure which cohort? DM on LinkedIn and we'll figure it out together.
Sankar Panneer Selvam
Founder, NYDUX · AI Infrastructure Intelligence
14 years in enterprise distributed systems. 7 years specializing in GPU cluster engineering, HPC, and LLM training infrastructure. I've built and optimized GPU clusters at Capgemini, Ericsson, Ford Motor Company, Syntel, and HCL Technologies.

I don't teach from textbooks. I teach from real cluster failures, real profiling data, and real $300K recoveries. Every week in this cohort comes from something I've actually debugged in production.
NVIDIA DLI Certified
IIT Madras AI/ML
PyTorch DDP · FSDP · DeepSpeed ZeRO
NCCL · InfiniBand · TensorRT-LLM
14 Years Enterprise Systems · 7 Years GPU Clusters
Do I need access to a GPU cluster?
No. I provide GCP GPU instance access, profiling scripts, real benchmark data, and step-time results from production diagnostics. You run the analysis on real infrastructure. If you have your own cluster access, even better — we'll use it in Week 10.
What if I miss a session?
Every session is recorded. You'll have full access. That said — the live diagnosis sessions in Weeks 8 and 10 are where the real learning happens. Try to make those.
What infrastructure do we use for hands-on benchmarking?
We use GCP Spot VM instances for live benchmarking — cost is covered within the cohort fee. You'll also work with real profiling data from production cluster diagnostics (128× A100, anonymized). The GitHub benchmark repo is public: github.com/sankarbaseone/nydux-gpu-benchmarks.
I'm based in the Gulf / US / UK — does timing work?
Sessions are Saturdays at 10AM PST / 8PM Saudi / 6PM UK / 10:30PM IST. Gulf engineers: this is your Saturday evening. US engineers: Saturday morning. Works across all key markets.
What's the difference between Cohort 1 and Cohort 2?
Same 10-week curriculum. Cohort 1 is the founding cohort at $1,499 — one seat left. Cohort 2 is the validated cohort at $1,999, with the Cohort 1 case study as live proof. Cohort 2 has 10 seats and closes June 7.
Why $1,999 for Cohort 2?
Cohort 1 was founding pricing. Cohort 2 is the validated product with a real case study. The skills you learn here let you identify $300K+ in GPU waste; the fee pays for itself in the first week.
How do I enroll?
DM "GPU" (Cohort 1) or "GPU2" (Cohort 2) on LinkedIn. I'll respond within 24 hours with the payment link and onboarding details.

Stop guessing.
Start diagnosing.

10 weeks. Real GCP GPU data. The systems skills that separate GPU engineers from GPU infrastructure engineers.

Questions? DM on LinkedIn · nydux.ai