Find where your cluster
is losing $300K+/year.
Most AI teams assume slow training is a hardware limitation. It isn't. 30–60% of GPU compute is lost to silent misconfigurations — NCCL buffer misalignment, MTU boundary fragmentation, parallelism strategy mismatch, and GPU idle time that never appears in utilization dashboards.
No commitment. If nothing is wrong, I'll tell you in 48h.
⚡ Only 2 diagnostic slots available this month.
Focused only on GPU cluster performance and distributed training systems.
Six deliverables. Every engagement.
Not a generic report. A precise breakdown of your cluster's inefficiency — layer by layer, with specific fixes and the dollar value of each.
Achieved FLOPs vs theoretical FLOPs across your training workload. Normalized for architecture-specific XLA/CUDA compilation and interconnect behavior. You'll know your exact efficiency floor.
Output: MFU % with benchmark comparison

Separation of compute time, AllReduce/communication time, input pipeline time, and XLA compile time per training step. Most teams have never seen this breakdown — it shows exactly where time goes.
Output: Per-step time attribution chart

AllReduce algorithm selection, buffer size alignment to MTU, SHARP enablement, ring vs tree topology analysis, IB vs RoCE tuning. NCCL misconfiguration is the #1 source of recoverable waste.
Output: NCCL tuning configuration file

What breaks going from 8→128 GPUs. Parallelism strategy analysis (DP vs TP vs PP), gradient synchronization overhead at scale, near-linear scaling gap identification.
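As a rough sketch of what the scaling check computes (illustrative only: the function and the example throughput numbers below are made up for demonstration, not client data):

```python
# Illustrative sketch, not the engagement's exact formula: scaling
# efficiency compares measured throughput at N GPUs against perfect
# linear scaling from a smaller baseline run.

def scaling_efficiency(base_gpus, base_tput, n_gpus, n_tput):
    """Fraction of ideal linear scaling actually achieved."""
    ideal = base_tput * (n_gpus / base_gpus)  # perfect linear scaling
    return n_tput / ideal

# Hypothetical example: 8 GPUs do 1,000 samples/s; 128 GPUs do 12,000.
# Ideal would be 16,000 samples/s, so efficiency is 0.75: a 25% gap.
eff = scaling_efficiency(8, 1000.0, 128, 12000.0)
print(f"{eff:.2f}")  # prints 0.75
```

Anything meaningfully below 1.0 at scale is the gap the diagnostic attributes to specific causes.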
Output: Scaling efficiency formula

Written report with ranked findings, dollar value per finding, and specific configuration fixes. Followed by a 60-minute walkthrough call where every finding is explained and questions answered.
Output: PDF report + 60-min call

All benchmark methodology, scripts, and raw data shared. You own the data. Reproducible by your team. Not a black-box report.
Output: Benchmark repo access

What gets found. Every time.
These aren't hypothetical. They're the patterns that appear in every untuned cluster — and they have known fixes.
NCCL buffer size misaligned to fabric MTU. Messages fragment at the boundary — every AllReduce chunk crosses a 7× latency cliff. Invisible in training logs. Shows up as step-time variance.
NCCL defaulting to ring algorithm on a fabric designed for tree topology. Communication pattern fights the physical network — 30–50% AllReduce overhead. One environment variable to fix.
SHARP disabled — in-network compute sitting idle while GPUs and NICs shuffle reductions the switches were built to offload. GPU waits. Utilization dashboard shows 85%+ but actual compute MFU is 15–25%.
Checkpointing strategy recomputing more than necessary — adding 15–25% to step time with no corresponding memory saving. Gradient accumulation misconfigured for the actual batch size.
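For illustration, recoveries like these often come down to a handful of NCCL environment variables. The variables below are real NCCL settings, but the values are placeholders; the right ones depend on your fabric, topology, and NCCL version, which is exactly what the audit determines.

```shell
# Illustrative only. Correct values depend on fabric, topology, NCCL version.
export NCCL_ALGO=Tree            # stop a ring default from fighting a tree fabric
export NCCL_BUFFSIZE=8388608     # size buffers so chunks don't straddle MTU boundaries
export NCCL_COLLNET_ENABLE=1     # enable SHARP in-network reductions where supported
```

One variable set wrong here can cost more per year than the entire engagement.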
The diagnostic pays for itself
before the report is delivered.
Four steps. Five days.
No long onboarding. No retainer. A focused engagement with a clear output.
Day 0: Cluster size, training workload, current utilization metrics, what "slow" looks like. I confirm the diagnostic will find something worth fixing — or I'll tell you why not.

Day 1: Cluster config, NCCL environment variables, nsys/DCGM profiles, nccl-tests output. Secure share — no production access required. I work from the data, not your systems.

Days 1–5: Full diagnostic run — MFU measurement, step-time breakdown, NCCL audit, scaling analysis. Root cause identified and ranked by dollar impact.

Day 5: Written PDF report with ranked findings, dollar value per finding, specific configuration fixes, and a 60-minute walkthrough call. You leave knowing exactly what to fix and in what order.

One price. No surprises.
A single fixed-price engagement. No retainer, no scope creep, no hourly billing. You know the cost before we start. The diagnostic either finds value or it doesn't — and I'll tell you upfront if I think your cluster won't qualify.
- MFU baseline measurement across your training workload
- Complete step-time breakdown — compute / comm / I/O / compile
- Full NCCL configuration audit with tuning config file
- Scaling efficiency diagnosis and parallelism analysis
- Written PDF report with ranked findings and dollar values
- 60-minute walkthrough call with Q&A
- Benchmark scripts and raw data — you own everything
- Follow-up support for 30 days on implementing the fixes
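As a rough illustration of the MFU baseline in the first item above (a minimal sketch: the 6 * params * tokens FLOP approximation and all the example numbers are assumptions, not the engagement's measured methodology):

```python
# Minimal MFU sketch: achieved model FLOP/s vs hardware peak FLOP/s.
# The 6 * params * tokens approximation (forward + backward) and the
# peak figure below are illustrative assumptions, not measured values.

def mfu(params, tokens_per_step, step_time_s, n_gpus, peak_flops_per_gpu):
    achieved = 6 * params * tokens_per_step / step_time_s  # FLOP/s
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak

# Hypothetical run: 7B params, 4M tokens/step, 12 s steps,
# 64 GPUs at an assumed ~1e15 FLOP/s peak each: roughly 22% MFU.
print(f"{mfu(7e9, 4e6, 12.0, 64, 1e15):.2%}")
```

The real measurement normalizes for compilation and interconnect behavior; this sketch only shows the shape of the calculation.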
⚡ Only 2 diagnostic slots available this month.
No commitment required on first call. First insight in 48h.
Your cluster is wasting compute
right now. Find out how much.
A 20-minute call determines whether the diagnostic makes sense for your cluster. No commitment required.
If your cluster is already optimized, I'll tell you in 48h — before the full engagement runs.