AI Infrastructure Intelligence · Google TPU Research Cloud

Most AI teams don't have a hardware problem.
They have a systems problem.

30–60% of your GPU/TPU compute is being wasted.
We show you exactly where.

For teams running
Multi-node GPU clusters (A100 / H100)
LLM training workloads
Large-scale inference systems
Trusted by teams running 100+ GPU clusters · Benchmarked on Google TPU Research Cloud
Founded by Sankar Sathish · 14y Distributed Systems · 7y GPU Clusters · Utility Patent Filed
nsight_profile · cluster_128xA100
$ nydux profile --cluster a100-128 --workload llm-1.3b
MFU detected: 7.2%
Communication: 55% step time
Idle/waiting: 20% step time
Actual compute: 25% step time
Root cause: NCCL misconfiguration
Est. annual waste: $480,000
─────────────────────────────────
Post-optimization MFU: 40.1%
Training time: 7 days → 18 hours
Annual savings: $480,000
7% → 40%
MFU Uplift Achieved
7d → 18h
Training Time Reduced
$480K
Annual Savings Recovered
128 GPUs
Cluster Scale Optimized

Most teams don't realize they're burning $300K+ a year — until it's too late.

The Metric That Matters

GPU utilization shows activity.
Not useful work.

MFU (Model FLOPs Utilization) shows how much real compute you're actually getting from your hardware.

Most teams think 90% utilization means efficiency.
It doesn't.

"If your MFU is below 50%, you are wasting half your GPU budget."

The Formula
MFU = Achieved FLOPs ÷ Theoretical Peak FLOPs
Denominator must be exact (e.g., 275 TFLOPS × chip count for TPUv4). Exclude warmup steps. Measure at steady state.
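In code, the computation is one line; the discipline is in the inputs. A minimal sketch with illustrative numbers, assuming a hypothetical 8-chip TPUv4 slice; achieved FLOPs must come from your own profiler or step counters, not estimates:

# MFU = achieved FLOPs/s ÷ theoretical peak FLOPs/s, measured at steady state.
PEAK_PER_CHIP = 275e12      # TPUv4 peak: 275 TFLOPS per chip
NUM_CHIPS = 8               # state the denominator explicitly

flops_per_step = 2.0e14     # hypothetical: model FLOPs for one training step
step_time = 0.6             # seconds, averaged after excluding warmup steps

mfu = (flops_per_step / step_time) / (PEAK_PER_CHIP * NUM_CHIPS)
print(f"MFU: {mfu:.1%}")    # ≈ 15.2% with these numbers: the unoptimized range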
35–45%
Industry average MFU for large-scale AI training
<15%
Where most unoptimized clusters actually sit
$300K+
Annual compute budget wasted at 15% MFU on a $500K cluster
48h
Time from NYDUX diagnostic to root cause identified
Root Cause Analysis

Where your compute
actually goes.

In a typical multi-node H100 cluster, this is what we find when we profile step time with Nsight Systems. The hardware isn't failing. The system is.

01 · Communication overhead (AllReduce · NCCL · gradient sync) · 55%
GPUs waiting for gradient sync
02 · Idle / waiting (GPUs stalled on sync) · 20%
Pure idle time
03 · Actual compute (MFU · real work) · 25%
Real work being done
Key Insight
Your GPUs are not compute-bound. They are waiting.

Most of your cluster time is spent on communication and synchronization — not actual compute. At $500K+/year, that's $300K+ wasted every year.
Observed across multiple real clusters · Profiled with Nsight Systems · Not theoretical
Business Impact

The cost of
not knowing.

$500K Annual GPU Infrastructure
Compute paid for: $500K
Wasted (15% MFU): $300K+ idle
Useful compute: $200K
60× ROI
$5K diagnostic → $300K+ recovered annually. Every year you wait costs more than the diagnostic.
A 7-day training run should take 18 hours.
You're paying for 100% of your GPUs.

You're getting ~40% of the compute.

The rest is waste — often $300K+ per year.
Buying more GPUs amplifies the problem.
If your current cluster is 15% efficient, doubling GPU count doubles your waste. The systems problem scales with hardware. The diagnostic finds it before you scale.
One diagnostic. Permanent fix.
NCCL misconfiguration. Wrong batch sizing. AllReduce overhead. These are fixable in days — not months. The savings compound every year.
Engineering Authority

We don't guess.
We profile and fix at the system layer.

NCCL · AllReduce · Collective Ops
Communication Optimization
We tune NCCL_SOCKET_NTHREADS, NCCL_NSOCKS_PERTHREAD, and tree vs ring AllReduce topology. A misconfigured AllReduce can waste 40–55% of step time. We find it and fix it.
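A minimal sketch of where those knobs live, assuming a PyTorch job launched with torchrun that initializes NCCL via torch.distributed; the values shown are illustrative starting points, not universal settings, and every change should be validated with a profile:

import os
import torch.distributed as dist

# NCCL reads these at communicator init, so set them before the first collective.
# Values below are illustrative; the right ones depend on NIC count and topology.
os.environ.setdefault("NCCL_SOCKET_NTHREADS", "4")
os.environ.setdefault("NCCL_NSOCKS_PERTHREAD", "4")
os.environ.setdefault("NCCL_ALGO", "Tree")    # benchmark Tree vs Ring at your node count
os.environ.setdefault("NCCL_DEBUG", "INFO")   # logs the algorithm and topology NCCL selects

dist.init_process_group(backend="nccl")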
InfiniBand · RoCE · GPUDirect RDMA
Interconnect Engineering
InfiniBand fabric configuration, GPUDirect RDMA bandwidth verification, RoCE vs IB trade-off analysis. We profile actual bytes transferred vs theoretical link bandwidth at every step boundary.
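The check itself is arithmetic; the work is getting honest inputs. A sketch with hypothetical numbers, where the bytes and timings come from profiler traces and the link rate from your fabric spec:

# Achieved interconnect bandwidth vs theoretical line rate, per step.
bytes_per_step = 12e9        # hypothetical: collective bytes on the wire (from trace)
comm_time = 0.34             # hypothetical: seconds in collectives per step (from trace)
link_bandwidth = 400e9 / 8   # 400 Gb/s InfiniBand NDR link → 50 GB/s

achieved = bytes_per_step / comm_time
print(f"Achieved: {achieved / 1e9:.1f} GB/s "
      f"({achieved / link_bandwidth:.0%} of line rate)")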
JAX · XLA · TPUv4 · TPUv5e
TPU Systems Intelligence
TPU is a static-shape accelerator. Dynamic shapes trigger compilation storms. We enforce shape freezing, use JAX_LOG_COMPILES=1 to verify single-compile runs, and compute the MFU numerator with the PaLM FLOPs-per-token formula: 6N + 12LHQT.
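A minimal sketch of what this looks like in practice, using a toy jitted step; train_step here is a stand-in for a real forward/backward/update, and the shapes and step counts are illustrative:

import time
import jax
import jax.numpy as jnp

# Log every XLA compilation; a healthy run shows exactly one compile per jitted fn.
jax.config.update("jax_log_compiles", True)

@jax.jit
def train_step(params, batch):
    # Hypothetical step: stands in for your real forward/backward/update.
    return params - 1e-3 * jnp.mean(batch) * params

params = jnp.ones((1024, 1024))
batch = jnp.ones((256, 1024))    # keep this shape constant across steps

# Warmup (compile + stabilize), excluded from measurement.
for _ in range(10):
    params = train_step(params, batch)
params.block_until_ready()

start = time.perf_counter()
for _ in range(100):
    params = train_step(params, batch)   # no per-step host syncs or .item() calls
params.block_until_ready()               # block once, at the measurement boundary
print(f"steady-state step time: {(time.perf_counter() - start) / 100 * 1e3:.2f} ms")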
01
Freeze Shapes
TPU is static. Variable batch sizes or dynamic inputs trigger compilation storms. Ensure JAX_LOG_COMPILES=1 shows exactly ONE compile per run.
02
Block & Measure Once
Avoid per-step .item() calls. Pipeline asynchronously. Use block_until_ready only at defined boundaries. Exclude first 10 warmup steps.
03
The PaLM Formula
Use 6N + 12LHQT for accurate MFU. Exclude embedding parameters from N. Always state denominator (e.g., 275 TFLOPS × 8 chips).
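A worked sketch of the formula, with hypothetical dimensions loosely in the 1.3B-parameter range; N excludes embedding parameters, and throughput must be a steady-state measurement with warmup excluded:

# FLOPs per token ≈ 6N + 12·L·H·Q·T  (PaLM-style accounting)
N = 1.2e9                        # hypothetical non-embedding parameter count
L, H, Q, T = 24, 16, 128, 2048   # layers, heads, head dim, sequence length

flops_per_token = 6 * N + 12 * L * H * Q * T

tokens_per_sec = 1.0e5           # hypothetical steady-state throughput
peak = 275e12 * 8                # denominator stated: 275 TFLOPS × 8 chips

mfu = flops_per_token * tokens_per_sec / peak
print(f"MFU: {mfu:.1%}")         # ≈ 38.2% with these numbers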
Diagnostic Framework

Symptom → Root Cause → Fix.

This is how we think. Every inefficiency has a traceable root cause. Every root cause has a fix. Our diagnostic maps your cluster against this framework in 3–5 days.

Symptom → Likely Root Cause → NYDUX Fix
0 FLOPs on TPU row, host row busy → Input pipeline stall → Use grain or tf.data with aggressive prefetch
High bytes accessed, low MXU% → Memory-bound operation → Increase batch size, check d_model alignment
Short compute bursts, long sync gaps → AllReduce communication overhead → FSDP sharding, latency-hiding scheduler
Constant recompile messages in logs → Shape change mid-run (compile storm) → Freeze shapes, check dropout seeds
Low GPU utilization despite high load → CPU-GPU pipeline bottleneck → Async data loading, pinned memory, deeper prefetch
MFU drops at scale (8 → 64 GPUs) → NCCL ring topology misconfiguration → Tune NCCL_SOCKET_NTHREADS, tree vs ring AllReduce
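For the first row above, a minimal sketch of the fix, assuming a tf.data input pipeline; the file pattern and batch size are placeholders:

import tensorflow as tf

# Hypothetical shard pattern; substitute your dataset.
files = tf.data.Dataset.list_files("gs://your-bucket/train-*.tfrecord")

ds = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .batch(256, drop_remainder=True)   # fixed batch size also keeps shapes static
    .prefetch(tf.data.AUTOTUNE)        # overlap host-side prep with device compute
)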
Engagement Model

Stop guessing.
Start measuring.

Identify your $300K. Fix it. Scale with confidence.

01 · Flagship
Identify $300K+ of Wasted GPU Capacity
$5K Diagnostic · Remote · 3–5 days
We break down exactly where your time and money are going — and show you how to recover it. We profile your cluster remotely, map your step-time breakdown, quantify the exact dollar value of recoverable efficiency, and deliver a written report with specific configuration fixes.
  • Complete step-time breakdown (compute vs communication vs idle)
  • NCCL and InfiniBand bottleneck analysis
  • Dollar value of recoverable efficiency — exact figure
  • Written report with configuration-level fixes
  • Typical outcome: 20–50% throughput improvement
  • ROI: 60× return in year one
Book Diagnostic →
02
ML Systems Mastery
Live Cohort · 10 weeks · Max 10 engineers
Deep-dive live training on GPU cluster internals. NCCL, GPUDirect RDMA, distributed training, MFU optimization. Bring your own production problems.
  • JAX & XLA deep dives
  • NCCL tuning from first principles
  • InfiniBand + GPUDirect RDMA
  • Nsight Systems profiling
  • Real workload optimization
Learn More →
03
Technical Partnership
Ongoing Engagement · Retainer
Embedded technical partner for infrastructure build, customer deployments, and tender support. We bring technical design — you bring the relationships.
  • Data center design & cluster topology
  • InfiniBand vs RoCE architecture
  • Multi-tenancy & GPU isolation
  • Tender & bid support
  • No upfront fees for bidding
Contact Sankar →
Ready to find your $300K?

Your GPUs are running.
Your money isn't.

Most clusters reveal $200K–$600K of recoverable efficiency in the first diagnostic session. No hardware changes. No new spend. Just systems intelligence.