AI Infrastructure Intelligence · Google TPU Research Cloud

Most AI teams don't have a hardware problem.
They have a systems problem.

30–60% of your GPU/TPU compute is being wasted.
We show you exactly where.

For teams running
Multi-node GPU clusters (A100 / H100)
LLM training workloads
Large-scale inference systems
Trusted by teams running 100+ GPU clusters · Benchmarked on Google TPU Research Cloud
Founded by Sankar Sathish · 14y Distributed Systems · 7y GPU Clusters · Utility Patent Filed
nsight_profile · cluster_128xA100
$ nydux profile --cluster a100-128 --workload llm-1.3b
MFU detected: 7.2%
Communication: 55% step time
Idle/waiting: 20% step time
Actual compute: 25% step time
Root cause: NCCL misconfiguration
Est. annual waste: $480,000
─────────────────────────────────
Post-optimization MFU: 40.1%
Training time: 7 days → 18 hours
Annual savings: $480,000
7% → 40%
MFU Uplift Achieved
7d → 18h
Training Time Reduced
$480K
Annual Savings Recovered
128 GPUs
Cluster Scale Optimized

Most teams don't realize they're burning $300K+ a year — until it's too late.

The Metric That Matters

GPU utilization shows activity.
Not useful work.

MFU (Model FLOPs Utilization) shows how much real compute you're actually getting from your hardware.

Most teams think 90% utilization means efficiency.
It doesn't.

"If your MFU is below 50%, you are wasting half your GPU budget."

The Formula
MFU = Achieved FLOPs ÷ Theoretical Peak FLOPs
Denominator must be exact (e.g., 275 TFLOPS × chip count for TPUv4). Exclude warmup steps. Measure at steady state.
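In code, the computation is one line; the discipline is in the inputs. A minimal sketch with illustrative numbers, assuming a hypothetical 8-chip TPUv4 slice; achieved FLOPs must come from your own profiler or step counters, not estimates:

# MFU = achieved FLOPs/s ÷ theoretical peak FLOPs/s, measured at steady state.
PEAK_PER_CHIP = 275e12      # TPUv4 peak: 275 TFLOPS per chip
NUM_CHIPS = 8               # state the denominator explicitly

flops_per_step = 2.0e14     # hypothetical: model FLOPs for one training step
step_time = 0.6             # seconds, averaged after excluding warmup steps

mfu = (flops_per_step / step_time) / (PEAK_PER_CHIP * NUM_CHIPS)
print(f"MFU: {mfu:.1%}")    # ≈ 15.2% with these numbers: the unoptimized range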
35–45%
Industry average MFU for large-scale AI training
<15%
Where most unoptimized clusters actually sit
$300K+
Annual compute budget wasted at 15% MFU on a $500K cluster
48h
Time from NYDUX diagnostic to root cause identified
Root Cause Analysis

Where your compute
actually goes.

In a typical multi-node H100 cluster, this is what we find when we profile step time with Nsight Systems. The hardware isn't failing. The system is.

01 · Communication overhead (AllReduce · NCCL · gradient sync) · 55%
GPUs waiting for gradient sync
02 · Idle / waiting (GPUs stalled on sync) · 20%
Pure idle time
03 · Actual compute (MFU · real work) · 25%
Real work being done
Key Insight
Your GPUs are not compute-bound. They are waiting.

Most of your cluster time is spent on communication and synchronization — not actual compute. At $500K+/year, that's $300K+ wasted every year.
Observed across multiple real clusters · Profiled with Nsight Systems · Not theoretical
Business Impact

The cost of
not knowing.

$500K Annual GPU Infrastructure
Compute paid for: $500K
Wasted (15% MFU): $300K+ idle
Useful compute: $200K
60× ROI
$5K diagnostic → $300K+ recovered annually. Every year you wait costs more than the diagnostic.
A 7-day training run should take 18 hours.
You're paying for 100% of your GPUs.

You're getting ~40% of the compute.

The rest is waste — often $300K+ per year.
Buying more GPUs amplifies the problem.
If your current cluster is 15% efficient, doubling GPU count doubles your waste. The systems problem scales with hardware. The diagnostic finds it before you scale.
One diagnostic. Permanent fix.
NCCL misconfiguration. Wrong batch sizing. AllReduce overhead. These are fixable in days — not months. The savings compound every year.
Engineering Authority

We don't guess.
We profile and fix at the system layer.

NCCL · AllReduce · Collective Ops
Communication Optimization
We tune NCCL_SOCKET_NTHREADS, NCCL_NSOCKS_PERTHREAD, and tree vs ring AllReduce topology. A misconfigured AllReduce can waste 40–55% of step time. We find it and fix it.
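A minimal sketch of where those knobs live, assuming a PyTorch job launched with torchrun that initializes NCCL via torch.distributed; the values shown are illustrative starting points, not universal settings, and every change should be validated with a profile:

import os
import torch.distributed as dist

# NCCL reads these at communicator init, so set them before the first collective.
# Values below are illustrative; the right ones depend on NIC count and topology.
os.environ.setdefault("NCCL_SOCKET_NTHREADS", "4")
os.environ.setdefault("NCCL_NSOCKS_PERTHREAD", "4")
os.environ.setdefault("NCCL_ALGO", "Tree")    # benchmark Tree vs Ring at your node count
os.environ.setdefault("NCCL_DEBUG", "INFO")   # logs the algorithm and topology NCCL selects

dist.init_process_group(backend="nccl")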
InfiniBand · RoCE · GPUDirect RDMA
Interconnect Engineering
InfiniBand fabric configuration, GPUDirect RDMA bandwidth verification, RoCE vs IB trade-off analysis. We profile actual bytes transferred vs theoretical link bandwidth at every step boundary.
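The check itself is arithmetic; the work is getting honest inputs. A sketch with hypothetical numbers, where the bytes and timings come from profiler traces and the link rate from your fabric spec:

# Achieved interconnect bandwidth vs theoretical line rate, per step.
bytes_per_step = 12e9        # hypothetical: collective bytes on the wire (from trace)
comm_time = 0.34             # hypothetical: seconds in collectives per step (from trace)
link_bandwidth = 400e9 / 8   # 400 Gb/s InfiniBand NDR link → 50 GB/s

achieved = bytes_per_step / comm_time
print(f"Achieved: {achieved / 1e9:.1f} GB/s "
      f"({achieved / link_bandwidth:.0%} of line rate)")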
JAX · XLA · TPUv4 · TPUv5e
TPU Systems Intelligence
TPU is a static-shape accelerator. Dynamic shapes trigger compilation storms. We enforce shape freezing, use JAX_LOG_COMPILES=1 to verify single-compile runs, and compute the MFU numerator with the PaLM FLOPs-per-token formula: 6N + 12LHQT.
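A minimal sketch of what this looks like in practice, using a toy jitted step; train_step here is a stand-in for a real forward/backward/update, and the shapes and step counts are illustrative:

import time
import jax
import jax.numpy as jnp

# Log every XLA compilation; a healthy run shows exactly one compile per jitted fn.
jax.config.update("jax_log_compiles", True)

@jax.jit
def train_step(params, batch):
    # Hypothetical step: stands in for your real forward/backward/update.
    return params - 1e-3 * jnp.mean(batch) * params

params = jnp.ones((1024, 1024))
batch = jnp.ones((256, 1024))    # keep this shape constant across steps

# Warmup (compile + stabilize), excluded from measurement.
for _ in range(10):
    params = train_step(params, batch)
params.block_until_ready()

start = time.perf_counter()
for _ in range(100):
    params = train_step(params, batch)   # no per-step host syncs or .item() calls
params.block_until_ready()               # block once, at the measurement boundary
print(f"steady-state step time: {(time.perf_counter() - start) / 100 * 1e3:.2f} ms")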
01
Freeze Shapes
TPU is static. Variable batch sizes or dynamic inputs trigger compilation storms. Ensure JAX_LOG_COMPILES=1 shows exactly ONE compile per run.
02
Block & Measure Once
Avoid per-step .item() calls. Pipeline asynchronously. Use block_until_ready only at defined boundaries. Exclude first 10 warmup steps.
03
The PaLM Formula
Use 6N + 12LHQT for accurate MFU. Exclude embedding parameters from N. Always state denominator (e.g., 275 TFLOPS × 8 chips).
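A worked sketch of the formula, with hypothetical dimensions loosely in the 1.3B-parameter range; N excludes embedding parameters, and throughput must be a steady-state measurement with warmup excluded:

# FLOPs per token ≈ 6N + 12·L·H·Q·T  (PaLM-style accounting)
N = 1.2e9                        # hypothetical non-embedding parameter count
L, H, Q, T = 24, 16, 128, 2048   # layers, heads, head dim, sequence length

flops_per_token = 6 * N + 12 * L * H * Q * T

tokens_per_sec = 1.0e5           # hypothetical steady-state throughput
peak = 275e12 * 8                # denominator stated: 275 TFLOPS × 8 chips

mfu = flops_per_token * tokens_per_sec / peak
print(f"MFU: {mfu:.1%}")         # ≈ 38.2% with these numbers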
Diagnostic Framework

Symptom → Root Cause → Fix.

This is how we think. Every inefficiency has a traceable root cause. Every root cause has a fix. Our diagnostic maps your cluster against this framework in 3–5 days.

Symptom → Likely Root Cause → NYDUX Fix
0 FLOPs on TPU row, host row busy → Input pipeline stall → Use grain or tf.data with aggressive prefetch
High bytes accessed, low MXU% → Memory-bound operation → Increase batch size, check d_model alignment
Short compute bursts, long sync gaps → AllReduce communication overhead → FSDP sharding, latency-hiding scheduler
Constant recompile messages in logs → Shape change mid-run (compile storm) → Freeze shapes, check dropout seeds
Low GPU utilization despite high load → CPU-GPU pipeline bottleneck → Async data loading, pinned memory, deeper prefetch
MFU drops at scale (8 → 64 GPUs) → NCCL ring topology misconfiguration → Tune NCCL_SOCKET_NTHREADS, tree vs ring AllReduce
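For the first row above, a minimal sketch of the fix, assuming a tf.data input pipeline; the file pattern and batch size are placeholders:

import tensorflow as tf

# Hypothetical shard pattern; substitute your dataset.
files = tf.data.Dataset.list_files("gs://your-bucket/train-*.tfrecord")

ds = (
    tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
    .batch(256, drop_remainder=True)   # fixed batch size also keeps shapes static
    .prefetch(tf.data.AUTOTUNE)        # overlap host-side prep with device compute
)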
Engagement Model

Stop guessing.
Start measuring.

Identify your $300K. Fix it. Scale with confidence.

01 · Flagship
Identify $300K+ of Wasted GPU Capacity
$5K Diagnostic · Remote · 3–5 days
We break down exactly where your time and money are going — and show you how to recover it. We profile your cluster remotely, map your step-time breakdown, quantify the exact dollar value of recoverable efficiency, and deliver a written report with specific configuration fixes.
  • Complete step-time breakdown (compute vs communication vs idle)
  • NCCL and InfiniBand bottleneck analysis
  • Dollar value of recoverable efficiency — exact figure
  • Written report with configuration-level fixes
  • Typical outcome: 20–50% throughput improvement
  • ROI: 60× return in year one
Book Diagnostic →
02
ML Systems Mastery
Live Cohort · 10 weeks · Max 10 engineers
Deep-dive live training on GPU cluster internals. NCCL, GPUDirect RDMA, distributed training, MFU optimization. Bring your own production problems.
  • JAX & XLA deep dives
  • NCCL tuning from first principles
  • InfiniBand + GPUDirect RDMA
  • Nsight Systems profiling
  • Real workload optimization
Learn More →
03
Technical Partnership
Ongoing Engagement · Retainer
Embedded technical partner for infrastructure build, customer deployments, and tender support. We bring technical design — you bring the relationships.
  • Data center design & cluster topology
  • InfiniBand vs RoCE architecture
  • Multi-tenancy & GPU isolation
  • Tender & bid support
  • No upfront fees for bidding
Contact Sankar →
Ready to find your $300K?

Your GPUs are running.
Your money isn't.

Most clusters reveal $200K–$600K of recoverable efficiency in the first diagnostic session. No hardware changes. No new spend. Just systems intelligence.