
Data Center, Networking & DPUs

NCA-AIIO · AI Infrastructure · 40% of Exam

Power & Cooling · InfiniBand vs Ethernet · BlueField DPUs · SHARP · On-Prem vs Cloud

GPU Rack Power · Cooling Methods · InfiniBand · RoCE · RDMA · NDR 400Gb/s · BlueField DPU · SHARP · Spectrum-X · PUE · On-Prem vs Cloud · DLC
Data Center, Networking & DPUs covers the physical and network infrastructure powering modern AI GPU clusters. This is part of the AI Infrastructure domain, representing 40% of the NCA-AIIO exam. Power/cooling densities, InfiniBand fabric design, DPU offload capabilities, and the on-prem vs cloud decision are all specifically called out in official exam objectives.

What You'll Master

Power Requirements for GPU Clusters

H100 SXM5 = 700W per GPU; DGX H100 = ~10.2 kW. AI GPU racks draw 30–60+ kW — 3–6× traditional servers. Understand PDU sizing, circuit requirements, and redundancy (N+1, 2N).

Cooling Methods & Selection

Air cooling up to ~30 kW/rack. Direct Liquid Cooling (DLC) for 30–60+ kW. Immersion for extreme density. B200 systems require DLC. PUE measures facility efficiency.

Facility Requirements

Floor loading for heavy GPU servers (~290 lbs per DGX H100), electrical panel capacity, structured cabling, ASHRAE thermal guidelines, hot-aisle/cold-aisle containment.

AI Networking Protocols

RDMA = zero-CPU memory transfer. InfiniBand = lossless native RDMA. RoCE = RDMA over Ethernet. TCP/IP for management. MPI/NCCL for distributed training collectives.

High-Speed DC Network Options

NDR InfiniBand (400Gb/s) for compute fabric. Spectrum-X Ethernet for RoCE-optimized AI. NVLink for intra-node GPU interconnect. SHARP for in-network all-reduce aggregation.

BlueField DPU & On-Prem vs Cloud

DPU = 3rd pillar: offloads networking/storage/security from CPU. DOCA SDK programs DPU services. On-prem vs cloud: CapEx/OpEx, lead times, colocation, DGX Cloud options.

Exam Weight

Domain | Coverage | Exam Questions (est.)
AI Infrastructure, incl. Data Center, Networking & DPUs (this page) | 40% | ~20 questions
Essential AI Knowledge | ~38% | ~19 questions
AI Operations & MLOps | ~22% | ~11 questions
Total exam: 50 questions, 60 minutes, passing score ~70%

Concept 1 — Power Requirements for AI GPU Clusters

GPU TDP (Thermal Design Power)

H100 SXM5 = 700W per GPU. 8-GPU DGX H100 server = ~10.2 kW total (GPUs + CPUs + cooling fans + networking). B200 = up to 1,000W per GPU.

Rack Power Density

Traditional IT racks = 5–10 kW. AI GPU racks = 30–60+ kW. A rack housing two DGX H100 systems draws ~21 kW. This density is 3–6× that of traditional servers.
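
A quick back-of-the-envelope check in Python (illustrative numbers, not vendor specifications) can confirm whether a planned rack density fits the facility's per-rack power budget:

```python
# Back-of-the-envelope rack power check (illustrative values, not vendor specs).
DGX_H100_KW = 10.2        # approximate max draw per DGX H100 system
servers_per_rack = 2

rack_kw = DGX_H100_KW * servers_per_rack
print(f"Estimated rack draw: {rack_kw:.1f} kW")            # ~20.4 kW

facility_kw_per_rack = 30  # what the room can deliver and cool per rack (assumed)
if rack_kw > facility_kw_per_rack:
    print("Exceeds per-rack budget: reduce density or upgrade power/cooling")
else:
    print(f"Headroom: {facility_kw_per_rack - rack_kw:.1f} kW")
```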

Power Distribution

Requires high-amperage PDUs; 3-phase power distribution; dedicated electrical circuits per rack. DGX H100 requires 200–240V AC, 60A circuit per server.

Power Redundancy

N+1 = one extra PSU per server. 2N = fully duplicated power path from UPS to rack. AI production environments require at minimum N+1 for PSUs and dual power feeds from separate UPS/PDUs.

UPS and Backup Power

Uninterruptible Power Supply (UPS) provides ride-through during power events. Generator backup for extended outages. Checkpointing AI training jobs protects against unexpected power loss.

Power Usage Effectiveness (PUE)

Ratio of total facility power to IT equipment power. Ideal = 1.0. Good AI DC = 1.1–1.3. High GPU cooling load makes low PUE challenging. Liquid cooling improves PUE significantly.
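
A minimal worked example of the PUE ratio, using made-up load figures for a liquid-cooled AI hall:

```python
# PUE = total facility power / IT equipment power (illustrative loads in kW).
it_load_kw = 1000          # GPU servers, storage, network gear
cooling_kw = 180           # chillers, CRAH fans, pumps
other_kw = 70              # lighting, UPS and distribution losses

pue = (it_load_kw + cooling_kw + other_kw) / it_load_kw
print(f"PUE = {pue:.2f}")  # 1.25, inside the 1.1-1.3 band cited above
```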

Concept 2 — Cooling Requirements and Methods

Thermal Challenge

H100 GPU generates 700W of heat in a small package. 8 GPUs = 5.6 kW of heat from GPUs alone. Traditional air cooling struggles at densities above 20–25 kW per rack.

Air Cooling (CRAC/CRAH)

Computer Room Air Conditioning/Handler. Hot-aisle/cold-aisle containment; chilled water coils. Adequate for racks up to ~30 kW with careful airflow management. DGX H100 can be air-cooled.

Direct Liquid Cooling (DLC)

Coolant flows through cold plates directly on GPU/CPU chips. Highest thermal efficiency. Supports rack densities of 100+ kW. DGX H100 supports DLC. Required for B200-based systems at full TDP.

Rear-Door Heat Exchangers

Liquid-cooled door mounted at rack rear. Captures hot air exhaust and transfers to chilled water loop. Retrofittable to existing racks. Good for densities up to 60 kW.

Immersion Cooling

Servers submerged in dielectric fluid (single-phase or two-phase). Extreme density (100+ kW per tank). Largest PUE improvement, but high upfront cost. Not yet mainstream for GPU clusters.

Cooling Selection Guide

Up to 30 kW/rack = air cooling. 30–60 kW = DLC or rear-door. 60+ kW = full liquid cooling or immersion. B200 systems essentially require DLC.
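
The selection guide above can be captured as a tiny decision helper; the thresholds below are the rough guidelines from this section, not hard limits:

```python
# The cooling ladder as a small decision helper (thresholds are rough guidelines).
def suggest_cooling(rack_kw: float, is_b200: bool = False) -> str:
    if is_b200:
        return "Direct Liquid Cooling (DLC)"     # B200 systems effectively require DLC
    if rack_kw <= 30:
        return "Air cooling (CRAC/CRAH with containment)"
    if rack_kw <= 60:
        return "DLC or rear-door heat exchanger"
    return "Full liquid cooling or immersion"

print(suggest_cooling(21))                 # air cooling
print(suggest_cooling(45))                 # DLC or rear-door
print(suggest_cooling(40, is_b200=True))   # DLC regardless of density
```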

Concept 3 — Data Center Facility Requirements

Floor Loading

Standard raised floors handle ~500–1,000 lbs/sq ft. DGX H100 = ~130 kg / 290 lbs for an 8U system. High-density GPU deployments may require floor reinforcement or specific rack placement.

Space and Rack Planning

DGX H100 = 8U. Determine rack units (U) needed per server. Plan for cable management, hot-aisle/cold-aisle, and service access in high-density GPU layouts.

Electrical Infrastructure

GPU clusters require dedicated electrical panels. DGX POD (20 nodes) ≈ 200 kW for compute alone. Add 30–50% overhead for cooling infrastructure.
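
A short sketch of the pod-level estimate described above, treating the 30–50% cooling overhead as a planning assumption:

```python
# Pod-level power estimate with cooling overhead (figures from this section).
nodes = 20
kw_per_node = 10
compute_kw = nodes * kw_per_node                  # ~200 kW for compute alone

for overhead in (0.30, 0.50):                     # 30-50% cooling/facility overhead
    print(f"{overhead:.0%} overhead -> plan for ~{compute_kw * (1 + overhead):.0f} kW")
```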

Network Cabling

InfiniBand cables: copper DAC for short runs (<3m); fiber optic AOC/optical for longer runs. Each GPU server has multiple InfiniBand HCAs (one per GPU in rail-optimized topology).

Environmental Controls

ASHRAE Class A3 allows inlet temperatures up to 40°C (Class A4 up to 45°C). NVIDIA recommends an 18–27°C cold aisle for air-cooled systems. Humidity: 40–60% relative humidity.

Raised Floor vs Overhead Cabling

Raised floor enables cold air distribution. Overhead cabling reduces floor obstructions. Modern AI DCs often use a combination. Overhead power busbars for high-amperage distribution.

Concept 4 — Data Center Networking Protocols and Key Concepts

RDMA (Remote Direct Memory Access)

Allows one server to directly read/write another server's memory without involving the remote CPU. Zero-copy; latency ~1–2 microseconds. Essential for efficient all-reduce in distributed training.

InfiniBand (IB)

Purpose-built HPC network. Lossless by design; built-in flow control; native RDMA. Speeds: EDR (100Gb/s) → HDR (200Gb/s) → NDR (400Gb/s) → XDR (800Gb/s upcoming). NVIDIA Quantum switches.

RoCE (RDMA over Converged Ethernet)

RDMA semantics over standard Ethernet. Requires lossless Ethernet (Priority Flow Control + ECN). Lower infrastructure cost than IB. NVIDIA Spectrum-X optimizes RoCE for AI.

TCP/IP Ethernet

Standard networking. Lossy by design; TCP recovers from drops via retransmission. No native RDMA. Suitable for management networks, inference serving, and storage access. 1/10/25/100/200/400GbE options.

MPI and NCCL

MPI (Message Passing Interface): programming model for distributed computing — defines collectives (all-reduce, broadcast, scatter, gather). NCCL: NVIDIA's GPU-optimized collective communications library used by AI frameworks.
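
As a rough illustration of what NCCL collectives look like from a framework's perspective, here is a minimal PyTorch sketch; it assumes PyTorch with the NCCL backend, one GPU per process, and a torchrun launch that sets the usual rank/world-size environment variables:

```python
# Minimal NCCL all-reduce, roughly what a framework issues during gradient sync.
# Assumes PyTorch with CUDA GPUs and a torchrun launch (which sets RANK,
# WORLD_SIZE, LOCAL_RANK, MASTER_ADDR/PORT), one GPU per process.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")            # NCCL rides on IB/RoCE when present
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

grad = torch.ones(1024, device="cuda") * dist.get_rank()  # stand-in gradient shard
dist.all_reduce(grad, op=dist.ReduceOp.SUM)        # sum across all ranks
print(f"rank {dist.get_rank()}: grad[0] after all-reduce = {grad[0].item()}")

dist.destroy_process_group()
```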

Subnet Manager

Required for InfiniBand networks. Manages topology discovery, routing, and QoS. OpenSM (open-source) or NVIDIA UFM (Unified Fabric Manager) for production. Must be highly available in large clusters.

Concept 5 — High-Speed DC Network Options

NVIDIA Quantum-2 InfiniBand (NDR)

400Gb/s per port. Non-blocking fat-tree switch (QM9700 = 64 ports). Latency <130ns. Used in DGX POD/SuperPOD. Supports SHARP in-network aggregation.

NVIDIA Spectrum-X (Ethernet for AI)

RoCE-optimized 400GbE Ethernet. Combines Spectrum-4 switch + BlueField-3 DPU. NVIDIA Adaptive Routing and congestion control. Achieves near-InfiniBand performance on Ethernet for large-scale AI training.

NVLink (Intra-Node)

Direct GPU-to-GPU interconnect within a server. H100 with 4th-gen NVLink: 900 GB/s total bidirectional bandwidth per GPU. Not a traditional network; spans only a node (or a GB200 NVLink domain with NVLink Switch).

Storage Networking

InfiniBand can carry storage traffic (NFS over RDMA, Lustre, GPFS over IB). Dedicated storage Ethernet (25/100GbE) for object storage. GPUDirect Storage uses RDMA for direct storage-to-GPU-memory transfers.

Out-of-Band Management Network

Separate 1GbE Ethernet for BMC/iDRAC access (IPMI). Always on even when server is off. Used for: remote power on/off, console access, BIOS updates, health monitoring. Never share with data plane.

Bandwidth Planning Example

8-GPU node with NDR IB = 8 ports × 400Gb/s = 3.2 Tb/s compute fabric bandwidth per node. Must be connected to non-blocking switch to avoid oversubscription in large-scale training.
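
The same arithmetic as a short sketch:

```python
# Per-node compute-fabric bandwidth for an 8-GPU node with one NDR port per GPU.
ports_per_node = 8
gbps_per_port = 400
print(f"{ports_per_node * gbps_per_port / 1000} Tb/s per node")   # 3.2 Tb/s
```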

Concept 6 — DPU (Data Processing Unit) — NVIDIA BlueField

What Is a DPU?

"Third pillar" of computing (alongside CPU and GPU). A programmable network processor with ARM cores, high-bandwidth network connectivity, and hardware accelerators for networking, storage, and security.

NVIDIA BlueField-3 Specs

400Gb/s InfiniBand or Ethernet connectivity. 16 Arm Cortex-A78 cores. Hardware offload for RDMA, cryptography, packet processing, storage. PCIe Gen5 to host CPU.

DPU Use Cases in AI Data Centers

  • Network offload — move RDMA/IB processing off CPU
  • Storage offload — NVMe-oF target/initiator processing
  • Security — encrypted networking, zero-trust enforcement
  • Isolation — host vs. tenant isolation in cloud environments

DPU vs SmartNIC

SmartNIC = network card with some on-card processing. DPU = full programmable data center infrastructure accelerator with its own OS (Linux on ARM), persistent services, and hardware accelerators — much more capable than SmartNIC.

How DPUs Help AI Workloads

CPU cores freed from networking/storage I/O can be used for data preprocessing. RDMA offload reduces latency for all-reduce. Enables microsegmentation and encrypted data flows without performance penalty.

NVIDIA DOCA SDK

Data Center Infrastructure on a Chip Architecture. SDK for programming BlueField DPUs. Provides APIs for networking, storage, security services. Enables containerized services running on DPU ARM cores.

Concept 7 — AI Networking Considerations for Training Workloads

All-Reduce Bandwidth Requirement

Gradient synchronization generates traffic proportional to model size × 2 × (N-1)/N per step. For large models, hundreds of GB per step across all nodes. Fabric must be non-blocking.
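
A worked example of that traffic estimate, with an illustrative model size and GPU count (both assumptions, not exam figures):

```python
# Ring all-reduce traffic per GPU per step, using the 2 * (N-1)/N rule above.
params = 70e9                      # illustrative 70B-parameter model
bytes_per_param = 2                # fp16/bf16 gradients
n_gpus = 1024

grad_bytes = params * bytes_per_param
per_gpu = 2 * (n_gpus - 1) / n_gpus * grad_bytes
print(f"~{per_gpu / 1e9:.0f} GB moved per GPU per synchronization step")
```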

Congestion Control

Multiple flows compete for bandwidth in large IB/RoCE fabrics. Without congestion control, packet drops (RoCE) or credit stalls (IB) reduce throughput. NVIDIA Adaptive Routing + ECN-based congestion control address this.

Latency Sensitivity

Synchronous training means all GPUs wait for slowest all-reduce. Even small latency spikes (straggler nodes) impact cluster efficiency. Low-latency networking (IB) minimizes waiting time.

SHARP — In-Network Computing

Scalable Hierarchical Aggregation and Reduction Protocol. NVIDIA Quantum-2 performs gradient aggregation inside the switch — not at end nodes. Significant bandwidth savings for large-scale training.

Topology and Oversubscription

Non-blocking fat-tree has no oversubscription — ideal for training. 2:1 oversubscription halves effective bandwidth. Training workloads should use non-blocking; inference may tolerate some oversubscription.

Storage Network Separation

Always separate training data plane (GPU-to-GPU) from storage network. Mixing causes contention. Use dedicated IB partitions or separate physical networks for storage traffic.

Concept 8 — On-Premises vs Cloud for Data Center Operations

Control and Customization

On-prem: full control over hardware configuration, networking design (IB topology), cooling choices, and software stack. Cloud abstracts this away: you typically get VMs or instances rather than bare metal.

Procurement and Lead Times

On-prem GPU hardware: lead times of 3–6+ months for large orders. Cloud allows instant access to current-generation GPUs. Critical advantage for time-sensitive projects.

Cost Model

On-prem = CapEx (large upfront) + OpEx (power, cooling, staff). Cloud = pure OpEx (pay per hour). 3-year TCO: for >50% GPU utilization, on-prem often wins; below that, cloud wins.
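
A toy break-even sketch; every dollar figure and rate below is a placeholder assumption, included only to show how the crossover point is computed:

```python
# Toy 3-year TCO comparison; every figure below is a placeholder assumption.
onprem_capex = 400_000            # 8-GPU server purchase, amortized over 3 years
onprem_opex_per_year = 60_000     # power, cooling, space, staff share
cloud_rate_per_hour = 45          # comparable 8-GPU cloud instance
hours_3yr = 3 * 365 * 24

onprem_total = onprem_capex + 3 * onprem_opex_per_year
for utilization in (0.25, 0.50, 0.75):
    cloud_total = cloud_rate_per_hour * hours_3yr * utilization
    winner = "on-prem" if onprem_total < cloud_total else "cloud"
    print(f"{utilization:.0%} utilization: cloud ~${cloud_total:,.0f} "
          f"vs on-prem ~${onprem_total:,.0f} -> {winner}")
```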

Colocation (Colo)

Company-owned hardware in a shared data center facility. Company provides servers; colo provides power, cooling, physical security, connectivity. Good middle ground — infrastructure control without building/operating own DC.

NVIDIA DGX Cloud

NVIDIA-managed DGX systems accessible via cloud API. Combines DGX performance with cloud accessibility. 1-, 3-month commitments. Available on Oracle Cloud, Azure, GCP, AWS.

Sustainability Considerations

On-prem allows custom renewable energy procurement. Cloud providers publish sustainability metrics (AWS, Google, Azure all have carbon commitments). Liquid cooling + renewables are key for GPU-intensive AI workloads.

Six memory hooks to lock in the most testable concepts from this topic — fast recall for exam day.

GPU Rack Power Rule

Traditional server rack: 5–10 kW. AI GPU rack: 30–60+ kW. One DGX H100 alone = ~10 kW. Always check facility power capacity before deploying GPU clusters — you'll need 3–6× more than traditional IT.

🪜 Cooling Ladder

0–30 kW/rack → Air cool. 30–60 kW → Direct Liquid Cooling (DLC) or rear-door. 60+ kW → Full immersion. B200 systems essentially require DLC — remember this exception specifically.

🧠 RDMA = Zero CPU

RDMA (Remote Direct Memory Access) = one server writes directly to another's memory bypassing the CPU. That's why InfiniBand and RoCE are fast for distributed training — no CPU = no bottleneck.

🏛️ DPU = 3rd Pillar

CPU processes apps. GPU processes AI math. DPU processes networking/storage/security. BlueField-3 frees CPU from I/O, encrypts traffic, offloads RDMA — without touching GPU cycles.

🔀 IB vs RoCE Choice

InfiniBand: lossless native, lowest latency, purpose-built HPC. RoCE: Ethernet-based, needs Priority Flow Control, slightly higher latency. Both support RDMA. IB = premium; RoCE = cost-effective at scale.

📡 SHARP = In-Network All-Reduce

NVIDIA Quantum-2 switches perform gradient aggregation inside the switch rather than at end nodes, so many contributions collapse into one aggregated result and far less data crosses the fabric. Critical optimization for 1000+ GPU training.

10 scenario-based questions covering power/cooling, networking protocols, DPUs, and data center operations.

Question 1 of 10
A data center planning to install 10 DGX H100 servers (each drawing ~10 kW) needs to provision adequate electrical capacity. What total power capacity should be planned for compute alone?
Question 2 of 10
A company is deploying a 60 kW GPU rack with direct liquid cooling. Which cooling component is placed directly on GPU and CPU chips to transfer heat to a chilled water loop?
Question 3 of 10
What makes RDMA critical for AI distributed training performance compared to standard TCP/IP networking?
Question 4 of 10
An AI cluster uses NVIDIA Quantum-2 InfiniBand switches with SHARP enabled. What specific benefit does SHARP provide during large model training?
Question 5 of 10
A cloud provider wants to allow multiple tenants to share GPU nodes while enforcing strict network isolation, encryption in transit, and zero-trust microsegmentation — all without consuming CPU cycles from the host. Which NVIDIA component enables this?
Question 6 of 10
What is the primary advantage of InfiniBand NDR (400Gb/s) over standard 100GbE Ethernet for AI training cluster interconnects?
Question 7 of 10
A company is planning a colocation deployment for their AI training cluster. What distinguishes colocation from building an on-premises data center?
Question 8 of 10
Which physical facility consideration is MOST critical when planning to deploy 20 DGX H100 systems in an existing raised-floor data center?
Question 9 of 10
NVIDIA DOCA is the programming framework for which NVIDIA hardware component?
Question 10 of 10
A large-scale AI training cluster experiences degraded all-reduce performance when more than 64 nodes participate simultaneously. Network utilization shows congestion during gradient synchronization bursts. Which combination of features BEST addresses this?


8 flashcards covering the most exam-critical facts.

Flashcard 1

DGX H100 power draw and cooling requirement

~10.2 kW per DGX H100 (8× H100 SXM5 + CPUs + fans). Rack with 2 DGX = ~21 kW. Supports air cooling AND direct liquid cooling (DLC). B200-based systems require DLC.

Flashcard 2

RoCE vs InfiniBand — what makes RoCE "converged"?

RoCE = RDMA over Converged Ethernet — runs RDMA on standard Ethernet infrastructure. Requires lossless Ethernet (Priority Flow Control + ECN). Lower switch cost than IB but needs careful QoS configuration.

Flashcard 3

BlueField DPU — three main offload categories

1. Networking (RDMA, routing, packet processing).
2. Storage (NVMe-oF initiator/target).
3. Security (encryption, microsegmentation, zero-trust).
All done on DPU ARM cores, freeing CPU for applications.

Flashcard 4

SHARP — what does it stand for and what does it do?

Scalable Hierarchical Aggregation and Reduction Protocol. Performs all-reduce inside InfiniBand switches rather than at end nodes. Cuts the data that must traverse the fabric and keeps all-reduce latency nearly flat as the cluster scales.

Flashcard 5

PUE — formula and what "good" looks like for AI

PUE = Total Facility Power ÷ IT Equipment Power.
Ideal = 1.0. Good AI DC = 1.1–1.3. High GPU density + liquid cooling achieves lower PUE than air-cooled dense deployments.

Flashcard 6

Direct Liquid Cooling (DLC) vs air cooling — when required?

Air cooling: adequate up to ~30 kW/rack. DLC (cold plates on chips): needed for 30–60+ kW; required for B200 at full TDP. DLC achieves a much better (lower) PUE by removing heat directly at the source.

Flashcard 7

NVIDIA Quantum-2 switch — key spec

400Gb/s NDR per port. Up to 64 ports per switch (QM9700). Non-blocking fat-tree topology. Supports SHARP in-network aggregation. Used in DGX POD/SuperPOD.

Flashcard 8

Out-of-band management network — purpose?

Separate 1GbE network for BMC/iDRAC (IPMI) access. Always powered even when server is off. Used for: remote power on/off, console access, firmware updates, health monitoring. Must never share with data plane.

Study plans for Data Center, Networking & DPUs, tailored by experience level.

Beginners — Build the Foundation

  • Learn GPU TDP values and how to calculate rack power (10 kW × servers = total compute kW)
  • Understand air vs liquid cooling basics — the cooling ladder from air to DLC to immersion
  • Study what RDMA means and why bypassing the CPU matters for distributed training
  • Familiarize yourself with InfiniBand as a networking technology and why it differs from Ethernet
  • Learn the three DPU offload categories: networking, storage, security

