Data Center Power & Cooling for AI
Modern AI infrastructure has transformed power and cooling from afterthoughts into primary design constraints. A single rack of NVIDIA GB200 NVL72 systems demands ~120 kW — enough to power a small building. Understanding these fundamentals is essential for every NCP-AII certified professional.
Why This Topic Matters: GPU TDPs have risen steeply with each generation. The H100 SXM5 at 700 W (vs. the A100 at 400 W) changed facility design requirements, and the GB200 NVL72 at ~120 kW per rack makes traditional air cooling obsolete. Power and cooling are now the binding constraints for AI cluster scale-out.
| GPU / System | TDP | Notes |
|---|---|---|
| A100 SXM4 | 400 W | Air viable |
| H100 PCIe | 350 W | Air viable |
| H100 SXM5 | 700 W | DLC preferred |
| B200 SXM | ~1,000 W | DLC required |
| DGX H100 | 10.2 kW | 8× H100 SXM5 |
| GB200 NVL72 | ~120 kW | Liquid / direct |
| Concept | Value / Rule |
|---|---|
| PUE (perfect) | 1.0 — impossible in practice |
| PUE (hyperscale) | 1.1 – 1.2 |
| PUE (average DC) | 1.4 – 1.6 |
| Air cooling max | ~25–30 kW/rack |
| DLC capacity | 50–120 kW/rack |
| Immersion capacity | 100+ kW/tank |
| kW vs kVA | kW = real; kVA = apparent (UPS) |
Every watt flowing into a GPU travels through this chain, and each stage adds loss to power delivery: Grid → Step-Down Transformer → ATS (utility/generator) → UPS → Main PDU → Rack PDU → Server PSU → VRM → GPU (detailed stage by stage in the table below).
ATS = Automatic Transfer Switch (utility ↔ generator). The UPS bridges the ~10–30 second gap until the generator starts. Each conversion step reduces efficiency; this is why 80 Plus certification matters for PSUs.
A PUE of 1.2 means 20% overhead — for every 100 kW of IT load, 20 kW is lost to cooling, lighting, UPS, and distribution losses.
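The relationship is simple multiplication, which makes it easy to sanity-check. A minimal sketch (the load value is illustrative):

```python
def facility_power_kw(it_load_kw: float, pue: float) -> float:
    """Grid power required for a given IT load at a given PUE."""
    return it_load_kw * pue

it = 100.0                            # kW of IT load (illustrative)
total = facility_power_kw(it, 1.2)    # PUE 1.2
print(f"Facility: {total:.0f} kW, overhead: {total - it:.0f} kW")
```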
PSU efficiency ratings at 50% load:
| Tier | Efficiency |
|---|---|
| Titanium | 96% |
| Platinum | 94% |
| Gold | 90% |
| Silver | 88% |
| Bronze | 82% |
| Class | Inlet Temp Range |
|---|---|
| A1 | 15–32°C (most stringent) |
| A2 | 10–35°C |
| A3 | 5–40°C |
| A4 | 5–45°C (most relaxed) |
Power Fundamentals
Understanding electrical concepts — PUE, kW vs kVA, power chains, and PSU efficiency — is foundational for AI data center planning.
PUE measures how efficiently a data center uses power. An IT load of 1,000 kW with PUE 1.5 requires 1,500 kW from the grid — 500 kW lost to cooling, UPS losses, lighting, and power distribution.
Strategies to reduce PUE: hot/cold aisle containment, outside air economization, liquid cooling (removes heat at source, no chiller needed), higher ASHRAE inlet temp setpoints (reduces chiller work), and on-site renewable generation.
kW (kilowatts) = Real Power — the actual power consumed and converted to work (heat, computation). This is what your electricity bill measures and what GPU TDP specifications use.
kVA (kilovolt-amperes) = Apparent Power — the product of RMS voltage × RMS current. Always ≥ kW. UPS units, PDUs, and generators are rated in kVA because they must handle the full current draw regardless of power factor.
Power Factor (PF) = kW / kVA. Modern server PSUs achieve PF ≈ 0.99. Older equipment may have PF 0.6–0.8, requiring oversized UPS capacity.
Exam Trap: GPU TDPs and server power consumption are quoted in kW (real power). UPS and PDU ratings are in kVA. Never compare them directly without accounting for power factor.
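To make the kW-vs-kVA trap concrete, here is a minimal sizing sketch; the load and power-factor values are illustrative:

```python
def required_kva(real_kw: float, power_factor: float) -> float:
    """Apparent power a UPS or PDU must be rated to carry."""
    return real_kw / power_factor

# A 100 kW real load: modern PSU vs. legacy equipment power factor
for pf in (0.99, 0.70):
    print(f"PF {pf:.2f}: 100 kW real -> {required_kva(100, pf):.0f} kVA rating")
```

The legacy case needs a ~43% larger kVA rating for the same real load, which is exactly the oversizing the exam trap warns about.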
Online Double-Conversion (most common in AI DCs): Always running through inverter/rectifier — zero transfer time, best isolation from grid disturbances. Required for sensitive GPU compute clusters.
Line-Interactive: Uses tap-changing transformer; 2–10 ms transfer time. Acceptable for non-critical loads.
Standby (offline): Switches on failure; 4–25 ms transfer. Not suitable for AI infrastructure.
UPS provides power during the generator start sequence — typically 10–30 seconds. Battery runtime is sized accordingly (not for extended outages — that's what generators are for).
N+1 / 2N redundancy: N+1 = one extra UPS module; 2N = full second UPS system (required for Tier III/IV facilities).
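Sizing the battery for the bridge window described above is a simple energy calculation. A sketch, assuming a 2× margin to cover battery aging and a failed first generator start (the margin is an assumption, not a standard):

```python
def ups_bridge_kwh(load_kw: float, bridge_s: float, margin: float = 2.0) -> float:
    """Usable battery energy to carry the load until the generator is online."""
    return load_kw * (bridge_s / 3600.0) * margin

# 391 kW facility load (illustrative), 30 s generator start window
print(f"{ups_bridge_kwh(391, 30):.1f} kWh of usable battery")  # ~6.5 kWh
```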
| Stage | Component | Function | Typical Loss |
|---|---|---|---|
| 1 | Utility Grid | High-voltage AC supply (typically 11–33 kV) | — |
| 2 | Step-Down Transformer | Reduces to 480V or 208V for facility distribution | 1–2% |
| 3 | ATS (Automatic Transfer Switch) | Switches between utility and generator; <100 ms transfer | <0.5% |
| 4 | Generator | Diesel/gas backup; provides power during utility outage | — |
| 5 | UPS | Bridges generator start time; double-conversion = ~95% efficient | 4–6% |
| 6 | Main PDU | Distributes and monitors power to floor rows | 1–2% |
| 7 | Rack PDU | Per-outlet metering; often A+B (redundant) feed | <1% |
| 8 | Server PSU | AC→DC conversion; 80 Plus Titanium = 96% efficient at 50% load | 4–18% |
| 9 | VRM (Voltage Regulator Module) | Final DC regulation to GPU cores (~1V) | 3–5% |
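End-to-end delivery efficiency is the product of the per-stage efficiencies. A minimal sketch using midpoint values from the table above (exact figures vary by facility):

```python
# Per-stage efficiencies, midpoints of the table's loss ranges (assumptions)
stages = {
    "transformer": 0.985, "ats": 0.997, "ups": 0.95,
    "main_pdu": 0.985, "rack_pdu": 0.995, "psu": 0.96, "vrm": 0.96,
}

delivered = 1.0
for name, eff in stages.items():
    delivered *= eff
print(f"End-to-end delivery efficiency: {delivered:.1%}")
# Roughly 84%: for every 100 kW drawn from the grid,
# only ~84 kW reaches the GPU cores as regulated DC.
```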
At scale, PSU efficiency is a significant operational cost. Consider a DGX SuperPOD with 32 × DGX H100 nodes at 10.2 kW each (326.4 kW of IT load): Titanium PSUs (96%) dissipate ~13.6 kW as conversion loss, while Gold PSUs (90%) dissipate ~36.3 kW, a difference of roughly 22.7 kW. The sketch below works through the arithmetic.
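A sketch of the Titanium-vs-Gold comparison (the electricity price is an assumption):

```python
IT_LOAD_KW = 32 * 10.2  # 326.4 kW of DGX H100 nodes

def psu_loss_kw(load_kw: float, efficiency: float) -> float:
    """AC input power lost as heat inside the PSUs at a given efficiency."""
    return load_kw / efficiency - load_kw

titanium = psu_loss_kw(IT_LOAD_KW, 0.96)  # ~13.6 kW
gold     = psu_loss_kw(IT_LOAD_KW, 0.90)  # ~36.3 kW
saved_kw = gold - titanium                # ~22.7 kW
dollars  = saved_kw * 8760 * 0.10         # at an assumed $0.10/kWh
print(f"Titanium saves {saved_kw:.1f} kW -> ${dollars:,.0f}/year")
```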
| GPU | Form Factor | TDP | Cooling Implication |
|---|---|---|---|
| A100 | SXM4 | 400 W | Air cooling viable with high-performance CRAH |
| A100 | PCIe | 300 W | Standard server air cooling |
| H100 | SXM5 | 700 W | DLC strongly preferred; air at limit |
| H100 | PCIe | 350 W | Air cooling viable |
| H200 | SXM5 | 700 W | Same die as H100; HBM3e difference |
| B200 | SXM | ~1,000 W | Direct liquid cooling required |
| DGX H100 | Full system | 10.2 kW | 8× H100 SXM5; specialized rack cooling |
| GB200 NVL72 | Full rack | ~120 kW | Factory-integrated direct liquid cooling |
Cooling Technologies
Three primary paradigms exist for cooling AI infrastructure: air, direct liquid cooling (DLC), and immersion. Each has distinct capacity limits, costs, and deployment tradeoffs.
Air cooling: traditional CRAC/CRAH units with hot/cold aisle containment. Viable for A100 and H100 PCIe at 300–400 W TDP.
- CRAC: Computer Room AC — DX refrigerant cooling
- CRAH: Computer Room Air Handler — uses chilled water from central chiller plant
- Hot/cold aisle containment: segregates airflows, improves delta-T efficiency
- Economizer mode: uses outside air when ambient temp is low enough
- PUE: typically 1.3–1.6 with chilled water CRAH
- CapEx: lowest of all cooling methods
- Limit: ~25–30 kW/rack before hot-spot formation
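The ~25–30 kW/rack air limit falls out of basic thermodynamics: removing more heat at a fixed delta-T demands impractical airflow volumes. A sketch using standard air properties:

```python
RHO_AIR = 1.2     # kg/m^3 at ~20 C
CP_AIR  = 1005.0  # J/(kg*K)

def airflow_m3s(heat_kw: float, delta_t_c: float) -> float:
    """Volumetric airflow needed to remove heat_kw at a given delta-T."""
    return heat_kw * 1000.0 / (RHO_AIR * CP_AIR * delta_t_c)

# A 30 kW rack at a 15 C cold-to-hot aisle delta-T
q = airflow_m3s(30, 15)
print(f"{q:.2f} m^3/s (~{q * 2118.88:.0f} CFM)")  # ~1.66 m^3/s, ~3,500 CFM
```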
Direct liquid cooling (DLC): cold plates on CPU/GPU plus liquid manifolds in the rack. H100 SXM5 and B200 require DLC for sustained workloads.
- Cold plates: metal plates with internal channels; attach directly to GPU/CPU die
- Liquid: typically 30–45°C supply water (warm water cooling possible)
- Rear-door heat exchanger (RDHx): attaches to back of rack, captures hot exhaust air
- CDU (Coolant Distribution Unit): manages coolant flow, pressure, temperature per rack
- PUE: 1.05–1.15 with warm water cooling (no chiller at mild ambient temps)
- GB200 NVL72: factory-integrated DLC, shipped as complete rack unit
- Requires building liquid infrastructure: manifolds, piping, leak detection
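Water's heat capacity per unit volume is roughly 3,500× that of air, which is why cold plates reach 120 kW/rack with modest flow rates. A sketch, assuming a water-like coolant:

```python
CP_WATER = 4186.0  # J/(kg*K)

def coolant_flow_lpm(heat_kw: float, delta_t_c: float) -> float:
    """Water flow (liters/minute) to absorb heat_kw at a given delta-T.
    Assumes a water-like coolant with density ~1 kg/L."""
    kg_per_s = heat_kw * 1000.0 / (CP_WATER * delta_t_c)
    return kg_per_s * 60.0

# A ~120 kW rack at a 10 C supply/return delta-T
print(f"{coolant_flow_lpm(120, 10):.0f} L/min per rack")  # ~172 L/min
```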
Immersion cooling: servers submerged in dielectric fluid. Highest density, near-perfect heat transfer. Two variants: single-phase and two-phase.
- Single-phase: fluid stays liquid; circulated through external heat exchanger
- Two-phase: fluid boils on hot components; vapor condenses on coils (higher efficiency)
- Fluid: engineered dielectric (non-conductive), e.g., mineral oil or fluorocarbon-based engineered fluids (Fluorinert-type)
- PUE: as low as 1.02–1.05 (near-perfect heat capture)
- NVIDIA validation: specific fluids and immersion durations approved per GPU model
- Drawbacks: upfront cost, fluid management complexity, limited tooling access
- Enables overclocking / sustained boost clocks not possible in air
Without containment, supply air mixes with exhaust air before reaching equipment intakes — causing cooling inefficiency and hot spots.
Cold aisle containment: Encloses the cold aisle with doors and ceiling panels. Server intakes pull cold air exclusively from the contained space. More common.
Hot aisle containment: Encloses the hot exhaust aisle, channeling hot air directly to CRAH return plenum. Reduces risk of hot air recirculation into adjacent aisles.
Effective containment can improve cooling efficiency by 20–30%, allowing higher rack densities with existing infrastructure.
Recommended airflow design:
- Raised floor: cold air delivered via perforated tiles beneath racks
- Blanking panels: fill empty rack spaces to prevent air bypass
- Supply temperature: 18–22°C cold aisle target
- Return temperature: 35–45°C hot aisle (higher = more efficient chiller operation)
- Variable speed fans: match airflow to actual heat load
ASHRAE A1 limit: 15–32°C inlet temperature at equipment intake. Most NVIDIA GPUs require A1 compliance for full-speed sustained operation.
| Method | Rack Density | Typical PUE | CapEx | Water Use | Best For |
|---|---|---|---|---|---|
| Air (CRAC/CRAH) | ≤30 kW | 1.3–1.6 | Low | Minimal | A100, H100 PCIe |
| Air + Economizer | ≤30 kW | 1.1–1.3 | Medium | Low | Moderate climates |
| Rear-door HX (RDHx) | 30–60 kW | 1.1–1.2 | Medium | Low | Retrofit/hybrid |
| Direct Liquid (cold plate) | 50–120 kW | 1.05–1.15 | High | Low–moderate | H100/B200 SXM, DGX |
| Immersion (single-phase) | 100+ kW | 1.03–1.08 | Very high | None | Extreme density |
| Immersion (two-phase) | 100+ kW | 1.02–1.05 | Very high | None | Maximum efficiency |
Thermal Management
ASHRAE thermal guidelines define safe operating envelopes for IT equipment. Understanding these classes and GPU thermal throttling behaviors is critical for sustained AI workload performance.
ASHRAE (American Society of Heating, Refrigerating and Air-Conditioning Engineers) defines standardized inlet temperature and humidity ranges that IT equipment must tolerate. Classes apply to temperature measured at the equipment intake, not the room ambient.
| Class | Inlet Temp (°C) | Max Humidity | Typical Equipment | Stringency |
|---|---|---|---|---|
| A1 | 15 – 32°C | 80% RH, 17°C dew point | Mission-critical, enterprise servers | Most Stringent |
| A2 | 10 – 35°C | 80% RH, 21°C dew point | Standard servers, networking | Moderate |
| A3 | 5 – 40°C | 85% RH, 24°C dew point | Ruggedized/industrial servers | Relaxed |
| A4 | 5 – 45°C | 90% RH, 24°C dew point | Hardened/outdoor deployments | Most Relaxed |
Key Exam Fact: A1 is the most stringent (narrowest, coolest range). A4 is the most relaxed (widest, warmest range). Most NVIDIA enterprise GPUs require A1 compliance for sustained full performance.
NVIDIA GPUs implement hardware thermal protection through two mechanisms:
Power Throttling (SW thermal threshold): GPU reduces power consumption before hitting thermal limit. Maintains operation at reduced performance. Triggered ~83–85°C for most data center GPUs.
Hardware Thermal Shutdown: GPU halts if junction temperature exceeds safe limit (~90–95°C for H100). Requires system reboot.
nvidia-smi monitoring:
```bash
nvidia-smi dmon -s pucvmet -d 5
# Shows: power draw, utilization, clock speeds, temp
# Throttle reason codes appear in the violations column

nvidia-smi -q -d PERFORMANCE
# Detailed throttle reasons: thermal, power, sync boost, etc.
```
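For scripted monitoring, the same data can be pulled through nvidia-smi's query interface. A minimal polling sketch; the 85°C alert threshold mirrors the throttle onset described above:

```python
import subprocess

ALERT_TEMP_C = 85  # approximate throttle onset for data center GPUs

# Query per-GPU index, temperature, and power draw as bare CSV
out = subprocess.check_output(
    ["nvidia-smi",
     "--query-gpu=index,temperature.gpu,power.draw",
     "--format=csv,noheader,nounits"],
    text=True,
)
for line in out.strip().splitlines():
    idx, temp, power = (f.strip() for f in line.split(","))
    flag = "  <-- check cooling/airflow" if float(temp) >= ALERT_TEMP_C else ""
    print(f"GPU {idx}: {temp} C, {power} W{flag}")
```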
Front-to-back airflow: Industry standard. Cold air enters front bezel, hot air exhausts rear. Critical for hot/cold aisle containment compatibility.
CFD (Computational Fluid Dynamics): Used in facility planning to model airflow, identify hot spots, and optimize CRAH placement before physical deployment.
Delta-T target: The temperature rise from cold aisle to hot aisle should be 10–20°C. Too small = overcooling (wasted energy). Too large = potential hot spots.
Airflow matching: NVIDIA's high-density GPU systems require server fans rated to overcome high static pressure, since densely packed GPU cards restrict internal airflow.
| Metric | Tool | Normal Range | Action Required If |
|---|---|---|---|
| GPU Temperature (°C) | nvidia-smi -q -d TEMPERATURE | <80°C sustained | >85°C — check cooling, airflow |
| GPU Power Draw (W) | nvidia-smi -q -d POWER | Near TDP at load | <TDP under full load = throttling |
| Memory Temperature | nvidia-smi -q -d TEMPERATURE | <85°C (HBM) | >90°C — HBM thermal design issue |
| Fan Speed | nvidia-smi -q -d FAN | Auto-managed | 100% sustained = cooling deficiency |
| Throttle Reasons | nvidia-smi -q -d PERFORMANCE | None active | Any thermal throttle = escalate |
| DCGM Health | DCGM health-check | Pass all | Failures → GPU replacement ticket |
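The "power draw below TDP under full load" signal from the table can be automated. A hypothetical watchdog sketch built on the pynvml bindings (the 90% utilization and 0.8 × limit thresholds are assumptions, not NVIDIA guidance):

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        h = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu        # percent
        draw = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0         # W
        limit = pynvml.nvmlDeviceGetEnforcedPowerLimit(h) / 1000.0
        # Busy GPU drawing well under its power limit hints at throttling
        if util > 90 and draw < 0.8 * limit:
            print(f"GPU {i}: busy ({util}%) but only {draw:.0f}/{limit:.0f} W "
                  f"-> investigate thermal or clock throttling")
finally:
    pynvml.nvmlShutdown()
```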
Facility Level:
- Maintain cold aisle inlet at 18–27°C (within A1 range)
- Use DCIM (Data Center Infrastructure Management) tools for real-time thermal mapping
- Hot aisle temperature ≤45°C to protect CRAH efficiency
- Install temperature sensors at intake, mid-rack, and exhaust
- All blanking panels installed; no empty rack positions uncovered
Server/GPU Level:
- Verify TIM (Thermal Interface Material) integrity at GPU installation
- For DLC: verify cold plate seating torque per NVIDIA spec
- Monitor coolant flow rate and supply/return temperature differential
- Set appropriate power limits with `nvidia-smi -pl <watts>`
- Enable DCGM health monitoring for continuous GPU thermal telemetry
AI Data Center Design
Designing power and cooling infrastructure for AI clusters requires bottom-up power budgeting, PUE-adjusted facility planning, and redundancy architecture matched to the scale of deployment.
A DGX H100 SuperPOD is the reference AI cluster design. Building the complete power budget from the bottom up:
DGX H100 SuperPOD — Full Power Budget: 32 nodes × 10.2 kW = 326.4 kW of compute, plus ~50 kW for networking, storage, and management, gives ~376 kW of IT load; at PUE 1.2 that is ~451 kW of facility power (reproduced in the sketch below).
GPU count: 32 × 8 = 256 H100 GPUs. Theoretical FP8 throughput (with sparsity): 256 × 3,958 TFLOPS ≈ 1,013 PFLOPS ≈ ~1 ExaFLOP FP8.
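The same budget as a script, so node counts and margins can be varied; the networking/storage figure and the 25% headroom are planning assumptions:

```python
NODES, NODE_KW = 32, 10.2
NET_STORAGE_KW = 50.0      # assumption: fabric switches, storage, management
PUE, HEADROOM = 1.2, 1.25  # headroom: mid-range of the 20-30% growth margin

it_load = NODES * NODE_KW + NET_STORAGE_KW  # ~376 kW
facility = it_load * PUE                    # ~452 kW
provisioned = facility * HEADROOM           # ~565 kW to size UPS/generators
print(f"IT {it_load:.0f} kW -> facility {facility:.0f} kW "
      f"-> provision {provisioned:.0f} kW")
```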
The GB200 NVL72 is a single rack unit containing 72 B200 GPUs + 36 Grace CPUs. It represents the extreme of current AI infrastructure density.
Single GB200 NVL72 Rack: 72 × B200 (~1,000 W each) plus 36 Grace CPUs, NVLink switch trays, fans, and power conversion losses bring the total to ~120 kW.
Compare to H100 SXM5: 120 kW ÷ 700 W ≈ 171, i.e., one GB200 NVL72 rack draws as much power as roughly 171 discrete H100 SXM5 GPUs.
Infrastructure implications:
- Power circuits: requires dedicated high-amperage feeds (often 480 V 3-phase); see the sizing sketch after this list
- Floor loading: DLC fluid adds weight; verify structural capacity
- Chilled water supply: CDUs require facility chilled water or process cooling water loop
- Leak detection: mandatory with in-rack liquid; sensors at every manifold
- UPS: must handle 120 kW/rack × rack count — typically dedicated UPS modules per cluster
- Generator sizing: AI clusters require N+1 or 2N generator capacity at full load
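Circuit sizing for the first bullet is straightforward three-phase arithmetic; a sketch (power factor assumed ~0.99):

```python
import math

def three_phase_amps(load_kw: float, volts: float, pf: float = 0.99) -> float:
    """Line current for a balanced three-phase load."""
    return load_kw * 1000.0 / (math.sqrt(3) * volts * pf)

amps = three_phase_amps(120, 480)   # GB200 NVL72 rack on a 480 V feed
print(f"~{amps:.0f} A continuous")  # ~146 A
# Many electrical codes size breakers at 125% of continuous load -> ~182 A
```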
| Tier | Architecture | Availability | Use Case |
|---|---|---|---|
| Tier I | Single path, no redundancy | 99.671% | Dev/test environments |
| Tier II | N+1 redundant components | 99.741% | Internal enterprise |
| Tier III | Concurrently maintainable (N+1 paths) | 99.982% | Commercial AI DCs |
| Tier IV | Fault tolerant (2N paths) | 99.995% | Mission-critical AI |
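The availability percentages translate directly into allowed downtime per year, which is often how exams probe them:

```python
HOURS_PER_YEAR = 8760

for tier, avail in [("I", 0.99671), ("II", 0.99741),
                    ("III", 0.99982), ("IV", 0.99995)]:
    downtime_h = (1 - avail) * HOURS_PER_YEAR
    print(f"Tier {tier:>3}: ~{downtime_h:.1f} h/yr allowed downtime")
# Tier I ~28.8 h, Tier II ~22.7 h, Tier III ~1.6 h, Tier IV ~0.4 h
```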
DGX H100 server PSU redundancy: Each DGX H100 ships with redundant PSUs. Best practice: connect A-feed and B-feed from separate PDUs on separate UPS/generator chains.
WUE (Water Usage Effectiveness): Liters of water per kWh of IT load. Chiller-based cooling consumes significant water for evaporative towers. DLC with dry coolers eliminates most water consumption.
CUE (Carbon Usage Effectiveness): kg CO₂ per kWh of IT load. Driven by local grid carbon intensity. On-site solar/wind reduces CUE.
ERE (Energy Reuse Effectiveness): measures how much waste heat is recovered and reused (e.g., warming buildings); defined as (Total Facility Energy − Reused Energy) ÷ IT Energy, so lower is better. DLC enables waste heat recovery at useful temperatures (40–60°C supply water).
NVIDIA's Sustainability Focus:
- GB200 NVL72: DLC from factory minimizes PUE overhead
- NVLink switching reduces inter-GPU traffic (vs PCIe) saving switch power
- MIG on A100/H100 improves GPU utilization (less idle power)
- FP8/FP4 precision: more work per watt vs FP32
- DCGM power capping: enforce cluster-wide power limits without sacrificing SLA
Memory Hooks & Advisor
Mnemonics, patterns, and quick-reference guidance to lock in the most exam-critical power and cooling concepts.
PUE Formula
PUE = Total Facility Power ÷ IT Equipment Power
Perfect = 1.0 | Hyperscale = 1.1–1.2
H100 SXM5 TDP
700 W (DLC preferred)
vs H100 PCIe = 350 W, B200 ≈ 1,000 W
ASHRAE A1 Inlet Range
15 – 32°C
Most stringent class (narrowest, coolest)
Air Cooling Max Rack Density
~25–30 kW/rack
Above this → hot spots form; DLC required
GB200 NVL72 Power Draw
~120 kW per rack
72× B200 + 36× Grace CPU
Factory-integrated DLC
DGX H100 System TDP
10.2 kW total system
8× H100 SXM5 (700 W each) + CPU/mem/NVMe/net
80 Plus Titanium % at 50% Load
96% efficient
Gold=90%, Platinum=94%, Titanium=96%
What Does the UPS Do?
Bridges 10–30 sec generator start time during utility outage. Double-conversion type provides zero transfer time and best power quality.
⚡ Power Fundamentals
- PUE = Total Facility Power ÷ IT Equipment Power. Perfect = 1.0 (unachievable). Hyperscale leaders target 1.1–1.2 with DLC and economizers.
- kW (kilowatts) is real power — what GPUs draw, what your bill measures. kVA is apparent power — what UPS and PDUs are rated in. kVA ≥ kW always; PF = kW/kVA.
- 80 Plus Titanium = 96% PSU efficiency at 50% load. At cluster scale, choosing Titanium over Gold saves 6% of PSU losses — tens of kW in a SuperPOD.
- GPU TDP (Thermal Design Power) is a cooling design target, not a guaranteed maximum draw. Actual draw varies with workload; sustained FP8 training often approaches TDP, while idle draw is much lower (40–60 W).
- Power budget formula: Total Facility Power = IT Load × PUE. IT load includes GPUs, CPUs, memory, storage, networking — not just GPU TDP alone.
- DGX H100 SuperPOD: 32 × 10.2 kW = 326.4 kW IT load. At PUE 1.2 = ~391 kW facility power total.
💧 Cooling Technology Selection
- Air cooling limit: ~25–30 kW/rack. Sufficient for A100 PCIe (300 W) and H100 PCIe (350 W) in standard server configurations with <4 GPUs.
- H100 SXM5 (700 W): DLC strongly preferred. 8 × 700 W = 5,600 W in GPUs alone. DGX H100 at 10.2 kW exceeds practical air cooling rack limits.
- DLC (cold plates): 50–120 kW/rack. Requires facility chilled water infrastructure: supply/return pipes, CDU per rack, leak detection. PUE 1.05–1.15.
- Rear-door heat exchanger (RDHx): attaches to existing racks; captures hot exhaust air in a liquid-cooled door. Retrofit-friendly but lower capacity than full cold plates.
- Immersion: 100+ kW/tank. Highest density and lowest PUE (1.02–1.05). Requires NVIDIA validation of dielectric fluid type and immersion duration per GPU model.
- GB200 NVL72 (~120 kW): ships with factory-integrated DLC. Customer connects facility chilled water — no additional cooling hardware selection required.
🌡️ Thermal Management
- ASHRAE A1 (15–32°C) is the most stringent class — required for mission-critical enterprise GPUs. A1 → A2 → A3 → A4: strictness decreases, allowed temperature range widens.
- GPU thermal throttling begins at ~83–85°C junction temperature. Hardware shutdown occurs at ~90–95°C. Monitor with `nvidia-smi -q -d TEMPERATURE`.
- `nvidia-smi -q -d PERFORMANCE` shows active throttle reasons: thermal, power, sync boost, board limit — each has a distinct root cause and remediation.
- Cold aisle target: 18–27°C. Hot aisle: ≤45°C. Delta-T of 10–20°C across the server is normal. >20°C delta suggests inadequate airflow volume.
- Blanking panels are mandatory — uncovered rack slots allow hot exhaust to recirculate into cold aisle intakes, raising effective GPU inlet temperature.
- DCGM (Data Center GPU Manager) provides continuous health monitoring: temperature, power, utilization, ECC errors — essential for production AI cluster operations.
🏗️ AI DC Infrastructure Design
- Power budget process: count all IT loads (GPU nodes + networking + storage + management), multiply by PUE, add growth headroom (20–30%), size UPS and generators accordingly.
- DGX H100 SuperPOD: 32 nodes × 10.2 kW + ~50 kW networking/storage = ~376 kW IT. At PUE 1.2 = ~451 kW facility power. 256 H100 GPUs, ~1 ExaFLOP FP8.
- GB200 NVL72 at ~120 kW/rack: requires dedicated high-amperage 3-phase circuits (480V typical), facility chilled water, structural floor loading assessment, in-rack leak detection.
- Sustainability metrics: WUE (water usage per kWh IT), CUE (CO₂ per kWh IT), ERE (energy reuse effectiveness). DLC with dry coolers eliminates most water use; enables waste heat recovery.
- Power redundancy: Tier III = N+1 paths (concurrently maintainable); Tier IV = 2N paths (fault tolerant). AI production clusters typically require Tier III minimum.
- Dual PSU feeds: connect DGX A-PSU and B-PSU to separate PDU chains on independent UPS/generator paths — critical for maintaining GPU cluster availability during single-chain failures.
🔗 Power Chain & Reliability
- Power chain order: Grid → Transformer → ATS → Generator → UPS → PDU → Rack PDU → Server PSU → VRM → GPU. Each hop has losses — total efficiency product determines delivered power.
- ATS (Automatic Transfer Switch): switches between utility and generator in <100 ms. Does not store energy — that's the UPS's job. ATS transfers; UPS sustains.
- Generator start time: 10–30 seconds typical. UPS battery runtime must cover this window plus margin. UPS batteries are not sized for extended outages.
- Online double-conversion UPS: always running through inverter/rectifier = zero transfer time, best power quality, best protection for GPU compute. Required for AI infrastructure.
- N+1 redundancy: one extra UPS module or generator beyond what is needed. 2N redundancy: complete second independent power chain (highest cost, highest availability).
- PSU loss at scale: Titanium (96%) vs Gold (90%) across 326 kW IT load = ~22 kW difference = ~$19,000/year at $0.10/kWh — a significant OPEX justification for premium PSUs.