NCP-AIN · Topic 2 · InfiniBand Operations

InfiniBand
Configuration, Optimization
& Troubleshooting

Subnet managers, adaptive routing, congestion control, RDMA tuning, and ibdiagnet diagnostics — every operational layer of the IB fabric.

Four Operational Pillars of InfiniBand
From fabric bring-up through performance tuning and fault isolation — the operational skills NCP-AIN tests.
🧠
Subnet Manager
Fabric discovery, LID assignment, routing table computation (LFTs), OpenSM configuration, master vs standby SM
Key daemon: opensm
Config file: /etc/opensm/opensm.conf
🔀
Routing & QoS
Fat-tree, adaptive routing, SL-to-VL mapping, Virtual Lanes, partition keys (P_Key), and QoS policies
Routing engines: 4+
Max VLs: VL0–VL15 (16 total)
⚡
RDMA Tuning
MTU, queue depth, CQ moderation, GID index, SR-IOV, SHARP for in-network AllReduce, C-state disabling
MTU target: 4096 B
SHARP gain: up to 50% AllReduce speedup
🔬
Diagnostics & Troubleshooting
ibdiagnet, ibstat, ibping, perfquery, smpquery — counter-based fault isolation across link, switch, and fabric layers
Primary tool: ibdiagnet
Port counters: SymbolErrors, RcvErrors…
Why InfiniBand Operations Matter for AI Clusters

A misconfigured SM routing algorithm can reduce AllReduce bandwidth by 30–40%. A single bad cable introducing symbol errors causes retransmissions that spread across the entire NCCL ring. The NCP-AIN exam tests your ability to configure, tune, and diagnose the IB fabric — not just describe its architecture.

IB Fabric Bring-Up Sequence

What happens when OpenSM starts

The SM must complete all five phases before any host-to-host traffic can flow.
🔎
Phase 1 — Discovery (Subnet Sweep)
SM sends SMPs (Subnet Management Packets) to enumerate all nodes, switches, and links. Builds a complete topology graph.
SMP-based
🏷️
Phase 2 — LID Assignment
SM assigns a unique Local Identifier (LID, 1–49151) to every HCA port and switch port. LIDs are the IB routing addresses.
16-bit address
🗺️
Phase 3 — Route Computation
SM runs the selected routing algorithm (fat-tree, MINHOP, DFSSSP, etc.) to build forwarding tables (Linear Forwarding Tables, LFT) for every switch.
LFT per switch
📥
Phase 4 — Table Programming
SM writes all LFTs to every switch using directed-route SMPs. Each switch now knows which output port to use for any destination LID.
Programmed via SMP
✅
Phase 5 — Activate (SM Active)
SM sets port states to ACTIVE. Hosts can now resolve GIDs to LIDs via SA (Subnet Administrator) queries and begin RDMA operations.
Traffic flows
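
One way to watch this sequence complete from a host, using standard infiniband-diags commands (the log path assumes the opensm.conf default shown below):

# Follow the SM through discovery, LID assignment, routing, and activation
tail -f /var/log/opensm.log

# Confirm bring-up finished: a master SM exists and the local port is ACTIVE
sminfo              # prints the master SM's LID, GUID, priority, and state
ibstat mlx5_0       # State: Active / Physical state: LinkUp after Phase 5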
Configuration & Optimization Deep Dive
The exact parameters and mechanisms the exam tests — with recommended values for AI cluster fabrics.

OpenSM Key Configuration Parameters

| Parameter | Default / Recommended | What It Controls | AI Cluster Guidance |
|---|---|---|---|
| routing_engine | minhop → ftree | Routing algorithm used to compute LFTs | Use ftree for fat-tree topologies; dfsssp for irregular fabrics |
| sm_priority | 0 (range 0–15) | Master SM election — higher priority wins | Set the primary SM to 14, standby to 13. Never leave two SMs at the same priority. |
| sweep_interval | 10 seconds | How often the SM re-scans the fabric for changes | Keep at 10 s; reduce to 5 s for unstable fabrics during bring-up only |
| lmc | 0 (LID Mask Control) | Number of LIDs assigned per port (2^lmc) | Set lmc=2 (4 LIDs per port) for multipath routing with MINHOP |
| qos | disabled | Enables SL-to-VL QoS mapping | Enable for fabrics mixing RDMA and management traffic on the same links |
| log_file | /var/log/opensm.log | SM diagnostic log location | Monitor for "SM state change", "port error", and "rerouting" events |
| reassign_lids | 0 | Force LID reassignment on SM restart | Leave at 0 in production — LID changes trigger fabric-wide rerouting |
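
Assembled into one file, a minimal opensm.conf sketch for the primary SM using the recommended values above (parameter names are standard OpenSM options; the values are this table's guidance, not shipped defaults):

# /etc/opensm/opensm.conf — primary SM on a fat-tree AI fabric (sketch)
routing_engine ftree           # match the physical topology
sm_priority 14                 # set the standby SM to 13 — never equal priorities
sweep_interval 10              # seconds between fabric re-scans
lmc 0                          # raise to 2 only for MINHOP multipath
qos TRUE                       # enable SL-to-VL separation
log_file /var/log/opensm.log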

Routing Algorithms — Choose the Right One

Fat-Tree (ftree)
routing_engine ftree
Computes routes that maximize bisection bandwidth in a balanced fat-tree topology. Distributes flows evenly across all uplinks, preventing hot spots.
✅ Use for: Standard two-tier or three-tier fat-tree AI cluster fabrics. The default choice for most NVIDIA DGX SuperPOD deployments.
MINHOP
routing_engine minhop
Computes minimum-hop paths. Simple and fast to compute. Uses LMC to provide multiple equal-cost paths when lmc > 0.
✅ Use for: Simple test fabrics or small clusters. Not recommended for large AI training fabrics — it does not balance load across links as well as ftree.
DFSSSP
routing_engine dfsssp
Deadlock-Free Single-Source Shortest-Path routing. Handles irregular and asymmetric topologies. Supports enhanced QoS and is more flexible than ftree.
✅ Use for: Irregular fabrics, mixed switch generations, or fabrics where some switches have failed and the topology is no longer symmetric.
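
Switching engines is a config edit plus an SM restart. A sketch, assuming a systemd-managed opensm service:

# Change the routing engine, restart the SM, then confirm what actually loaded
sudo sed -i 's/^routing_engine .*/routing_engine ftree/' /etc/opensm/opensm.conf
sudo systemctl restart opensm
grep -i "routing engine" /var/log/opensm.log   # ftree falls back to minhop if the topology isn't a valid fat-tree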
Adaptive Routing — SM-Independent Dynamic Load Balancing

Unlike SM-computed static routing, Adaptive Routing (AR) is enabled per-switch and reroutes individual flows in real time to avoid congested paths. In NVIDIA Quantum switches (QM9700/QM8700) AR is configured via ib ar enable in MLNX-OS. AR requires all switches in the fabric to be AR-capable and is essential for rail-optimized fabrics where bursty AllReduce traffic creates transient hotspots.
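
On the switch CLI that looks roughly like the session below — a sketch built around the ib ar enable command named above plus MLNX-OS's standard mode-entry commands; verify the exact syntax against your MLNX-OS release:

switch > enable
switch # configure terminal
switch (config) # ib ar enable
# persist the change with your release's configuration-save command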

InfiniBand Congestion Control — IBCC Flow

How Enhanced Congestion Notification Works

IB congestion control prevents credit starvation without packet drops — unlike TCP, which relies on loss for backpressure.
🌊
Congestion Point
Switch buffer exceeds threshold on an output port
📡
FECN Mark
Switch marks Forward ECN bit in packet header
📨
BECN Return
Receiver sends Backward ECN notice back to sender
🐢
Sender Rate Limit
Sender throttles injection rate; recovers gradually

Key difference from RoCE: IB uses credit-based flow control at the link layer — switches track buffer availability per VL. FECN/BECN augments this for end-to-end congestion. Packet loss from congestion should be essentially zero in a well-configured IB fabric.
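
This is why IB congestion shows up as waiting, not loss. A quick check with perfquery's extended counters (<lid> and <port> are placeholders for the suspect switch port):

# Extended counters include PortXmitWait — ticks spent stalled waiting for credits
perfquery -x <lid> <port>
# Rising PortXmitWait with zero SymbolErrors = backpressure/congestion, not a bad cable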

Service Levels (SL) and Virtual Lanes (VL)

| SL | Maps to VL | Traffic Type | Typical Use in AI Clusters |
|---|---|---|---|
| SL0 | VL0 | Data | NCCL AllReduce, RDMA Write — bulk GPU-to-GPU collective traffic |
| SL1 | VL1 | Management | SM packets (SMPs), SA queries — must never be blocked by data traffic |
| SL2 | VL2 | Storage | Separate VL for storage traffic (GPFS, Lustre) to isolate it from training traffic |
| SL5 | VL5 | SHARP | SHARP AllReduce aggregation traffic — requires a dedicated VL for correct operation |
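
In opensm.conf this mapping is expressed through the QoS options. A sketch — qos_sl2vl takes 16 comma-separated VL values indexed by SL; the values here mirror the table and are illustrative, not mandated:

qos TRUE
qos_max_vls 8                  # data VLs available for SL mapping
# SL-to-VL map, one entry per SL 0–15 (SL0→VL0, SL1→VL1, SL2→VL2, SL5→VL5, …)
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7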

RDMA Performance Tuning Parameters

MTU
Maximum Transmission Unit
Larger MTU reduces per-message header overhead for bulk AI traffic. IB supports 256 / 512 / 1024 / 2048 / 4096 bytes.
✅ Set MTU=4096 for AI training
tx_depth / rx_depth
Send/Receive Queue Depth
Deeper queues allow more in-flight RDMA operations, hiding latency on long-RTT paths across large fabrics.
✅ Set to 1024 for large clusters
CQ moderation
Completion Queue Interrupt Batching
Batch multiple completions into one interrupt event. Reduces CPU overhead from interrupt storms at high message rates.
✅ Enable; batch 32–64 events
C-states (idle CPU)
CPU Power Management
Deep C-states (C6, C7) add 10–100µs wake-up latency. For RDMA polling loops this causes jitter that disrupts tight AllReduce synchronization.
✅ Disable: cpupower idle-set -D 0
NVIDIA SHARP
In-Network AllReduce
Offloads AllReduce aggregation into IB switches, eliminating host-to-host data round trips. Requires SHARP-capable Quantum switches.
✅ Up to 50% AllReduce speedup
GID index
RoCE / IB GID Selection
Each HCA port has multiple Global Identifiers. GID index 0 is the link-local address; higher indices are RoCE v2 IP-based GIDs. NCCL must be configured with the correct GID index.
✅ NCCL_IB_GID_INDEX=3 (RoCE v2)
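
A quick way to validate the MTU and queue-depth choices before a full NCCL run, assuming the perftest package is installed (-m sets MTU, -t sets tx-depth; the hostname is hypothetical):

# Server side:
ib_write_bw -d mlx5_0 -m 4096 -t 1024
# Client side:
ib_write_bw -d mlx5_0 -m 4096 -t 1024 dgx-node-02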

Key Commands — Configuration & Tuning

# Check HCA port state and speed
ibstat mlx5_0
  State: Active   Physical state: LinkUp   Rate: 400

# Set fabric MTU to 4096 via the partition definition (mtu code 5 = 4096 B)
/etc/opensm/partitions.conf:  Default=0x7fff, mtu=5 : ALL=full;

# Disable CPU C-states to reduce latency jitter
cpupower idle-set -D 0

# Check the active SM and its GUID
sminfo

# Verify partition keys on a port
smpquery pkeys <lid> <port>

# Set NCCL to use IB with RoCE v2 GID index 3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
Key Concepts Compared
Side-by-side breakdowns to sharpen exam accuracy on easily confused concepts.

Routing Algorithms Compared

| Property | Fat-Tree (ftree) | MINHOP | DFSSSP | Adaptive Routing |
|---|---|---|---|---|
| Configured in | opensm.conf | opensm.conf | opensm.conf | Switch (MLNX-OS) |
| Route type | Static (at init) | Static (at init) | Static (at init) | Dynamic (per-flow) |
| Best topology | Balanced fat-tree | Any simple fabric | Irregular/asymmetric | Any (complements static) |
| Load balancing | ✅ Excellent (ECMP) | ⚠️ Moderate | ✅ Good | ✅ Best (real-time) |
| Requires SHARP? | No | No | No | No (independent) |
| AI cluster default | ✅ Yes | ❌ No | Fallback | ✅ Strongly recommended |

SM Master vs Standby vs Rogue SM

| State | Master SM | Standby SM | Rogue SM |
|---|---|---|---|
| Count per fabric | Exactly 1 | 0 or more | 0 (unwanted) |
| Programs LFTs? | ✅ Yes | ❌ No (monitors only) | ⚠️ Yes — causes chaos |
| Elected by | Highest sm_priority wins | Lower priority defers | Unconfigured SM on a host |
| Failover | n/a | Promotes to master if the master fails | May become master unexpectedly |
| Detection | opensm.log "Master" | opensm.log "Standby" | ibdiagnet reports multiple SMs |
| Fix if rogue | n/a | n/a | Disable opensm on the rogue host immediately |

Diagnostic Commands Compared

| Command | Scope | What It Shows | When to Use |
|---|---|---|---|
| ibdiagnet | Fabric-wide | Topology, errors on all ports, cabling, partition issues | First step — comprehensive fabric health scan |
| ibstat | Local HCA port | Port state, LID, SM LID, link speed, physical state | Verify the HCA is up and at the correct speed before an NCCL run |
| perfquery | Single port | Error counters: SymbolErrors, RcvErrors, XmtDiscards | Deep-dive on a specific suspect port/switch |
| ibping | End-to-end | Round-trip latency to a destination LID or GID | Test connectivity and measure baseline latency |
| smpquery | Fabric element | Raw SMP attributes: LFT entries, port info, P_Keys, SM info | Verify SM configuration was applied correctly |
| iblinkinfo | Fabric-wide | All link speeds and widths across the fabric | Find links running at the wrong speed (e.g., HDR instead of NDR) |

SHARP vs Standard AllReduce

| Property | Standard IB AllReduce | SHARP In-Network AllReduce |
|---|---|---|
| Where aggregation runs | On host CPUs (via NCCL) | Inside IB switches (dedicated tree) |
| Data traversal | All data goes host → switch → host multiple times | Data aggregated at each switch level; only the result returns to hosts |
| Bandwidth requirement | O(N) — grows with node count | O(1) — switch-local aggregation |
| Latency reduction | Baseline | Up to 50% for large AllReduce |
| GPU utilization impact | GPUs idle during AllReduce | GPUs partially freed during in-switch aggregation |
| Requirements | Any IB switch | NVIDIA Quantum switches (QM8700/QM9700) with a dedicated SHARP VL |
| NCCL config | Default | SHARP (collnet) plugin, e.g. NCCL_COLLNET_ENABLE=1 |
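
On the NCCL side, SHARP is reached through the collnet plugin — a sketch assuming the nccl-rdma-sharp-plugins package and the SHARP Aggregation Manager are already deployed on the fabric:

# Allow NCCL to offload supported collectives to the SHARP (collnet) plugin
export NCCL_COLLNET_ENABLE=1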
Common InfiniBand Issues & Fixes
Real failure scenarios with symptoms, diagnostic commands, and resolution steps.
Troubleshooting Workflow — Always Follow This Order

1. Run ibstat on all local HCA ports — confirm Active/LinkUp at the correct speed.
2. Run ibdiagnet fabric-wide — look for error counters and topology mismatches.
3. Check opensm.log for rerouting events, SM state changes, or "port error" entries.
4. Use perfquery on suspect switch ports — rising SymbolErrors point to the physical layer.
5. Use ibping to confirm end-to-end reachability before running NCCL.
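
The same order as a minimal shell pass (a sketch; <lid>, <port>, and <dest-lid> are placeholders for the suspect fabric elements):

for hca in mlx5_0 mlx5_1; do ibstat "$hca" | grep -E "State|Rate"; done   # 1. local ports
ibdiagnet                                                                 # 2. fabric-wide scan
grep -E "SM state change|port error|rerout" /var/log/opensm.log | tail    # 3. SM log events
perfquery <lid> <port>                                                    # 4. suspect port counters
ibping -c 3 <dest-lid>                                                    # 5. end-to-end reachability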

Physical Layer
Link Running at Wrong Speed (e.g., HDR instead of NDR)
Symptoms
  • NCCL bandwidth lower than expected
  • iblinkinfo shows 2× instead of 4× width
  • One rail has half the throughput of others
Diagnosis & Fix
  • Run iblinkinfo to list all link speeds
  • Check cable: mixed NDR/HDR cables may negotiate to lower speed
  • Inspect SFP/QSFP transceiver — wrong cable type
  • Replace cable; verify with ibstat Rate field = 400
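
A hedged one-liner for the first step — the grep pattern depends on how your iblinkinfo build prints speeds, so adjust it to your fabric's output:

# Show only links NOT reporting 4X width at NDR — candidates for bad or mixed cables
iblinkinfo | grep -v "4X.*NDR"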
SM Configuration
Rogue Subnet Manager Causing Fabric Instability
Symptoms
  • Fabric reroutes randomly, disrupting NCCL jobs
  • opensm.log shows "SM state change" frequently
  • ibdiagnet reports two or more active SMs
Diagnosis & Fix
  • Run ibdiagnet — look for "Multiple SM" warning
  • Run sminfo to find the master SM's GUID and LID
  • Identify rogue host via GUID, disable opensm there
  • Set sm_priority consistently across all legitimate SMs
Error Counters
Rising SymbolErrors — Physical Link Degradation
Symptoms
  • perfquery shows non-zero SymbolErrorCounter
  • Retransmissions increase — ibdiagnet flags port
  • Intermittent NCCL hangs or reduced bandwidth
Diagnosis & Fix
  • Run perfquery <lid> <port> and reset counters to track rate
  • Identify the switch port and physical cable segment
  • Reseat or replace the copper/fiber cable
  • If errors persist after cable swap, replace HCA or switch port module
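
The rate matters more than the absolute count, since counters persist from past events. A sketch using perfquery's reset flag:

perfquery <lid> <port>              # snapshot current counters
perfquery -R <lid> <port>           # -R resets counters after reading
sleep 600; perfquery <lid> <port>   # new SymbolErrors within 10 min = live fault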
Routing / QoS
NCCL Hangs During AllReduce — Credit Starvation
Symptoms
  • NCCL timeout after long idle period mid-training
  • All GPUs stuck waiting on same barrier
  • Switch counters show XmtWait rising on specific ports
Diagnosis & Fix
  • Check SL-to-VL mapping — management traffic sharing VL0 with data causes starvation
  • Enable QoS in opensm.conf: qos TRUE
  • Give SM/SA traffic (SL1) a dedicated VL so data traffic cannot starve it (SMPs themselves always travel on VL15, which is exempt from credit-based flow control)
  • Enable IBCC Enhanced Congestion Control on switches
Partition / Access
Hosts Cannot Communicate — P_Key Mismatch
Symptoms
  • ibping fails between two hosts despite Active links
  • NCCL reports "connection refused" or "transport error"
  • No physical errors — links are up and at correct speed
Diagnosis & Fix
  • Run smpquery pkeys <lid> <port> on both hosts
  • Verify both ports share a common P_Key (must match, including membership bit)
  • Check opensm partition configuration: /etc/opensm/partitions.conf
  • Add both hosts to the same partition; trigger SM sweep to reprogram P_Keys
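
What the partition fix looks like in OpenSM's partition file — a sketch following the standard partitions.conf syntax, with hypothetical port GUIDs:

# /etc/opensm/partitions.conf
Default=0x7fff : ALL=full;
ai_train=0x8001 : 0xb8599f0300fa0001=full, 0xb8599f0300fa0002=full;
# restart or resweep the SM afterwards so the new P_Key tables are programmed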

ibdiagnet — Key Output Fields

ibdiagnet --pc -r

# Topology summary
Discovered 48 nodes, 16 switches, 512 ports
Subnet Manager: MASTER on 0x00000000000b0001 (LID 1)

# Error counter alerts (non-zero = investigate)
-E- PortXmitDiscards: Switch 0x..., Port 12: 145
-E- SymbolError: Switch 0x..., Port 7: 8923

# Link speed summary
Links at expected speed (NDR 400Gb/s): 480/512
Links at degraded speed (HDR 200Gb/s):  32/512  ← investigate

# Partition check
Partition 0x7fff (default): 512 ports member

Port Error Counter Reference

| Counter | Indicates | Action |
|---|---|---|
| SymbolErrorCounter | Physical-layer signal errors — bad cable, connector, or transceiver | Replace the cable; check the SFP/QSFP module |
| PortRcvErrors | Packets received with errors (bad VCRC/ICRC) | Physical-layer fault; check the same cable as SymbolErrors |
| PortXmitDiscards | Packets dropped at egress — buffer overflow / congestion | Enable congestion control; check for routing imbalance |
| LocalLinkIntegrityErrors | Link has recovered from errors but is degraded — near failure | Proactive cable replacement before full failure |
| ExcessiveBufferOverrunErrors | Repeated buffer overruns — severe congestion or misconfigured VLs | Fix the SL-to-VL mapping; enable QoS separation |
| VL15Dropped | Management packets (SM traffic on VL15) were dropped | Critical — the SM cannot manage the fabric; clear congestion on the management path |

    Memory Hooks
    SM
    Subnet Manager Role
    5 phases it runs at startup
    Discover → Assign LIDs → Compute routes → Program LFTs → Activate.
    Without an active SM, no host can communicate. Only one master SM per fabric; all others are standby.
    Routing
    ftree vs Adaptive Routing
    Where each is configured
    ftree: opensm.conf (static, computed at init).
    Adaptive Routing: switch MLNX-OS (dynamic, per-flow, real-time).
    Both should be enabled together in AI cluster fabrics for the best load distribution.
    Congestion
    FECN → BECN → Rate Limit
    The three IB congestion steps
    Forward ECN (switch marks packet) → Backward ECN (receiver notifies sender) → Sender reduces injection rate.
    IB never drops packets for congestion — credit flow control prevents it at the link layer.
    VL15
    VL15 — The Sacred Lane
    Why it must never be congested
    VL15 carries SM management packets (SMPs).
    If VL15 is congested and VL15Dropped > 0, the SM loses visibility into the fabric and routing tables cannot be updated. A persistent VL15Dropped counter is a critical emergency.
    SHARP
    SHARP AllReduce
    What it does + what it requires
    Aggregates AllReduce data inside switches — not on hosts.
    Requires NVIDIA Quantum switches plus a dedicated VL for SHARP traffic. Provides up to 50% AllReduce latency reduction in large GPU clusters.
    Diagnostics
    ibdiagnet vs ibstat
    Scope difference
    ibdiagnet: fabric-wide scan (topology, all port errors, cabling).
    ibstat: local HCA port state (LID, speed, physical state).
    Start with ibdiagnet for fabric health; use ibstat for "is my local NIC up?" checks.
    Error
    SymbolErrorCounter
    What it means + fix
    Physical-layer signal degradation — bad cable, SFP, or connector.
    Any non-zero, rising value means replace the cable first. If errors persist after the cable swap, suspect the HCA or switch port module.
    P_Key
    Partition Key
    What causes P_Key failure
    IB partitioning — hosts must share a P_Key to communicate.
    ibping fails with no physical errors? → P_Key mismatch. Check with smpquery pkeys. Fix by adding both hosts to the same partition in opensm's partitions.conf.

    Ready to Master IB for the NCP-AIN Exam?

    Practice InfiniBand, RoCE, fabric design, DPU architecture, and all NCP-AIN topics with adaptive quizzes and full exam simulations.
