NCP-AIN · Topic 2 · InfiniBand Operations

InfiniBand
Configuration, Optimization
& Troubleshooting

Subnet managers, adaptive routing, congestion control, RDMA tuning, and ibdiagnet diagnostics — every operational layer of the IB fabric.

Four Operational Pillars of InfiniBand
From fabric bring-up through performance tuning and fault isolation — the operational skills NCP-AIN tests.
🧠
Subnet Manager
Fabric discovery, LID assignment, routing table computation (LFTs), OpenSM configuration, master vs standby SM
Key daemon: opensm
Config file: /etc/opensm/opensm.conf
🔀
Routing & QoS
Fat-tree, adaptive routing, SL-to-VL mapping, Virtual Lanes, partition keys (P_Key), and QoS policies
Routing engines: 4+
Max VLs: VL0–VL15 (16 total)
⚡
RDMA Tuning
MTU, queue depth, CQ moderation, GID index, SR-IOV, SHARP for in-network AllReduce, C-state disabling
MTU target: 4096 B
SHARP gain: up to 50% AllReduce speedup
🔬
Diagnostics & Troubleshooting
ibdiagnet, ibstat, ibping, perfquery, smpquery — counter-based fault isolation across link, switch, and fabric layers
Primary tool: ibdiagnet
Port counters: SymbolErrors, RcvErrors…
Why InfiniBand Operations Matter for AI Clusters

A misconfigured SM routing algorithm can reduce AllReduce bandwidth by 30–40%. A single bad cable introducing symbol errors causes retransmissions that spread across the entire NCCL ring. The NCP-AIN exam tests your ability to configure, tune, and diagnose the IB fabric — not just describe its architecture.

IB Fabric Bring-Up Sequence

What happens when OpenSM starts

The SM must complete all five phases before any host-to-host traffic can flow.
🔎
Phase 1 — Discovery (Subnet Sweep)
SM sends SMPs (Subnet Management Packets) to enumerate all nodes, switches, and links. Builds a complete topology graph.
SMP-based
🏷️
Phase 2 — LID Assignment
SM assigns a unique Local Identifier (LID, 1–49151) to every HCA port and switch port. LIDs are the IB routing addresses.
16-bit address
🗺️
Phase 3 — Route Computation
SM runs the selected routing algorithm (fat-tree, MINHOP, DFSSSP, etc.) to build forwarding tables (Linear Forwarding Tables, LFT) for every switch.
LFT per switch
📥
Phase 4 — Table Programming
SM writes all LFTs to every switch using directed-route SMPs. Each switch now knows which output port to use for any destination LID.
Programmed via SMP
✅
Phase 5 — Activate (SM Active)
SM sets port states to ACTIVE. Hosts can now resolve GIDs to LIDs via SA (Subnet Administrator) queries and begin RDMA operations.
Traffic flows
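
One way to watch this sequence complete from a host, using standard infiniband-diags commands (the log path assumes the opensm.conf default shown below):

# Follow the SM through discovery, LID assignment, routing, and activation
tail -f /var/log/opensm.log

# Confirm bring-up finished: a master SM exists and the local port is ACTIVE
sminfo              # prints the master SM's LID, GUID, priority, and state
ibstat mlx5_0       # State: Active / Physical state: LinkUp after Phase 5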
Configuration & Optimization Deep Dive
The exact parameters and mechanisms the exam tests — with recommended values for AI cluster fabrics.

OpenSM Key Configuration Parameters

| Parameter | Default / Recommended | What It Controls | AI Cluster Guidance |
|---|---|---|---|
| routing_engine | minhop → ftree | Routing algorithm used to compute LFTs | Use ftree for fat-tree topologies; dfsssp for irregular fabrics |
| sm_priority | 0 (range 0–15) | Master SM election — higher priority wins | Set the primary SM to 14, standby to 13. Never leave two SMs at the same priority. |
| sweep_interval | 10 seconds | How often the SM re-scans the fabric for changes | Keep at 10 s; reduce to 5 s for unstable fabrics during bring-up only |
| lmc | 0 (LID Mask Control) | Number of LIDs assigned per port (2^lmc) | Set lmc=2 (4 LIDs per port) for multipath routing with MINHOP |
| qos | disabled | Enables SL-to-VL QoS mapping | Enable for fabrics mixing RDMA and management traffic on the same links |
| log_file | /var/log/opensm.log | SM diagnostic log location | Monitor for "SM state change", "port error", and "rerouting" events |
| reassign_lids | 0 | Force LID reassignment on SM restart | Leave at 0 in production — LID changes trigger fabric-wide rerouting |
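
Assembled into one file, a minimal opensm.conf sketch for the primary SM using the recommended values above (parameter names are standard OpenSM options; the values are this table's guidance, not shipped defaults):

# /etc/opensm/opensm.conf — primary SM on a fat-tree AI fabric (sketch)
routing_engine ftree           # match the physical topology
sm_priority 14                 # set the standby SM to 13 — never equal priorities
sweep_interval 10              # seconds between fabric re-scans
lmc 0                          # raise to 2 only for MINHOP multipath
qos TRUE                       # enable SL-to-VL separation
log_file /var/log/opensm.log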

Routing Algorithms — Choose the Right One

Fat-Tree (ftree)
routing_engine ftree
Computes routes that maximize bisection bandwidth in a balanced fat-tree topology. Distributes flows evenly across all uplinks, preventing hot spots.
✅ Use for: Standard two-tier or three-tier fat-tree AI cluster fabrics. The default choice for most NVIDIA DGX SuperPOD deployments.
MINHOP
routing_engine minhop
Computes minimum-hop paths. Simple and fast to compute. Uses LMC to provide multiple equal-cost paths when lmc > 0.
✅ Use for: Simple test fabrics or small clusters. Not recommended for large AI training fabrics — it does not balance load across links as well as ftree.
DFSSSP
routing_engine dfsssp
Deadlock-Free Single-Source Shortest-Path routing. Handles irregular and asymmetric topologies. Supports enhanced QoS and is more flexible than ftree.
✅ Use for: Irregular fabrics, mixed switch generations, or fabrics where some switches have failed and the topology is no longer symmetric.
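
Switching engines is a config edit plus an SM restart. A sketch, assuming a systemd-managed opensm service:

# Change the routing engine, restart the SM, then confirm what actually loaded
sudo sed -i 's/^routing_engine .*/routing_engine ftree/' /etc/opensm/opensm.conf
sudo systemctl restart opensm
grep -i "routing engine" /var/log/opensm.log   # ftree falls back to minhop if the topology isn't a valid fat-tree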
Adaptive Routing — SM-Independent Dynamic Load Balancing

Unlike SM-computed static routing, Adaptive Routing (AR) is enabled per-switch and reroutes individual flows in real time to avoid congested paths. In NVIDIA Quantum switches (QM9700/QM8700) AR is configured via ib ar enable in MLNX-OS. AR requires all switches in the fabric to be AR-capable and is essential for rail-optimized fabrics where bursty AllReduce traffic creates transient hotspots.
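
On the switch CLI that looks roughly like the session below — a sketch built around the ib ar enable command named above plus MLNX-OS's standard mode-entry commands; verify the exact syntax against your MLNX-OS release:

switch > enable
switch # configure terminal
switch (config) # ib ar enable
# persist the change with your release's configuration-save command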

InfiniBand Congestion Control — IBCC Flow

How Enhanced Congestion Notification Works

IB congestion control prevents credit starvation without packet drops — unlike TCP, which relies on loss for backpressure.
🌊
Congestion Point
Switch buffer exceeds threshold on an output port
📡
FECN Mark
Switch marks Forward ECN bit in packet header
📨
BECN Return
Receiver sends Backward ECN notice back to sender
🐢
Sender Rate Limit
Sender throttles injection rate; recovers gradually

Key difference from RoCE: IB uses credit-based flow control at the link layer — switches track buffer availability per VL. FECN/BECN augments this for end-to-end congestion. Packet loss from congestion should be essentially zero in a well-configured IB fabric.
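
This is why IB congestion shows up as waiting, not loss. A quick check with perfquery's extended counters (<lid> and <port> are placeholders for the suspect switch port):

# Extended counters include PortXmitWait — ticks spent stalled waiting for credits
perfquery -x <lid> <port>
# Rising PortXmitWait with zero SymbolErrors = backpressure/congestion, not a bad cable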

Service Levels (SL) and Virtual Lanes (VL)

| SL | Maps to VL | Traffic Type | Typical Use in AI Clusters |
|---|---|---|---|
| SL0 | VL0 | Data | NCCL AllReduce, RDMA Write — bulk GPU-to-GPU collective traffic |
| SL1 | VL1 | Management | SM packets (SMPs), SA queries — must never be blocked by data traffic |
| SL2 | VL2 | Storage | Separate VL for storage traffic (GPFS, Lustre) to isolate it from training traffic |
| SL5 | VL5 | SHARP | SHARP AllReduce aggregation traffic — requires a dedicated VL for correct operation |
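
In opensm.conf this mapping is expressed through the QoS options. A sketch — qos_sl2vl takes 16 comma-separated VL values indexed by SL; the values here mirror the table and are illustrative, not mandated:

qos TRUE
qos_max_vls 8                  # data VLs available for SL mapping
# SL-to-VL map, one entry per SL 0–15 (SL0→VL0, SL1→VL1, SL2→VL2, SL5→VL5, …)
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7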

RDMA Performance Tuning Parameters

MTU
Maximum Transmission Unit
Larger MTU reduces per-message header overhead for bulk AI traffic. IB supports 256 / 512 / 1024 / 2048 / 4096 bytes.
✅ Set MTU=4096 for AI training
tx_depth / rx_depth
Send/Receive Queue Depth
Deeper queues allow more in-flight RDMA operations, hiding latency on long-RTT paths across large fabrics.
✅ Set to 1024 for large clusters
CQ moderation
Completion Queue Interrupt Batching
Batch multiple completions into one interrupt event. Reduces CPU overhead from interrupt storms at high message rates.
✅ Enable; batch 32–64 events
C-states (idle CPU)
CPU Power Management
Deep C-states (C6, C7) add 10–100µs wake-up latency. For RDMA polling loops this causes jitter that disrupts tight AllReduce synchronization.
✅ Disable: cpupower idle-set -D 0
NVIDIA SHARP
In-Network AllReduce
Offloads AllReduce aggregation into IB switches, eliminating host-to-host data round trips. Requires SHARP-capable Quantum switches.
✅ Up to 50% AllReduce speedup
GID index
RoCE / IB GID Selection
Each HCA port has multiple Global Identifiers. GID index 0 is the link-local address; higher indices are RoCE v2 IP-based GIDs. NCCL must be configured with the correct GID index.
✅ NCCL_IB_GID_INDEX=3 (RoCE v2)
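
A quick way to validate the MTU and queue-depth choices before a full NCCL run, assuming the perftest package is installed (-m sets MTU, -t sets tx-depth; the hostname is hypothetical):

# Server side:
ib_write_bw -d mlx5_0 -m 4096 -t 1024
# Client side:
ib_write_bw -d mlx5_0 -m 4096 -t 1024 dgx-node-02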

Key Commands — Configuration & Tuning

# Check HCA port state and speed
ibstat mlx5_0
  State: Active   Physical state: LinkUp   Rate: 400

# Set fabric MTU to 4096 via the partition definition (mtu code 5 = 4096 B)
/etc/opensm/partitions.conf:  Default=0x7fff, mtu=5 : ALL=full;

# Disable CPU C-states to reduce latency jitter
cpupower idle-set -D 0

# Check the active SM and its GUID
sminfo

# Verify partition keys on a port
smpquery pkeys <lid> <port>

# Set NCCL to use IB with RoCE v2 GID index 3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
Key Concepts Compared
Side-by-side breakdowns to sharpen exam accuracy on easily confused concepts.

Routing Algorithms Compared

| Property | Fat-Tree (ftree) | MINHOP | DFSSSP | Adaptive Routing |
|---|---|---|---|---|
| Configured in | opensm.conf | opensm.conf | opensm.conf | Switch (MLNX-OS) |
| Route type | Static (at init) | Static (at init) | Static (at init) | Dynamic (per-flow) |
| Best topology | Balanced fat-tree | Any simple fabric | Irregular/asymmetric | Any (complements static) |
| Load balancing | ✅ Excellent (ECMP) | ⚠️ Moderate | ✅ Good | ✅ Best (real-time) |
| Requires SHARP? | No | No | No | No (independent) |
| AI cluster default | ✅ Yes | ❌ No | Fallback | ✅ Strongly recommended |

SM Master vs Standby vs Rogue SM

| State | Master SM | Standby SM | Rogue SM |
|---|---|---|---|
| Count per fabric | Exactly 1 | 0 or more | 0 (unwanted) |
| Programs LFTs? | ✅ Yes | ❌ No (monitors only) | ⚠️ Yes — causes chaos |
| Elected by | Highest sm_priority wins | Lower priority defers | Unconfigured SM on a host |
| Failover | n/a | Promotes to master if the master fails | May become master unexpectedly |
| Detection | opensm.log "Master" | opensm.log "Standby" | ibdiagnet reports multiple SMs |
| Fix if rogue | n/a | n/a | Disable opensm on the rogue host immediately |

Diagnostic Commands Compared

| Command | Scope | What It Shows | When to Use |
|---|---|---|---|
| ibdiagnet | Fabric-wide | Topology, errors on all ports, cabling, partition issues | First step — comprehensive fabric health scan |
| ibstat | Local HCA port | Port state, LID, SM LID, link speed, physical state | Verify the HCA is up and at the correct speed before an NCCL run |
| perfquery | Single port | Error counters: SymbolErrors, RcvErrors, XmtDiscards | Deep-dive on a specific suspect port/switch |
| ibping | End-to-end | Round-trip latency to a destination LID or GID | Test connectivity and measure baseline latency |
| smpquery | Fabric element | Raw SMP attributes: LFT entries, port info, P_Keys, SM info | Verify SM configuration was applied correctly |
| iblinkinfo | Fabric-wide | All link speeds and widths across the fabric | Find links running at the wrong speed (e.g., HDR instead of NDR) |

SHARP vs Standard AllReduce

| Property | Standard IB AllReduce | SHARP In-Network AllReduce |
|---|---|---|
| Where aggregation runs | On host CPUs (via NCCL) | Inside IB switches (dedicated tree) |
| Data traversal | All data goes host → switch → host multiple times | Data aggregated at each switch level; only the result returns to hosts |
| Bandwidth requirement | O(N) — grows with node count | O(1) — switch-local aggregation |
| Latency reduction | Baseline | Up to 50% for large AllReduce |
| GPU utilization impact | GPUs idle during AllReduce | GPUs partially freed during in-switch aggregation |
| Requirements | Any IB switch | NVIDIA Quantum switches (QM8700/QM9700) with a dedicated SHARP VL |
| NCCL config | Default | SHARP (collnet) plugin, e.g. NCCL_COLLNET_ENABLE=1 |
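
On the NCCL side, SHARP is reached through the collnet plugin — a sketch assuming the nccl-rdma-sharp-plugins package and the SHARP Aggregation Manager are already deployed on the fabric:

# Allow NCCL to offload supported collectives to the SHARP (collnet) plugin
export NCCL_COLLNET_ENABLE=1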
Common InfiniBand Issues & Fixes
Real failure scenarios with symptoms, diagnostic commands, and resolution steps.
Troubleshooting Workflow — Always Follow This Order

1. Run ibstat on all local HCA ports — confirm Active/LinkUp at the correct speed.
2. Run ibdiagnet fabric-wide — look for error counters and topology mismatches.
3. Check opensm.log for rerouting events, SM state changes, or "port error" entries.
4. Use perfquery on suspect switch ports — rising SymbolErrors point to the physical layer.
5. Use ibping to confirm end-to-end reachability before running NCCL.
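
The same order as a minimal shell pass (a sketch; <lid>, <port>, and <dest-lid> are placeholders for the suspect fabric elements):

for hca in mlx5_0 mlx5_1; do ibstat "$hca" | grep -E "State|Rate"; done   # 1. local ports
ibdiagnet                                                                 # 2. fabric-wide scan
grep -E "SM state change|port error|rerout" /var/log/opensm.log | tail    # 3. SM log events
perfquery <lid> <port>                                                    # 4. suspect port counters
ibping -c 3 <dest-lid>                                                    # 5. end-to-end reachability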

Physical Layer
Link Running at Wrong Speed (e.g., HDR instead of NDR)
Symptoms
  • NCCL bandwidth lower than expected
  • iblinkinfo shows 2× instead of 4× width
  • One rail has half the throughput of others
Diagnosis & Fix
  • Run iblinkinfo to list all link speeds
  • Check cable: mixed NDR/HDR cables may negotiate to lower speed
  • Inspect SFP/QSFP transceiver — wrong cable type
  • Replace cable; verify with ibstat Rate field = 400
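
A hedged one-liner for the first step — the grep pattern depends on how your iblinkinfo build prints speeds, so adjust it to your fabric's output:

# Show only links NOT reporting 4X width at NDR — candidates for bad or mixed cables
iblinkinfo | grep -v "4X.*NDR"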
SM Configuration
Rogue Subnet Manager Causing Fabric Instability
Symptoms
  • Fabric reroutes randomly, disrupting NCCL jobs
  • opensm.log shows "SM state change" frequently
  • ibdiagnet reports two or more active SMs
Diagnosis & Fix
  • Run ibdiagnet — look for "Multiple SM" warning
  • Run sminfo to find the master SM's GUID and LID
  • Identify rogue host via GUID, disable opensm there
  • Set sm_priority consistently across all legitimate SMs
Error Counters
Rising SymbolErrors — Physical Link Degradation
Symptoms
  • perfquery shows non-zero SymbolErrorCounter
  • Retransmissions increase — ibdiagnet flags port
  • Intermittent NCCL hangs or reduced bandwidth
Diagnosis & Fix
  • Run perfquery <lid> <port> and reset counters to track rate
  • Identify the switch port and physical cable segment
  • Reseat or replace the copper/fiber cable
  • If errors persist after cable swap, replace HCA or switch port module
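
The rate matters more than the absolute count, since counters persist from past events. A sketch using perfquery's reset flag:

perfquery <lid> <port>              # snapshot current counters
perfquery -R <lid> <port>           # -R resets counters after reading
sleep 600; perfquery <lid> <port>   # new SymbolErrors within 10 min = live fault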
Routing / QoS
NCCL Hangs During AllReduce — Credit Starvation
Symptoms
  • NCCL timeout after long idle period mid-training
  • All GPUs stuck waiting on same barrier
  • Switch counters show XmtWait rising on specific ports
Diagnosis & Fix
  • Check SL-to-VL mapping — management traffic sharing VL0 with data causes starvation
  • Enable QoS in opensm.conf: qos TRUE
  • Give SM/SA traffic (SL1) a dedicated VL so data traffic cannot starve it (SMPs themselves always travel on VL15, which is exempt from credit-based flow control)
  • Enable IBCC Enhanced Congestion Control on switches
Partition / Access
Hosts Cannot Communicate — P_Key Mismatch
Symptoms
  • ibping fails between two hosts despite Active links
  • NCCL reports "connection refused" or "transport error"
  • No physical errors — links are up and at correct speed
Diagnosis & Fix
  • Run smpquery pkeys <lid> <port> on both hosts
  • Verify both ports share a common P_Key (must match, including membership bit)
  • Check opensm partition configuration: /etc/opensm/partitions.conf
  • Add both hosts to the same partition; trigger SM sweep to reprogram P_Keys
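
What the partition fix looks like in OpenSM's partition file — a sketch following the standard partitions.conf syntax, with hypothetical port GUIDs:

# /etc/opensm/partitions.conf
Default=0x7fff : ALL=full;
ai_train=0x8001 : 0xb8599f0300fa0001=full, 0xb8599f0300fa0002=full;
# restart or resweep the SM afterwards so the new P_Key tables are programmed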

ibdiagnet — Key Output Fields

ibdiagnet --pc -r

# Topology summary
Discovered 48 nodes, 16 switches, 512 ports
Subnet Manager: MASTER on 0x00000000000b0001 (LID 1)

# Error counter alerts (non-zero = investigate)
-E- PortXmitDiscards: Switch 0x..., Port 12: 145
-E- SymbolError: Switch 0x..., Port 7: 8923

# Link speed summary
Links at expected speed (NDR 400Gb/s): 480/512
Links at degraded speed (HDR 200Gb/s):  32/512  ← investigate

# Partition check
Partition 0x7fff (default): 512 ports member

Port Error Counter Reference

| Counter | Indicates | Action |
|---|---|---|
| SymbolErrorCounter | Physical-layer signal errors — bad cable, connector, or transceiver | Replace the cable; check the SFP/QSFP module |
| PortRcvErrors | Packets received with errors (bad VCRC/ICRC) | Physical-layer fault; check the same cable as SymbolErrors |
| PortXmitDiscards | Packets dropped at egress — buffer overflow / congestion | Enable congestion control; check for routing imbalance |
| LocalLinkIntegrityErrors | Link has recovered from errors but is degraded — near failure | Proactive cable replacement before full failure |
| ExcessiveBufferOverrunErrors | Repeated buffer overruns — severe congestion or misconfigured VLs | Fix the SL-to-VL mapping; enable QoS separation |
| VL15Dropped | Management packets (SM traffic on VL15) were dropped | Critical — the SM cannot manage the fabric; clear congestion on the management path |

    Memory Hooks
    SM
    Subnet Manager Role
    5 phases it runs at startup
    Discover → Assign LIDs → Compute routes → Program LFTs → Activate.
    Without an active SM, no host can communicate. Only one master SM per fabric; all others are standby.
    Routing
    ftree vs Adaptive Routing
    Where each is configured
    ftree: opensm.conf (static, computed at init).
    Adaptive Routing: switch MLNX-OS (dynamic, per-flow, real-time).
    Both should be enabled together in AI cluster fabrics for the best load distribution.
    Congestion
    FECN → BECN → Rate Limit
    The three IB congestion steps
    Forward ECN (switch marks packet) → Backward ECN (receiver notifies sender) → Sender reduces injection rate.
    IB never drops packets for congestion — credit flow control prevents it at the link layer.
    VL15
    VL15 — The Sacred Lane
    Why it must never be congested
    VL15 carries SM management packets (SMPs).
    If VL15 is congested and VL15Dropped > 0, the SM loses visibility into the fabric and routing tables cannot be updated. A persistent VL15Dropped counter is a critical emergency.
    SHARP
    SHARP AllReduce
    What it does + what it requires
    Aggregates AllReduce data inside switches — not on hosts.
    Requires NVIDIA Quantum switches plus a dedicated VL for SHARP traffic. Provides up to 50% AllReduce latency reduction in large GPU clusters.
    Diagnostics
    ibdiagnet vs ibstat
    Scope difference
    ibdiagnet: fabric-wide scan (topology, all port errors, cabling).
    ibstat: local HCA port state (LID, speed, physical state).
    Start with ibdiagnet for fabric health; use ibstat for "is my local NIC up?" checks.
    Error
    SymbolErrorCounter
    What it means + fix
    Physical-layer signal degradation — bad cable, SFP, or connector.
    Any non-zero, rising value means replace the cable first. If errors persist after the cable swap, suspect the HCA or switch port module.
    P_Key
    Partition Key
    What causes P_Key failure
    IB partitioning — hosts must share a P_Key to communicate.
    ibping fails with no physical errors? → P_Key mismatch. Check with smpquery pkeys. Fix by adding both hosts to the same partition in opensm's partitions.conf.

    Ready to Master IB for the NCP-AIN Exam?

    Practice InfiniBand, RoCE, fabric design, DPU architecture, and all NCP-AIN topics with adaptive quizzes and full exam simulations.
