Subnet managers, adaptive routing, congestion control, RDMA tuning, and ibdiagnet diagnostics — every operational layer of the IB fabric.
A misconfigured SM routing algorithm can reduce AllReduce bandwidth by 30–40%. A single bad cable introducing symbol errors causes retransmissions that spread across the entire NCCL ring. The NCP-AIN exam tests your ability to configure, tune, and diagnose the IB fabric — not just describe its architecture.
| Parameter | Default / Recommended | What It Controls | AI Cluster Guidance |
|---|---|---|---|
| routing_engine | minhop → ftree | Routing algorithm used to compute LFTs | Use ftree for fat-tree topologies; dfsssp for irregular fabrics |
| sm_priority | 0 (range 0–15) | Master SM election — higher priority wins | Set primary SM to 14, standby to 13. Never leave two SMs at same priority. |
| sweep_interval | 10 seconds | How often SM re-scans fabric for changes | Keep at 10s; reduce to 5s for unstable fabrics during bring-up only |
| lmc | 0 (LID Mask Control) | Number of LIDs assigned per port (2^lmc) | Set lmc=2 (4 LIDs per port) for multipath routing with MINHOP |
| qos | disabled | Enables SL-to-VL QoS mapping | Enable for fabrics mixing RDMA and management traffic on same links |
| log_file | /var/log/opensm.log | SM diagnostic log location | Monitor for "SM state change", "port error", and "rerouting" events |
| reassign_lids | 0 | Force LID reassignment on SM restart | Leave 0 in production — LID changes trigger fabric-wide rerouting |
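Taken together, those recommendations map onto a short opensm.conf excerpt. This is a minimal sketch of a primary SM's config, not a complete file — the parameter names are standard opensm options, and the priority values follow the table above:

```
# /etc/opensm/opensm.conf — excerpt for the primary SM
routing_engine ftree         # fat-tree topology assumed
sm_priority 14               # standby SM gets 13
sweep_interval 10            # seconds between fabric re-scans
lmc 0                        # raise to 2 only for MINHOP multipath
qos TRUE                     # enable SL-to-VL QoS mapping
log_file /var/log/opensm.log
reassign_lids 0              # never force LID churn in production
```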
Unlike SM-computed static routing, Adaptive Routing (AR) is enabled per switch and reroutes individual flows in real time to steer around congested paths. On NVIDIA Quantum switches (QM9700/QM8700), AR is configured with ib ar enable in MLNX-OS. AR requires every switch in the fabric to be AR-capable, and it is essential for rail-optimized fabrics where bursty AllReduce traffic creates transient hotspots.
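On the switch side the sequence is short. The sketch below assumes the ib ar enable command form named above; prompts and any verification commands vary by MLNX-OS release:

```
# MLNX-OS CLI, repeated on every switch — AR must be fabric-wide
switch > enable
switch # configure terminal
switch (config) # ib ar enable
```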
Key difference from RoCE: IB uses credit-based flow control at the link layer — switches track buffer availability per VL — and FECN/BECN marking augments this with end-to-end congestion control. In a well-configured IB fabric, packet loss from congestion should be essentially zero.
| SL | Maps to VL | Traffic Type | Typical Use in AI Clusters |
|---|---|---|---|
| SL0 | VL0 | Data | NCCL AllReduce, RDMA Write — bulk GPU-to-GPU collective traffic |
| SL1 | VL1 | Management | SA queries and other management datagrams — must never be blocked by data traffic (SMPs themselves travel on dedicated VL15) |
| SL2 | VL2 | Storage | Separate VL for storage traffic (GPFS, Lustre) to isolate it from training traffic |
| SL5 | VL5 | SHARP | SHARP AllReduce aggregation traffic — requires dedicated VL for correct operation |
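With qos TRUE set, the SL-to-VL table above can be expressed directly in opensm.conf via qos_sl2vl, which takes one VL entry per SL (0–15). A sketch using an identity mapping for the SLs in the table:

```
# /etc/opensm/opensm.conf — QoS excerpt
qos TRUE
qos_max_vls 8
# SL0→VL0 (data), SL1→VL1 (mgmt), SL2→VL2 (storage), SL5→VL5 (SHARP); identity map
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
```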
```
# Check HCA port state and speed
ibstat mlx5_0
    State: Active
    Physical state: LinkUp
    Rate: 400

# Set MTU to 4096 (IB datagram) on all ports
opensm.conf: subnet_timeout 18

# Disable CPU C-states to reduce latency jitter
cpupower idle-set -D 0

# Check active SM and its GUID
smpquery nodeinfo 0    # query SM via lid 0

# Verify partition keys on a port
smpquery pkeys <lid> <port>

# Set NCCL to use IB with RoCE v2 GID index 3
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3
```
| Property | Fat-Tree (ftree) | MINHOP | DFSSSP | Adaptive Routing |
|---|---|---|---|---|
| Configured in | opensm.conf | opensm.conf | opensm.conf | Switch (MLNX-OS) |
| Route type | Static (at init) | Static (at init) | Static (at init) | Dynamic (per-flow) |
| Best topology | Balanced fat-tree | Any simple fabric | Irregular/asymmetric | Any (complements static) |
| Load balancing | ✅ Excellent (ECMP) | ⚠️ Moderate | ✅ Good | ✅ Best (real-time) |
| Requires SHARP? | No | No | No | No (independent) |
| AI cluster default | ✅ Yes | ❌ No | Fallback | ✅ Strongly recommended |
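One practical detail: opensm's routing_engine accepts an ordered, comma-separated list and falls back down the list if an engine cannot route the topology, so the ftree-first recommendation can be written defensively:

```
# /etc/opensm/opensm.conf
# Prefer ftree; fall back if the topology isn't a routable fat-tree
routing_engine ftree,updn,minhop
```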
| State | Master SM | Standby SM | Rogue SM |
|---|---|---|---|
| Count per fabric | Exactly 1 | 0 or more | 0 (unwanted) |
| Programs LFTs? | ✅ Yes | ❌ No (monitors only) | ⚠️ Yes — causes chaos |
| Elected by | Highest sm_priority wins | Lower priority defers | Unconfigured SM on a host |
| Failover | — | Promotes to master if master fails | May become master unexpectedly |
| Detection | opensm.log "Master" | opensm.log "Standby" | ibdiagnet reports multiple SMs |
| Fix if rogue | — | — | Disable opensm on rogue host immediately |
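To confirm the election outcome from any host, the infiniband-diags sminfo command reports the master's LID, GUID, priority, and state — the sample output line below is approximate:

```
# Who is master right now?
sminfo
# sminfo: sm lid 1 sm guid 0x..., activity count 12345 priority 14 state 3 SMINFO_MASTER

# Cross-check against the local SM's own view
grep -iE "master|standby" /var/log/opensm.log | tail -5
```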
| Command | Scope | What It Shows | When to Use |
|---|---|---|---|
| ibdiagnet | Fabric-wide | Topology, errors on all ports, cabling, partition issues | First step — comprehensive fabric health scan |
| ibstat | Local HCA port | Port state, LID, SM LID, link speed, physical state | Verify HCA is up and at correct speed before NCCL run |
| perfquery | Single port | Error counters: SymbolErrors, RcvErrors, XmtDiscards | Deep-dive on a specific suspect port/switch |
| ibping | End-to-end | Round-trip latency to a destination LID or GID | Test connectivity and measure baseline latency |
| smpquery | Fabric element | Raw SMP attributes: LFT entries, port info, P_Keys, SM info | Verify SM configuration was applied correctly |
| iblinkinfo | Fabric-wide | All link speeds and widths across the fabric | Find links running at wrong speed (e.g., HDR instead of NDR) |
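Since iblinkinfo prints one line per link, degraded links can be surfaced with a negative grep. The width/speed strings vary by infiniband-diags version, so treat the pattern as a heuristic, not a fixed API:

```
# Flag any link line that does not report full 4X width (adjust pattern to your output)
iblinkinfo | grep -v "4X"
```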
| Property | Standard IB AllReduce | SHARP In-Network AllReduce |
|---|---|---|
| Where aggregation runs | On the endpoint GPUs (via NCCL) | Inside IB switches (dedicated aggregation tree) |
| Data traversal | All data goes host → switch → host multiple times | Data aggregated at each switch level, only result returns to hosts |
| Bandwidth requirement | O(N) — grows with node count | O(1) — switch-local aggregation |
| Latency reduction | Baseline | Up to 50% for large AllReduce |
| GPU utilization impact | GPUs idle during AllReduce | GPUs partially freed during in-switch aggregation |
| Requirements | Any IB switch | NVIDIA Quantum switches (QM8700/QM9700), SHARP VL dedicated |
| NCCL config | Default | Enabled via the NCCL SHARP (CollNet) plugin |
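From the NCCL side, SHARP is reached through the nccl-rdma-sharp plugin's CollNet support. The variables below are the commonly documented knobs — verify against your plugin and SHARP library versions:

```
# Allow NCCL's CollNet algorithm, which the SHARP plugin implements
export NCCL_COLLNET_ENABLE=1
# Streaming Aggregation Tree for large-message AllReduce (SHARP library option)
export SHARP_COLL_ENABLE_SAT=1
```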
1. Run ibstat on all local HCA ports — confirm Active/LinkUp at the correct speed.
2. Run ibdiagnet fabric-wide — look for error counters and topology mismatches.
3. Check opensm.log for rerouting events, SM state changes, or "port error" entries.
4. Use perfquery on suspect switch ports — rising SymbolErrors point to the physical layer.
5. Use ibping to confirm end-to-end reachability before running NCCL.
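A first-pass triage script that strings the five steps together might look like the following sketch — the suspect LID/port and destination LID are placeholders you would fill in from the ibdiagnet output:

```
#!/usr/bin/env bash
# fabric-triage.sh — first-pass IB health check (placeholders: LID, PORT, DEST_LID)
set -u

echo "== 1. Local HCA state =="
ibstat | grep -E "State:|Physical state:|Rate:"

echo "== 2. Fabric-wide scan =="
ibdiagnet > /tmp/ibdiagnet.out 2>&1
grep -E "^-E-" /tmp/ibdiagnet.out || echo "no -E- errors reported"

echo "== 3. SM events =="
grep -iE "SM state change|port error|rerouting" /var/log/opensm.log | tail -20

echo "== 4. Counters on suspect port (uncomment and set LID/PORT) =="
# perfquery "$LID" "$PORT"

echo "== 5. End-to-end reachability (uncomment; needs ibping -S on the peer) =="
# ibping -c 3 "$DEST_LID"
```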
| Symptom | Diagnose | Fix / Verify |
|---|---|---|
| Link trained at 2× instead of 4× width | iblinkinfo to list all link speeds and widths | Reseat or replace the cable; confirm the ibstat Rate field = 400 afterwards |
| Multiple SMs active in the fabric | ibdiagnet — look for the "Multiple SM" warning; smpquery sminfo 0 to find each SM GUID | Set sm_priority consistently across all legitimate SMs; disable any rogue opensm |
| Physical-layer errors on a port | perfquery shows a non-zero SymbolErrorCounter | Re-run perfquery <lid> <port> and reset counters to track the error rate; replace the cable if it keeps climbing |
| Egress discards under mixed traffic | PortXmitDiscards rising (see counter table below) | Set qos TRUE in opensm.conf so management, storage, and data traffic ride separate VLs |
| ibping fails between two hosts despite Active links | smpquery pkeys <lid> <port> on both hosts | Add both hosts to the same partition in /etc/opensm/partitions.conf |

A sample ibdiagnet run against such a fabric:

```
ibdiagnet --pc -r

# Topology summary
Discovered 48 nodes, 16 switches, 512 ports
Subnet Manager: MASTER on 0x00000000000b0001 (LID 1)

# Error counter alerts (non-zero = investigate)
-E- PortXmitDiscards: Switch 0x..., Port 12: 145
-E- SymbolError: Switch 0x..., Port 7: 8923

# Link speed summary
Links at expected speed (NDR 400Gb/s): 480/512
Links at degraded speed (HDR 200Gb/s): 32/512  ← investigate

# Partition check
Partition 0x7fff (default): 512 ports member
```
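To turn a raw SymbolErrorCounter reading into a rate, read-and-reset the counter and re-read it after a fixed interval — a minimal sketch, with LID 4 and port 7 as placeholder values:

```
# Read then reset counters on the suspect port (-R resets after read)
perfquery -R 4 7
sleep 60
# Whatever shows now accumulated in the last 60 seconds
perfquery 4 7 | grep -E "SymbolErrorCounter|PortRcvErrors|PortXmitDiscards"
```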
| Counter | Indicates | Action |
|---|---|---|
| SymbolErrorCounter | Physical layer signal errors — bad cable, connector, or transceiver | Replace cable; check SFP/QSFP module |
| PortRcvErrors | Packets received with errors (bad VCRC/ICRC) | Physical layer fault; check same cable as SymbolErrors |
| PortXmitDiscards | Packets dropped at egress — buffer overflow / congestion | Enable congestion control; check for routing imbalance |
| LocalLinkIntegrityErrors | Link has recovered from errors but degraded — near failure | Proactive cable replacement before full failure |
| ExcessiveBufferOverrunErrors | Repeated buffer overruns — severe congestion or misconfigured VLs | Fix SL-to-VL mapping; enable QoS separation |
| VL15Dropped | Management packets (SM traffic on VL15) were dropped | Critical — SM cannot manage fabric; clear congestion on management path |
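Rather than walking ports one at a time with perfquery, infiniband-diags' ibqueryerrors sweeps the whole fabric and prints only ports whose error counters exceed threshold — flags vary slightly by version:

```
# Fabric-wide sweep for non-zero error counters
ibqueryerrors
# Optionally clear error counters after reading, to baseline the next sweep
ibqueryerrors -k
```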
If two hosts cannot communicate despite Active links, compare the P_Key tables on both with smpquery pkeys. Fix by adding both hosts to the same partition in opensm's partitions.conf.
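For reference, a minimal partitions.conf that keeps every port in the default partition looks like this — the ipoib flag and defmember setting are typical choices, not requirements:

```
# /etc/opensm/partitions.conf
Default=0x7fff, ipoib, defmember=full : ALL, ALL_SWITCHES=full;
```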