SmartNIC architecture, operating modes, DOCA programming model, ASAP² OVS offload, inline IPsec, GPUNetIO, and Morpheus AI security — the full DPU stack.
A Data Processing Unit combines a high-performance network interface (ConnectX-7 at 400GbE) with general-purpose Arm compute cores and dedicated hardware accelerators for crypto, compression, and regex — all on one PCIe card. The key insight: run the infrastructure software (firewalls, OVS, storage, encryption) on the DPU's Arm cores, freeing the host CPU exclusively for application workloads.
Replace the host CPU's software networking stack with hardware acceleration on the DPU.
Run security infrastructure isolated from tenant workloads — tenants cannot tamper with it.
NVMe-over-Fabrics target/initiator and data services accelerated in DPU hardware.
Application logic — web servers, databases, AI inference, business logic. Should spend zero cycles on infrastructure plumbing.
Massively parallel compute — AI training, inference, HPC simulations, rendering. Needs maximum memory bandwidth for tensor ops.
Infrastructure services — networking, security, storage, telemetry. Isolated from tenants. Programmable via DOCA SDK.
| BlueField-3 Specs | Value |
|---|---|
| Arm cores | 16× Arm A78 (Armv8.2+) |
| Network speed | 400GbE (2 × 200GbE) |
| NIC controller | ConnectX-7 |
| PCIe generation | Gen5 ×16 |
| Local DRAM | Up to 32 GB DDR5 |
| Crypto throughput | 400 Gbps inline IPsec |
| Compression | Hardware LZ4/Deflate |
| Management port | Out-of-band 1GbE |
| OS | BF-OS (Yocto Linux on Arm) |
| SDK | DOCA 2.x |
Scalable Functions (SFs) are a lightweight alternative to SR-IOV Virtual Functions (VFs) — they don't require PCIe function enumeration, making them faster to create and destroy for containerized workloads (Kubernetes pods).
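A minimal sketch of creating an SF on the DPU Arm side, assuming the mlxdevm tool shipped with MLNX-OFED/DOCA; the PCI address, SF number, port index, and MAC below are placeholders:

```bash
# Add SF number 4 on physical function 0 (PCI address is an example)
/opt/mellanox/iproute2/sbin/mlxdevm port add pci/0000:03:00.0 flavour pcisf pfnum 0 sfnum 4

# The add command prints a new port index (shown here as 229409);
# give the SF a MAC address and activate it
/opt/mellanox/iproute2/sbin/mlxdevm port function set pci/0000:03:00.0/229409 \
    hw_addr 02:00:00:00:04:00 state active
```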
BlueField acts as a standard high-performance NIC. Arm cores are idle or running minimal firmware. Host sees a ConnectX-7 device — full 400GbE bandwidth available. DOCA not active.
Arm cores run DOCA applications that share the NIC with the host. Host and Arm both see the network interface. Good for in-line telemetry and monitoring without full isolation.
The primary DPU deployment mode. Host OS and DPU Arm OS are fully isolated — they cannot access each other's memory or management plane. The Arm runs a complete Yocto Linux with DOCA services. Host sees only the VFs/SFs the DPU grants it.
Host is treated as an untrusted tenant. DPU Arm controls all network policy. Even if the host OS is compromised, the attacker cannot modify network policy or access other tenants' traffic. Strongest security isolation.
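The operating mode is controlled by the INTERNAL_CPU_MODEL firmware parameter and is queried or changed with mlxconfig. A sketch, assuming the mst device path used in the checklist at the end of this page; a firmware reset or power cycle is required before a mode change takes effect:

```bash
mst start   # expose the firmware configuration device under /dev/mst

# Query the current mode: EMBEDDED_CPU(1) means embedded/separated (DPU) mode
mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep INTERNAL_CPU_MODEL

# Switch to embedded (DPU) mode; NIC mode involves additional parameters on newer firmware
mlxconfig -d /dev/mst/mt41686_pciconf0 s INTERNAL_CPU_MODEL=1
```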
How many Arm cores in BlueField-3 vs BlueField-2?
BF-3: 16× Arm A78
BF-2: 8× Arm A72
BF-3 doubles the core count, upgrades the microarchitecture (A78 vs A72), moves to PCIe Gen5, and swaps ConnectX-6 Dx for ConnectX-7.
What PCIe generation does BlueField-3 use?
PCIe Gen5 ×16
BlueField-2 used PCIe Gen4. Gen5 doubles bandwidth to ~64 GB/s — critical for GPUDirect RDMA and DOCA GPUNetIO workloads where DMA bandwidth is the bottleneck.
In separated (DPU) mode, what security property holds?
Host OS ↔ DPU Arm OS: fully isolated
A compromised host cannot reach DPU management or other tenants' traffic. The DPU is the trusted enforcement point — even if the host is hostile.
What is tmfifo on BlueField?
tmfifo = Terminal / Management FIFO
A virtual serial console channel between the host CPU and the DPU Arm OS over PCIe. Used for initial provisioning, console access, and rshim communication when no OOB network is available.
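For example, with the rshim driver loaded on the host, the tmfifo console and boot channel appear under /dev/rshim0/ (the device index and image file name are placeholders):

```bash
# rshim exposes console, boot, and misc nodes for the first BlueField
ls /dev/rshim0/

# Attach to the Arm serial console over PCIe (no OOB network needed)
screen /dev/rshim0/console 115200

# The same channel is used to push a BlueField OS image (BFB)
cat bf-bundle.bfb > /dev/rshim0/boot
```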
DOCA (Data Center Infrastructure on a Chip Architecture) is NVIDIA's SDK for programming BlueField DPUs. It provides C APIs that abstract hardware accelerators — crypto, compress, regex, DMA, RDMA, packet processing — so developers don't need to write low-level register code. DOCA applications run on the DPU's Arm cores (or from the host for some operations).
Network: Pipe-based match-action packet processing pipeline. The core of OVS offload, firewall, and load balancer implementations.
Security: Deep Packet Inspection — L7 application identification and pattern matching using hardware Regex engine.
Security: Inline IPsec encryption/decryption. AES-GCM at up to 400 Gbps — full line rate on BF-3 with zero host CPU.
Security: Stateful connection tracking + ACL enforcement. Offloads iptables/nftables to DPU hardware match-action tables.
Storage: LZ4 and Deflate lossless compression/decompression in hardware. Used for storage data services and network payload reduction.
AI / GPU: Receive network packets directly into GPU memory via DMA — bypassing host CPU and enabling GPU-native packet processing.
Network: RDMA operations (read, write, send/recv) initiated from DPU Arm cores. Used for distributed storage and in-network compute.
Security: Hardware-accelerated regular expression matching. Powers DPI pattern libraries and IDS/IPS signature matching at line rate.
Observability: Collect and stream per-flow, per-port, and system-level metrics from DPU to monitoring systems (Prometheus, Grafana, gRPC).
Represents a hardware engine instance (e.g., one Compress engine, one IPsec SA). Lifecycle: create → connect to progress engine → start → use → stop → destroy.
Drives asynchronous task completion. Poll-based (like DPDK) — call doca_pe_progress() in your event loop to harvest completed tasks without blocking.
Registers a memory region with the DPU for DMA access. Must register any host or GPU buffer before the DPU DMA engine can read/write it.
Represents a slice of a registered memory region. Allocated from a doca_buf_inventory. Passed to tasks as source/destination for DMA and crypto operations.
A unit of work submitted to a ctx — e.g., encrypt this buffer, compress this block. Submitted asynchronously; completion fires a callback via the progress engine.
A hardware match-action pipeline stage in DOCA Flow. Defines match fields (5-tuple, VLAN, metadata) and default actions (forward, drop, modify, meter).
Packet arrives on p0/p1 or from host VF. DMA into DPU buffer.
5-tuple, VLAN, GRE key, metadata match in TCAM/exact tables.
Modify headers, encap/decap VXLAN, set metadata, meter (rate limit).
Chain to next pipe — e.g., ACL → NAT → firewall → counter.
Per-flow byte/packet counters. Exported via DOCA Telemetry.
Egress to port p0/p1, VF, SF, or drop. RSS for multi-queue.
```c
// 1. Create a pipe with 5-tuple match + VXLAN encap action
struct doca_flow_pipe_cfg pipe_cfg = {
    .name = "vxlan_encap",
    .port = doca_port,                        // port opened via doca_flow_port_start()
    .match = {
        .outer.ip4.dst_ip = 0xFFFFFFFF,       // match exact dst IP
        .outer.tcp.dst_port = 0xFFFF,         // match exact dst port
    },
    .actions = {
        .encap_type = DOCA_FLOW_ENCAP_VXLAN,
        .encap.tun.vxlan.tun_id = vni,
    },
    .fwd = { .type = DOCA_FLOW_FWD_PORT, .port_id = 0 },
};

// 2. Create pipe
struct doca_flow_pipe *pipe;
doca_flow_pipe_create(&pipe_cfg, NULL, NULL, &pipe);

// 3. Add an entry (specific flow)
struct doca_flow_pipe_entry *entry;
struct doca_flow_match entry_match = {
    .outer.ip4.dst_ip = htonl(0x0a000001),    // 10.0.0.1
    .outer.tcp.dst_port = htons(8080),
};
doca_flow_pipe_add_entry(0, pipe, &entry_match, NULL, NULL, NULL, 0, NULL, &entry);

// 4. Progress engine drives async completions
while (running) {
    doca_pe_progress(pe);                     // harvest callbacks, no blocking
}
```
DPDK gives user-space PMD access to the ConnectX NIC for packet processing. DOCA sits on top — it provides higher-level APIs for the DPU's hardware accelerators (crypto, compress, regex, RDMA) and abstracts the async task model. DOCA Flow in particular replaces hand-written DPDK rte_flow rules with a more portable pipe concept.
ASAP² offloads the Open vSwitch (OVS) forwarding datapath from the host CPU to the BlueField eSwitch (embedded switch in ConnectX-7). The result: near-zero host CPU for packet forwarding.
Packet: NIC → host kernel → OVS datapath → back to NIC
Host CPU: 20-30% consumed by OVS forwarding
Latency: kernel crossing ×2 per packet
Packet: NIC eSwitch matches flow table → forwards in hardware
Host CPU: <1% consumed (only first packet per flow)
Latency: cut-through in NIC hardware
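A minimal enable-and-verify sequence on the DPU Arm OS, using the same bridge and representor names as the checklist at the end of this page (the OVS restart command depends on the distro):

```bash
# Bridge the host PF representor and a VF representor
ovs-vsctl add-br br0
ovs-vsctl add-port br0 pf0hpf
ovs-vsctl add-port br0 pf0vf0

# Turn on hardware offload, then restart OVS so it takes effect
ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
systemctl restart openvswitch-switch   # service name varies by distro

# Once traffic is flowing, offloaded flows show up here
ovs-appctl dpctl/dump-flows type=offloaded
```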
| IPsec Acceleration Details | Value |
|---|---|
| Algorithm | AES-256-GCM (AEAD) |
| Throughput | 400 Gbps inline (BF-3) |
| Mode | Tunnel mode (ESP) |
| Host CPU cycles | Zero — fully in DPU HW |
| SA management | DOCA IPsec API / strongSwan |
| Key negotiation | IKEv2 on DPU Arm (strongSwan) |
| Use case | Encrypted overlay (overlay IPsec) |
| Standards | RFC 4303 (ESP), RFC 7296 (IKEv2) |
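A quick host-side check that an SA really landed in hardware, per the checklist at the end of this page; the exact wording of the offload line varies by kernel version:

```bash
# Dump IPsec security associations with statistics
ip -s xfrm state

# Hardware-offloaded SAs include an offload line, roughly of the form:
#   crypto offload parameters: dev p0 dir out
```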
Morpheus combines BlueField DPU telemetry with GPU-accelerated AI inference to detect threats in real time — without a performance impact on application workloads.
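To experiment with it, the Morpheus container referenced in the checklist at the end of this page can be pulled from NGC; the tag is a placeholder for a current release and the run flags are only the minimum for GPU access:

```bash
# Pull the Morpheus container (substitute a current release tag)
docker pull nvcr.io/nvidia/morpheus/morpheus:<tag>

# Inference pipelines need GPU access; DPU telemetry is fed in separately
docker run --rm -it --gpus all nvcr.io/nvidia/morpheus/morpheus:<tag> bash
```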
| Feature | Standard SmartNIC | BlueField DPU (Separated Mode) | BlueField SuperNIC (AI Mode) |
|---|---|---|---|
| Primary purpose | NIC offload (ASIC/FPGA) | Infrastructure isolation + offload | Max AI network performance |
| Arm CPU cores | Few or none | 16× Arm A78 — full OS | 16× Arm A78 — vendor FW only |
| Host isolation | None | Full — host treated as tenant | None (host sees full NIC) |
| DOCA SDK | No | Full DOCA 2.x support | Limited (network-focused) |
| OVS / ASAP² | Partial | Full ASAP² + DOCA Flow | Limited |
| IPsec inline | Sometimes | 400 Gbps hardware AES-GCM | Available |
| GPUDirect RDMA | Sometimes | Yes (via ConnectX-7) | Optimized — primary use case |
| Network speed (BF-3) | Varies | 400GbE | 400GbE — full line rate for AI |
| Use case | Basic offload | Cloud/Security/Storage | AI cluster / Spectrum-X |
| Specification | BlueField-2 | BlueField-3 |
|---|---|---|
| Arm CPU | 8× Arm A72 (Armv8) | 16× Arm A78 (Armv8.2+) |
| Network speed | 200GbE (2 × 100GbE) | 400GbE (2 × 200GbE) |
| NIC controller | ConnectX-6 Dx | ConnectX-7 |
| PCIe | Gen4 ×16 | Gen5 ×16 (2× BW) |
| Local DRAM | Up to 16 GB DDR4 | Up to 32 GB DDR5 |
| Crypto throughput | 200 Gbps inline IPsec | 400 Gbps inline IPsec |
| SuperNIC mode | No | Yes |
| DOCA version | DOCA 1.x | DOCA 2.x |
| GPUNetIO | Limited | Full DOCA GPUNetIO |
| Morpheus support | Partial | Full integration |
A DPU is not just a NIC — it has a full Arm CPU subsystem, hardware accelerators, and an independent OS. Think of it as a server-on-a-PCIe-card that manages the infrastructure so the host CPU doesn't have to.
ASAP² offloads Open vSwitch (OVS) forwarding from host CPU software to the BlueField eSwitch hardware. First packet: slow-path to OVS. All subsequent packets: line-rate in hardware eSwitch.
16 Arm A78 cores, ConnectX-7 NIC at 400GbE, PCIe Gen5. BF-2 was half: 8 × A72, ConnectX-6 Dx, 200GbE, Gen4.
"Separated mode" = Host OS and DPU Arm OS are fully isolated. This is the key security property that makes BlueField suitable for multi-tenant clouds — the cloud operator controls the DPU, the tenant controls only their VMs.
The Progress Engine (doca_pe) is poll-based — call doca_pe_progress() in your event loop to harvest task completions. Similar to DPDK's rx_burst model. No blocking, no interrupts.
DOCA GPUNetIO allows the NIC DMA engine to write incoming packets directly into GPU memory. The GPU CUDA kernel processes packets without the CPU ever touching them. Used for line-rate AI inference on network traffic.
Morpheus: DPU captures all traffic telemetry at line rate → streams to GPU → GPU runs transformer threat-detection models → DPU enforces block/alert. No host CPU involved in either capture or inference.
DOCA Flow processes packets through chained pipes. Each pipe has match fields (5-tuple, VLAN, metadata) and actions (encap, modify, meter, drop, forward). Think of it as a programmable hardware switch pipeline.
Checklists for common deployment scenarios:

OVS / ASAP² offload:
- Verify the DPU is in embedded mode: `mlxconfig -d /dev/mst/mt41686_pciconf0 q | grep INTERNAL_CPU_MODEL` — should return EMBEDDED_CPU(1).
- Ensure the openvswitch package from MLNX-OFED is used (not distro OVS — it won't have offload support).
- Create the bridge and add the representor ports: `ovs-vsctl add-br br0 && ovs-vsctl add-port br0 pf0hpf && ovs-vsctl add-port br0 pf0vf0`.
- Enable hardware offload: `ovs-vsctl set Open_vSwitch . other_config:hw-offload=true`, then restart OVS.
- Verify with `ovs-appctl dpctl/dump-flows type=offloaded` — flows with hardware offload active will appear.
- Confirm host CPU usage with `top` under heavy traffic.
- Connection tracking: `ovs-vsctl set Open_vSwitch . other_config:ct-size=...` — offloads the connection state table to eSwitch hardware.

IPsec / MACsec:
- Check SA offload with `ip -s xfrm state` on the host — look for the offload hw flag in the SA output.
- For MACsec, configure with `ip macsec` commands — BlueField handles encryption/decryption per-port in ConnectX-7 hardware at zero CPU cost.

DOCA development:
- Install the DOCA dev container (`nvcr.io/nvidia/doca/doca`) or the DEB package from NVIDIA's developer portal.
- Follow the standard object lifecycle: doca_[lib]_create() → doca_ctx_dev_add() → doca_pe_create() → doca_ctx_set_pe() → doca_ctx_start() → submit tasks → doca_pe_progress() loop.
- Register memory with doca_mmap_create() → doca_mmap_set_memrange() → doca_mmap_start() — required for both host and DPU memory regions.
- Call doca_flow_init() before any port/pipe operations. Start ports before creating pipes. Create pipes before adding entries.
- Build the samples in /opt/mellanox/doca/samples/ on the BF Arm — start with doca_compress or doca_flow_drop before moving to complex pipelines.
- Set doca_log_level_set_global(DOCA_LOG_LEVEL_DEBUG) at program start to see per-library trace output.

Morpheus:
- Pull the Morpheus container from nvcr.io/nvidia/morpheus/morpheus.
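A minimal starting point for the DOCA development items above; the container tag is a placeholder, and the sample listing is run on the BlueField Arm OS:

```bash
# Pull the DOCA development container from NGC (or install the DEB from the developer portal)
docker pull nvcr.io/nvidia/doca/doca:<tag>

# The bundled samples live on the BlueField Arm OS; doca_compress and doca_flow_drop are good first builds
ls /opt/mellanox/doca/samples/
```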