nvml gpu topo

nvmlGpuTopologyLevel_t

typedef enum nvmlGpuLevel_enum
{
    NVML_TOPOLOGY_INTERNAL           = 0, // e.g. Tesla K80
    NVML_TOPOLOGY_SINGLE             = 10, // all devices that only need traverse a single PCIe switch
    NVML_TOPOLOGY_MULTIPLE           = 20, // all devices that need not traverse a host bridge
    NVML_TOPOLOGY_HOSTBRIDGE         = 30, // all devices that are connected to the same host bridge
    NVML_TOPOLOGY_NODE               = 40, // all devices that are connected to the same NUMA node but possibly multiple host bridges
    NVML_TOPOLOGY_SYSTEM             = 50  // all devices in the system

    // there is purposefully no COUNT here because of the need for spacing above
} nvmlGpuTopologyLevel_t;

NVML_TOPOLOGY_INTERNAL
GPUs are on the same board (e.g., dual-GPU card). Fastest connection.

NVML_TOPOLOGY_SINGLE
GPUs sit under the same PCIe switch (for example, two cards behind the same PLX switch on a riser). Lower latency and higher bandwidth than the more distant levels below.

NVML_TOPOLOGY_MULTIPLE
GPUs are under different PCIe switches but still below the same host bridge. This adds more PCIe hops.

NVML_TOPOLOGY_HOSTBRIDGE
GPUs are connected through a host bridge. A “host bridge” connects the CPU/system root complex to one or more PCIe hierarchies.
→ So two GPUs at this level are still on the same host; their PCIe paths split at the root complex itself rather than at a shared PCIe switch.

NVML_TOPOLOGY_NODE
GPUs are on the same NUMA node but attached to different host bridges (root complexes) within that node. Traffic stays inside the NUMA node yet must cross the interconnect between host bridges, so latency is higher than at the HOSTBRIDGE level.

NVML_TOPOLOGY_SYSTEM
GPUs sit on different NUMA nodes, so traffic must cross the full system interconnect (the SMP link between sockets, e.g., QPI/UPI). This is the farthest relationship NVML reports.
→ Still within the same physical host (unless you’re in a virtualized/multi-host environment with NVSwitch across nodes, which is rare outside DGX SuperPOD).
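
To see which of these levels applies to a given pair of GPUs, NVML exposes nvmlDeviceGetTopologyCommonAncestor. A minimal sketch (error handling trimmed; the file name and build line are illustrative):

// minimal sketch: print the NVML topology level for every GPU pair
// build: gcc topo_pairs.c -o topo_pairs -lnvidia-ml
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int n = 0;
    nvmlDeviceGetCount(&n);

    for (unsigned int i = 0; i < n; ++i) {
        for (unsigned int j = i + 1; j < n; ++j) {
            nvmlDevice_t a, b;
            nvmlGpuTopologyLevel_t level;
            if (nvmlDeviceGetHandleByIndex(i, &a) == NVML_SUCCESS &&
                nvmlDeviceGetHandleByIndex(j, &b) == NVML_SUCCESS &&
                nvmlDeviceGetTopologyCommonAncestor(a, b, &level) == NVML_SUCCESS)
                printf("GPU%u <-> GPU%u : level %d\n", i, j, (int)level);  // 0/10/20/30/40/50 per the enum above
        }
    }

    nvmlShutdown();
    return 0;
}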

what is a dual-GPU card

A dual-GPU card is a single expansion card (one PCB plugged into one PCIe slot) that carries two separate GPU dies (packages), each with its own memory, power regulators, and often its own cooling, but sharing the same board and PCIe interface.

Examples:

NVIDIA GeForce GTX 690 (Kepler, 2012) → 2 × GK104 GPUs on one board

NVIDIA Tesla K80 (Kepler, 2014) → 2 × GK210 GPUs on one board, often used in datacenters

AMD Radeon HD 7990 (Tahiti, 2013) → 2 × Tahiti GPUs

Why it matters for NVML:

If you query topology with NVML, two GPUs on the same PCB (dual-GPU card) will usually return
NVML_TOPOLOGY_INTERNAL → meaning closest possible connection, since they may share a PCIe bridge chip directly on the card.

👉 In short: a dual-GPU card = two GPU packages on the same PCB, plugged into a single PCIe slot.
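
NVML can also answer the reverse question: which GPUs are closest to a given one. A hedged sketch using nvmlDeviceGetTopologyNearestGpus at the INTERNAL level (the 64-entry peers[] buffer is an arbitrary choice for illustration):

// minimal sketch: list GPUs that share the same board as GPU 0
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t gpu0;
    if (nvmlDeviceGetHandleByIndex(0, &gpu0) == NVML_SUCCESS) {
        nvmlDevice_t peers[64];
        unsigned int count = 64;  // in: capacity of peers[]; out: number of GPUs found
        if (nvmlDeviceGetTopologyNearestGpus(gpu0, NVML_TOPOLOGY_INTERNAL,
                                             &count, peers) == NVML_SUCCESS) {
            printf("GPU0 has %u peer(s) at the INTERNAL level\n", count);
            for (unsigned int i = 0; i < count; ++i) {
                unsigned int idx;
                if (nvmlDeviceGetIndex(peers[i], &idx) == NVML_SUCCESS)
                    printf("  GPU%u\n", idx);
            }
        }
    }

    nvmlShutdown();
    return 0;
}

On a single-die GPU this would be expected to report zero peers; on a dual-GPU board like the K80 it should list the other die (a hedged expectation, not verified here).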

What is a PCIe switch?

A hardware component that fans out PCIe lanes. Think of it like an Ethernet switch but for PCIe. Multiple devices (GPUs, NICs, NVMe) can attach under a PCIe switch.

What is a host bridge?

The component that connects CPU/system memory (root complex) to one or more PCIe hierarchies. In multi-socket servers, each CPU typically has its own host bridge.

GPUs connected through a host bridge: same host or different host?

Same host. They are just under different PCIe hierarchies attached to the same CPU root complex.

“All devices in the system”: same host or different host?

Same host. NVML_TOPOLOGY_SYSTEM means "the farthest possible connection within this system."
It doesn’t mean cross-host. NVML itself is per-host and does not describe connections between different physical servers.

graph TD
    CPU["CPU / Root Complex<br/>(PCIe Controller)"]
    Switch["PCIe Switch<br/>(fan-out)"]

    CPU -->|x16 lanes| Switch

    subgraph "PCIe Slots"
        Slot1["PCIe Slot x16<br/>(16 lanes)"]
        Slot2["PCIe Slot x8<br/>(8 lanes)"]
        Slot3["PCIe Slot x4<br/>(4 lanes)"]
        Slot4["PCIe Slot x1<br/>(1 lane)"]
    end

    Switch -->|x16 lanes| Slot1
    Switch -->|x8 lanes| Slot2
    Switch -->|x4 lanes| Slot3
    Switch -->|x1 lane| Slot4

can a NUMA node have multiple CPU sockets?

Yes — but it depends on the system architecture.

Most common today:
A NUMA node = 1 CPU socket + its directly attached memory.
This is the standard mapping in modern x86 servers (Intel Xeon, AMD EPYC). Each socket is its own NUMA node.

Possible but less common:
A NUMA node can include multiple sockets if the firmware/OS groups them that way. This was more typical in older systems (e.g., some SGI or IBM big-iron machines) or if the BIOS is set to “NUMA = off” / “Node Interleaving.” In that case, the OS may see 1 NUMA node spanning multiple sockets.

Also possible:
A single socket can expose multiple NUMA nodes. Example:

AMD EPYC (Naples/Rome/Milan) is built from multiple chiplets. On Naples each die is its own NUMA node (four per socket); on Rome/Milan the BIOS “NPS” (NUMA nodes per socket) setting can expose one socket as 1, 2, or 4 NUMA nodes, even though everything sits in the same physical package.

👉 So:

By default: 1 NUMA node ≈ 1 CPU socket.

But depending on system design or BIOS/firmware config:

One node can span multiple sockets, or

One socket can be split into multiple NUMA nodes.
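
The GPU side of this mapping can be inspected with NVML: nvmlDeviceGetCpuAffinity returns the bitmask of logical CPUs closest to a GPU, which in practice identifies its NUMA node (compare with the CPU Affinity / NUMA Affinity columns of the nvidia-smi topo -m output below). A minimal sketch; the 1024-CPU upper bound is an arbitrary assumption:

// minimal sketch: print each GPU's ideal CPU affinity mask
#include <stdio.h>
#include <nvml.h>

#define CPU_SET_WORDS 16  // 16 words of unsigned long, up to 1024 logical CPUs on LP64 (assumed upper bound)

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    unsigned int n = 0;
    nvmlDeviceGetCount(&n);

    const unsigned int bits = 8 * sizeof(unsigned long);
    for (unsigned int i = 0; i < n; ++i) {
        nvmlDevice_t dev;
        unsigned long cpuSet[CPU_SET_WORDS] = {0};
        if (nvmlDeviceGetHandleByIndex(i, &dev) == NVML_SUCCESS &&
            nvmlDeviceGetCpuAffinity(dev, CPU_SET_WORDS, cpuSet) == NVML_SUCCESS) {
            printf("GPU%u ideal CPUs:", i);
            for (unsigned int cpu = 0; cpu < CPU_SET_WORDS * bits; ++cpu)
                if (cpuSet[cpu / bits] & (1UL << (cpu % bits)))
                    printf(" %u", cpu);
            printf("\n");
        }
    }

    nvmlShutdown();
    return 0;
}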

nvidia-smi topo output

55006|JYTFY-D1-308-H100-D01-4|2025-09-22 10:36:21[like@ ~]nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0,2,4,6,8,10    0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PIX     NODE    SYS     SYS     0,2,4,6,8,10    0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    PIX     SYS     SYS     0,2,4,6,8,10    0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    SYS     SYS     0,2,4,6,8,10    0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     NODE    NODE    1,3,5,7,9,11    1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     PIX     NODE    1,3,5,7,9,11    1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     NODE    NODE    1,3,5,7,9,11    1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     NODE    PIX     1,3,5,7,9,11    1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC1    NODE    NODE    PIX     NODE    SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     NODE    PIX     NODE    NODE    SYS     SYS      X      NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    NODE    PIX     SYS     SYS     NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
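
The NV18 entries above mean 18 bonded NVLinks between each GPU pair. A hedged sketch of how to enumerate a GPU's active NVLinks with NVML (on an NVSwitch-based system like this H100 box, the remote PCI endpoint of each link is an NVSwitch rather than a peer GPU; on directly wired systems it is the peer GPU itself):

// minimal sketch: list GPU 0's active NVLinks and their remote PCI endpoints
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS) return 1;

    nvmlDevice_t gpu0;
    if (nvmlDeviceGetHandleByIndex(0, &gpu0) == NVML_SUCCESS) {
        for (unsigned int link = 0; link < NVML_NVLINK_MAX_LINKS; ++link) {
            nvmlEnableState_t active;
            if (nvmlDeviceGetNvLinkState(gpu0, link, &active) != NVML_SUCCESS ||
                active != NVML_FEATURE_ENABLED)
                continue;  // link not present or not up
            nvmlPciInfo_t remote;
            // recent headers may resolve this to the _v2 variant; behavior is the same here
            if (nvmlDeviceGetNvLinkRemotePciInfo(gpu0, link, &remote) == NVML_SUCCESS)
                printf("NVLink %2u -> %s\n", link, remote.busId);
        }
    }

    nvmlShutdown();
    return 0;
}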

Aligning the topo levels across siml / NVML / nvidia-smi topo

# same host, cross NUMA
SMI_PATH_SYS = 6,                ///< Cross-NUMA connection
NVML_TOPOLOGY_SYSTEM = 50        // all devices in the system
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)

# same NUMA node, cross host bridge (host bridge = root complex, per Wikipedia)
SMI_PATH_NODE = 5,               ///< NUMA node internal
NVML_TOPOLOGY_NODE = 40          // all devices that are connected to the same NUMA node but possibly multiple host bridges
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node

# same host bridge
SMI_PATH_PHB = 4,                ///< PCIe Host Bridge
NVML_TOPOLOGY_HOSTBRIDGE = 30    // all devices that are connected to the same host bridge
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
The PHB label means the data must traverse the PCIe host bridge, typically the CPU's root complex; this adds some latency because traffic passes through the CPU on the way to its destination.

# different PCIe switches, but still below the same host bridge
SMI_PATH_PXB = 3,                ///< Multiple PCIe bridges
NVML_TOPOLOGY_MULTIPLE = 20      // all devices that need not traverse a host bridge
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)

# same PCIe switch
SMI_PATH_PIX = 2,                ///< Single PCIe bridge
NVML_TOPOLOGY_SINGLE = 10        // all devices that only need traverse a single PCIe switch
PIX  = Connection traversing at most a single PCIe bridge
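
The same alignment as a small helper, mapping an NVML level to the label nvidia-smi prints for the PCIe path. The SMI_PATH_* constants above come from siml, not NVML, so they are left out; the comments just restate the table above:

#include <nvml.h>

// minimal sketch: nvmlGpuTopologyLevel_t -> nvidia-smi topo label for the PCIe path
static const char *nvml_level_to_smi_label(nvmlGpuTopologyLevel_t level)
{
    switch (level) {
    case NVML_TOPOLOGY_INTERNAL:   return "internal";  // same board; nvidia-smi has no separate label for this
    case NVML_TOPOLOGY_SINGLE:     return "PIX";       // at most a single PCIe bridge
    case NVML_TOPOLOGY_MULTIPLE:   return "PXB";       // multiple PCIe bridges, no host bridge
    case NVML_TOPOLOGY_HOSTBRIDGE: return "PHB";       // same PCIe host bridge
    case NVML_TOPOLOGY_NODE:       return "NODE";      // same NUMA node, across host bridges
    case NVML_TOPOLOGY_SYSTEM:     return "SYS";       // across the NUMA interconnect
    default:                       return "?";
    }
}

The NV# labels are orthogonal to this mapping: nvidia-smi prints them instead of the PCIe label whenever a bonded NVLink path exists between the two devices.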

How to check whether a server has InfiniBand

ibstat

https://static.189505.xyz/blogTexts/ibstat.verbose.siorigin.h100.txt

ibv_devinfo

https://static.189505.xyz/blogTexts/ibv_devinfo.verbose.siorigin.h100.txt

lspci | grep -i mell

https://static.189505.xyz/blogTexts/lspcivv.verbose.siorigin.h100.txt

ip link show

https://static.189505.xyz/blogTexts/ip.link.show.siorigin.h100.log

ethtool -i

for I in `ip link show | grep ibp | awk -F: '{print $2}'`; do echo "=========== eth:$I"; ethtool -i $I; done 

https://static.189505.xyz/blogTexts/ethtool.siorigin.h100.txt
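
Beyond the commands above, a programmatic option is to enumerate RDMA devices with libibverbs. A minimal sketch (file name is illustrative; build with gcc check_ib.c -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void) {
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        printf("libibverbs is not usable on this host\n");
        return 1;
    }
    if (num == 0)
        printf("no RDMA-capable devices found\n");
    for (int i = 0; i < num; ++i)
        printf("%s\n", ibv_get_device_name(list[i]));  // e.g. mlx5_0 .. mlx5_3
    ibv_free_device_list(list);
    return 0;
}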
