当前位置: > 其它学习 > Nvidia >

NVIDIA GPU 监控观测最佳实践

时间:2024-09-26 20:23来源:linux.it.net.cn 作者:IT
1、DCGM 介绍
DCGM(Data Center GPU Manager)即数据中心 GPU 管理器,是一套用于在集群环境中管理和监视 Tesla™GPU 的工具。它包括主动健康监控,全面诊断,系统警报以及包括电源和时钟管理在内的治理策略。它可以由系统管理员独立使用,并且可以轻松地集成到 NVIDIA 合作伙伴的集群管理,资源调度和监视产品中。DCGM 简化了数据中心中的 GPU 管理,提高了资源可靠性和正常运行时间,自动化了管理任务,并有助于提高整体基础架构效率。DCGM 提供了种类丰富的 GPU 监控指标,有如下功能特性:
 
GPU 行为监控
GPU 配置管理
GPU Policy 管理
GPU 健康诊断
GPU 级别统计和线程级别统计
NVSwitch 配置和监控

 
2、使用限制
节点 NVIDIA GPU 驱动版本 ≥418.87.01。如果您需要收集 GPU Profiling,则节点 NVIDIA GPU 驱动版本 ≥450.80.02。关于 GPU Profiling 的更多信息,请参见  Feature Overview。
节点的 NVIDIA GPU 驱动版本不能为 5XX 系列(驱动版本以 5 开头,例如:510.47.03)。
您可以通过 SSH 登录 GPU 节点,执行 nvidia-smi 命令,查看安装的 GPU 驱动版本。
 
3、DCGM/dcgm-exporter 安装
3.1、docker 方式
3.1.1、安装 dcgm
tips:dcgm-exporter 可以连接到现有的 dcgm 代理,本次采用新建的方式连接到 dcgm 独立容器。
参考文档: 点击链接
 
docker run -d --gpus all --cap-add SYS_ADMIN -p 5556:5555 --name dcgm nvidia/dcgm:2.2.9-ubuntu20.04
 
3.1.2、安装 dcgm-exporter
docker run -d  --gpus all    --net host    --cap-add SYS_ADMIN -v /root/dcgm-exporter/dcp-metrics-included.csv:/etc/dcgmExporter/dcp-metrics-included.csv --name dcgm-exporter   nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04    -r    localhost:5556 -f /etc/dcgmExporter/dcp-metrics-included.csv
1.
/root/dcgm-exporter/dcp-metrics-included.csv 文件(该文件见附录 A),拷贝到宿主机该目录。
 
3.2、k8s 方式
将 GPU 所在 node 节点打 label。
 
# 节点设置 Label 标签
kubectl label nodes <gpu-node-name> nvidia-gpu=monitoring
# 查看节点是否设置 Label 成功
kubectl get nodes --show-labels
 
dcgm-metrics.yaml 见附录 B。
 
dcgm-exporter.yaml 见附录 C。
 
kubectl apply -f dcgm-metrics.yaml
kubectl apply -f dcgm-exporter.yaml
# 查看monitoring空间下,各资源状态
kubectl -n monitoring get svc,pod -l app.kubernetes.io/name=dcgm-exporter
 
4、指标暴露情况确认
调用 dcgm-exporter 接口,验证 GPU 指标获取情况;假设 172.16.0.114 为 pod/container 的 IP,显示数据如下,显示结果会根据 GPU 卡的数量不同而显示不同的记录数,如图为 8 张卡。
 
curl 172.16.0.114:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
 
同时可以验证获取 profiling 指标情况:
 
curl 172.16.0.114:9400/metrics | grep DCGM_FI_PROF_SM_ACTIVE
 
 
 
5、安装 datakit
# 需要把token 改成观测云空间的实际token值(可在「观测云控制台」-「集成」-「Datakit」 上面获取)
DK_DATAWAY="https://openway.guance.com?token=tkn_xxxxxx" bash -c "$(curl -L https://static.guance.com/datakit/install.sh)" 
# datakit安装成功后,开启prom采集器
cd /usr/local/datakit/conf.d/prom
cp prom.conf.sample prom.conf
 
# tips:根据情况修改urls数组,如果写127回环地址,上报观测云上host显示为目标机器的hostname,如果写真实IP地址,则host显示为IP地址
# 为了保证GPU指标都上报到同一个指标集,这里需要指定指标集为measurement_name = "gpu_dcgm"
 
# 本次使用prom采集器,当然,我们仍然可以使用servicemonitor自动发现功能。
 
# 重启datakit
datakit service -R
 
6、展示效果
 
 
附录
附录 A
# Format
# If line starts with a '#' it is considered a comment
# DCGM FIELD, Prometheus metric type, help message
 
# Clocks
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
 
# Temperature
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
 
# Power
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
 
# PCIE
# DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
# DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
 
# Utilization (the sample period varies depending on the product)
# DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
 
# Errors and violations
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
 
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
 
# ECC
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
 
# Retired pages
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
 
# NVLink
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.
 
# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
 
# Remapped rows
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
 
# Static configuration information. These appear as labels on the other metrics
DCGM_FI_DRIVER_VERSION,        label, Driver Version
# DCGM_FI_NVML_VERSION,          label, NVML Version
# DCGM_FI_DEV_BRAND,             label, Device Brand
# DCGM_FI_DEV_SERIAL,            label, Device Serial Number
# DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
# DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
# DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
# DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
# DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device
 
# DCP metrics
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
# DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
# DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
 
 
附录 B
# dcgm-metrics.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dcgm-metrics
  namespace: monitoring
data:
  default-counter.csv: |
    # Format
    # If line starts with a '#' it is considered a comment
    # DCGM FIELD, Prometheus metric type, help message
    
    # Clocks
    DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
    DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
    
    # Temperature
    DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
    DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
    
    # Power
    DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
    DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
    
    # PCIE
    # DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
    # DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
    DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
    
    # Utilization (the sample period varies depending on the product)
    # DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
    DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
    DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
    DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
    
    # Errors and violations
    DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
    # DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
    # DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
    # DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
    # DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
    # DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
    # DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
    
    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
    
    # ECC
    # DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
    # DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
    # DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
    
    # Retired pages
    # DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
    # DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
    # DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
    
    # NVLink
    # DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
    # DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
    # DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
    # DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
    DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
    # DCGM_FI_DEV_NVLINK_BANDWIDTH_L0,               counter, The number of bytes of active NVLink rx or tx data including both header and payload.
    
    # VGPU License status
    DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
    
    # Remapped rows
    DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
    DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
    DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
    
    # Static configuration information. These appear as labels on the other metrics
    DCGM_FI_DRIVER_VERSION,        label, Driver Version
    # DCGM_FI_NVML_VERSION,          label, NVML Version
    # DCGM_FI_DEV_BRAND,             label, Device Brand
    # DCGM_FI_DEV_SERIAL,            label, Device Serial Number
    # DCGM_FI_DEV_OEM_INFOROM_VER,   label, OEM inforom version
    # DCGM_FI_DEV_ECC_INFOROM_VER,   label, ECC inforom version
    # DCGM_FI_DEV_POWER_INFOROM_VER, label, Power management object inforom version
    # DCGM_FI_DEV_INFOROM_IMAGE_VER, label, Inforom image version
    # DCGM_FI_DEV_VBIOS_VERSION,     label, VBIOS version of the device
    
    # DCP metrics
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    # DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
    # DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
    DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data transmitted over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
    DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data received over the PCIe bus - including both protocol headers and data payloads - in bytes per second.
 
附录 C
复制 
# dcgm-exporter.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"      #请根据实际情况选择命名空间安装
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: "dcgm-exporter"
      app.kubernetes.io/version: "2.4.0"
  template:
    metadata:
      labels:
        app.kubernetes.io/name: "dcgm-exporter"
        app.kubernetes.io/version: "2.4.0"
      name: "dcgm-exporter"
    spec:
      containers:
      - image: "nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04"
        env:
        - name: "DCGM_EXPORTER_LISTEN"                   # 服务端口号
          value: ":9400"
        - name: "DCGM_EXPORTER_KUBERNETES"               # 支持Kubernetes指标映射到Pod
          value: "true"
        name: "dcgm-exporter"
        ports:
        - name: "metrics"
          containerPort: 9400
        resources:      #建议根据实际情况配置资源使用申请值和限制值
          limits:
            cpu: '200m'
            memory: '256Mi'
          requests:
            cpu: 100m
            memory: 128Mi
        securityContext:      #需要给dcgm-exporter容器开启特权模式
          privileged: true
          runAsNonRoot: false
          runAsUser: 0
        volumeMounts:
        - name: "pod-gpu-resources"
          readOnly: true
          mountPath: "/var/lib/kubelet/pod-resources"
        - name: "gpu-metrics"
          readOnly: true
          mountPath: "/etc/dcgm-exporter"
      volumes:
      - name: "pod-gpu-resources"
        hostPath:
          path: "/var/lib/kubelet/pod-resources"
      - name: "gpu-metrics"
        configmap:
          name: "dcgm-metrics"
      nodeSelector:
        nvidia-gpu: "monitoring"
---
kind: Service
apiVersion: v1
metadata:
  name: "dcgm-exporter"
  namespace: "monitoring"
  labels:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
spec:
  selector:
    app.kubernetes.io/name: "dcgm-exporter"
    app.kubernetes.io/version: "2.4.0"
  ports:
  - name: "metrics"
    port: 9400
-----------------------------------
 
(责任编辑:IT)
------分隔线----------------------------