Grafana Dashboard

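  • You can import this gpu-dashboard.json into Grafana.

  • This dashboard also includes some NVIDIA DCGM metrics:

    Deploy dcgm-exporter: kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml

  • Add the following custom scrape jobs to the Prometheus configuration (under scrape_configs):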

# Scrape the vGPU device plugin monitor endpoints.
- job_name: 'kubernetes-vgpu-exporter'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  # Keep only targets that belong to the vgpu-device-plugin-monitor endpoints object.
  - source_labels: [__meta_kubernetes_endpoints_name]
    regex: vgpu-device-plugin-monitor
    replacement: $1
    action: keep
  # Copy the node name of the backing pod into the node_name label.
  - source_labels: [__meta_kubernetes_pod_node_name]
    regex: (.*)
    target_label: node_name
    replacement: ${1}
    action: replace
  # Copy the host IP of the backing pod into the ip label.
  - source_labels: [__meta_kubernetes_pod_host_ip]
    regex: (.*)
    target_label: ip
    replacement: $1
    action: replace
# Scrape the NVIDIA dcgm-exporter endpoints with the same relabeling.
- job_name: 'kubernetes-dcgm-exporter'
  kubernetes_sd_configs:
  - role: endpoints
  relabel_configs:
  - source_labels: [__meta_kubernetes_endpoints_name]
    regex: dcgm-exporter
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_pod_node_name]
    regex: (.*)
    target_label: node_name
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_pod_host_ip]
    regex: (.*)
    target_label: ip
    replacement: $1
    action: replace
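
To make the effect of the relabel rules above easier to follow, here is a minimal Python sketch that imitates the `keep` and `replace` actions on a single discovered target. It is an illustration only, not Prometheus's own implementation: the `apply_relabel` helper and the example node name and IP are hypothetical, and details such as label removal on an empty replacement are left out.

```python
import re

# The three rules of the 'kubernetes-vgpu-exporter' job, written as plain dicts.
VGPU_RULES = [
    {"source_labels": ["__meta_kubernetes_endpoints_name"],
     "regex": "vgpu-device-plugin-monitor", "action": "keep"},
    {"source_labels": ["__meta_kubernetes_pod_node_name"],
     "regex": "(.*)", "target_label": "node_name",
     "replacement": "${1}", "action": "replace"},
    {"source_labels": ["__meta_kubernetes_pod_host_ip"],
     "regex": "(.*)", "target_label": "ip",
     "replacement": "$1", "action": "replace"},
]


def apply_relabel(labels, rules):
    """Return the relabeled dict, or None if a 'keep' rule drops the target."""
    labels = dict(labels)
    for rule in rules:
        # Source label values are joined with ';' (the Prometheus default separator).
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # Prometheus anchors the regex at both ends; fullmatch mimics that.
        match = re.fullmatch(rule.get("regex", "(.*)"), value)
        action = rule.get("action", "replace")
        if action == "keep":
            if match is None:
                return None
        elif action == "replace" and match is not None:
            # Translate $1 / ${1} references into Python's \g<1> syntax.
            template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>",
                              rule.get("replacement", "$1"))
            labels[rule["target_label"]] = match.expand(template)
    return labels


# Hypothetical discovered target, for illustration only.
example = {
    "__meta_kubernetes_endpoints_name": "vgpu-device-plugin-monitor",
    "__meta_kubernetes_pod_node_name": "gpu-node-1",
    "__meta_kubernetes_pod_host_ip": "10.0.0.5",
}
relabeled = apply_relabel(example, VGPU_RULES)
print(relabeled["node_name"], relabeled["ip"])
```

A target whose endpoints name is anything other than `vgpu-device-plugin-monitor` is dropped by the first rule, so only the monitor's endpoints end up in the job. Both `$1` and `${1}` refer to the first capture group.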
  • Reload the Prometheus configuration:
curl -XPOST http://{prometheusServer}:{port}/-/reload
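
After reloading, it can be useful to confirm that both jobs actually picked up targets. The sketch below is one possible way to do that with Python's standard library. It assumes Prometheus was started with `--web.enable-lifecycle` (required for `/-/reload`) and that its HTTP API is reachable at the base URL you pass in; the function names `summarize_targets` and `reload_and_check` are hypothetical helpers, not part of this project.

```python
import json
import urllib.request

# Job names taken from the scrape configuration above.
JOBS = ("kubernetes-vgpu-exporter", "kubernetes-dcgm-exporter")


def summarize_targets(payload, jobs=JOBS):
    """Map each job to a list of (scrape URL, health) pairs.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    summary = {job: [] for job in jobs}
    for target in payload.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job")
        if job in summary:
            summary[job].append((target.get("scrapeUrl"), target.get("health")))
    return summary


def reload_and_check(base_url):
    """POST to /-/reload, then return the active targets of the two jobs."""
    reload_req = urllib.request.Request(base_url + "/-/reload", data=b"", method="POST")
    with urllib.request.urlopen(reload_req, timeout=10):
        pass  # a non-2xx status raises urllib.error.HTTPError
    with urllib.request.urlopen(base_url + "/api/v1/targets", timeout=10) as resp:
        return summarize_targets(json.load(resp))
```

For example, `reload_and_check("http://{prometheusServer}:{port}")` with the placeholders filled in returns a dict keyed by job name. An empty list for a job suggests that its `keep` rule matched no endpoints, or that Prometheus has not yet refreshed its target list after the reload.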