- You can import this gpu-dashboard.json into Grafana.
- This dashboard also includes a subset of NVIDIA DCGM metrics. To collect them, deploy dcgm-exporter:
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/dcgm-exporter/master/dcgm-exporter.yaml
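Before wiring the exporter into Prometheus, it can help to confirm that it actually serves DCGM metrics. The following is a minimal sketch, not part of the original instructions: `count_dcgm_samples` is a hypothetical helper, and the Service name `dcgm-exporter` and port `9400` in the usage comment are assumptions that should be checked against the manifest you deployed.

```shell
#!/bin/sh
# Count metric samples whose name starts with DCGM_ in Prometheus text-format
# input read from stdin. Comment lines (# HELP, # TYPE) are not counted.
count_dcgm_samples() {
  # grep -c prints the count but exits 1 when it is zero, so mask the status.
  grep -c '^DCGM_' || true
}

# Example use against a live cluster (assumed Service name and port):
#   kubectl port-forward svc/dcgm-exporter 9400:9400 &
#   curl -s http://localhost:9400/metrics | count_dcgm_samples
```

A count of zero suggests the exporter is not running on a GPU node or is not reachable through the forwarded port.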
- Add the following custom scrape jobs to the Prometheus configuration:
    - job_name: 'kubernetes-vgpu-exporter'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: vgpu-device-plugin-monitor
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_pod_node_name]
        regex: (.*)
        target_label: node_name
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_pod_host_ip]
        regex: (.*)
        target_label: ip
        replacement: $1
        action: replace
    - job_name: 'kubernetes-dcgm-exporter'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_endpoints_name]
        regex: dcgm-exporter
        replacement: $1
        action: keep
      - source_labels: [__meta_kubernetes_pod_node_name]
        regex: (.*)
        target_label: node_name
        replacement: ${1}
        action: replace
      - source_labels: [__meta_kubernetes_pod_host_ip]
        regex: (.*)
        target_label: ip
        replacement: $1
        action: replace
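These two jobs belong under the `scrape_configs:` key of the Prometheus configuration file. A quick sanity check before reloading could look like the sketch below. The helper names `has_job` and `check_gpu_jobs` are hypothetical, and the check assumes each `job_name` sits on its own line as in the jobs above. `promtool check config` is the validator shipped with Prometheus and is only invoked when it is on the `PATH`.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the config file declares the given job_name.
has_job() {
  # $1 = config file path, $2 = job name (quotes around the name are optional)
  grep -Eq "job_name:[[:space:]]*['\"]?$2['\"]?[[:space:]]*$" "$1"
}

# Check that both GPU jobs are present, then let promtool validate the whole
# file when it is installed.
check_gpu_jobs() {
  cfg="$1"
  for job in kubernetes-vgpu-exporter kubernetes-dcgm-exporter; do
    has_job "$cfg" "$job" || { echo "missing job: $job" >&2; return 1; }
  done
  if command -v promtool >/dev/null 2>&1; then
    promtool check config "$cfg"
  fi
}
```

For example, `check_gpu_jobs /etc/prometheus/prometheus.yml` (the path is an assumption) returns non-zero if either job is missing or if promtool rejects the file.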
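- Reload the Prometheus configuration (the reload endpoint is only available when Prometheus is started with `--web.enable-lifecycle`):
    curl -XPOST http://{prometheusServer}:{port}/-/reload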
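The reload call can be wrapped so that failures are visible instead of silent. This is a minimal sketch under assumptions: `reload_url` and `reload_prometheus` are hypothetical helper names, and the port defaults to 9090 (the Prometheus default) when none is given.

```shell
#!/bin/sh
# Build the lifecycle reload URL for a Prometheus server.
reload_url() {
  # $1 = host, $2 = port (falls back to 9090 when empty or unset)
  printf 'http://%s:%s/-/reload\n' "$1" "${2:-9090}"
}

# POST to the reload endpoint. -f makes curl exit non-zero on an HTTP error,
# for example when the lifecycle API is disabled or the new config is invalid.
reload_prometheus() {
  curl -fsS -X POST "$(reload_url "$1" "$2")"
}
```

Calling `reload_prometheus my-prometheus 9090 && echo reloaded` (the host name is a placeholder) prints `reloaded` only if the server accepted the request.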