[agent] Detect available GPU devices with WLM #33952

Stephanie0829 · 2025-02-11T20:17:15Z

What does this PR do?

Motivation

Describe how you validated your changes

Locally, logs show no GPU detected:

On pulumi instance with nvidia driver, logs show GPU was detected:

Possible Drawbacks / Trade-offs

Additional Notes

agent-platform-auto-pr · 2025-02-11T20:49:08Z

Static quality checks ✅

Please find below the results from static quality gates

Info

Result	Quality gate	On disk size	On disk size limit	On wire size	On wire size limit
✅	static_quality_gate_agent_deb_amd64	845.02MiB	858.45MiB	203.59MiB	214.3MiB
✅	static_quality_gate_docker_agent_amd64	929.39MiB	942.69MiB	310.74MiB	321.56MiB

cit-pr-commenter · 2025-02-11T21:17:36Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 9578db4-3614-4a2c-9bce-a8a581180cf2

Baseline: c83bdcf
Comparison: 5f527e0
Diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	quality_gate_logs	% cpu utilization	+2.03	[-1.08, +5.14]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+1.85	[+1.79, +1.92]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.85	[-0.04, +1.74]	1	Logs
➖	quality_gate_idle	memory utilization	+0.82	[+0.79, +0.85]	1	Logs bounds checks dashboard
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.37	[-0.42, +1.17]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.04	[-0.74, +0.83]	1	Logs
➖	file_tree	memory utilization	+0.04	[-0.03, +0.11]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	+0.04	[-0.84, +0.91]	1	Logs
➖	file_to_blackhole_0ms_latency_http1	egress throughput	+0.03	[-0.87, +0.93]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	+0.02	[-0.62, +0.65]	1	Logs
➖	file_to_blackhole_0ms_latency_http2	egress throughput	+0.01	[-0.82, +0.84]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.03, +0.02]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.02	[-0.74, +0.71]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.02	[-0.31, +0.26]	1	Logs
➖	file_to_blackhole_1000ms_latency_linear_load	egress throughput	-0.16	[-0.63, +0.31]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	-0.20	[-0.28, -0.13]	1	Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	links
✅	file_to_blackhole_0ms_latency	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http1	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http1	memory_usage	10/10
✅	file_to_blackhole_0ms_latency_http2	lost_bytes	10/10
✅	file_to_blackhole_0ms_latency_http2	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency_linear_load	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	lost_bytes	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	lost_bytes	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	lost_bytes	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	quality_gate_idle	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10
✅	quality_gate_logs	lost_bytes	10/10
✅	quality_gate_logs	memory_usage	10/10

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we classify a change in performance as a "regression" -- a change worth investigating further -- only if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".
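
The three criteria above can be sketched as a single predicate. This is a minimal illustration only: the `Experiment` struct, its field names, and `IsRegression` are assumptions made for this example, not the Regression Detector's actual code, and the sketch does not model the direction of the change.

```go
package main

import (
	"fmt"
	"math"
)

// Experiment mirrors one row of the change-detection table.
// The type and field names are hypothetical, chosen for this sketch.
type Experiment struct {
	Name         string
	DeltaMeanPct float64 // "Δ mean %"
	CILow        float64 // lower bound of "Δ mean % CI"
	CIHigh       float64 // upper bound of "Δ mean % CI"
	Erratic      bool    // set by the experiment's configuration
}

// effectSizeTolerance is the 5.00% threshold quoted in the explanation.
const effectSizeTolerance = 5.0

// IsRegression applies the three stated criteria:
// large enough effect, confidence interval excluding zero, not erratic.
func IsRegression(e Experiment) bool {
	bigEnough := math.Abs(e.DeltaMeanPct) >= effectSizeTolerance
	ciExcludesZero := e.CILow > 0 || e.CIHigh < 0
	return bigEnough && ciExcludesZero && !e.Erratic
}

func main() {
	// Values copied from two rows of the table above.
	exps := []Experiment{
		{"tcp_syslog_to_blackhole", 1.85, 1.79, 1.92, false},
		{"quality_gate_logs", 2.03, -1.08, 5.14, false},
	}
	for _, e := range exps {
		fmt.Printf("%s regression=%t\n", e.Name, IsRegression(e))
	}
}
```

Neither row is flagged: `tcp_syslog_to_blackhole` has a confidence interval that excludes zero, but its effect size is below the 5.00% tolerance, which matches the ➖ shown in the table.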

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check lost_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.

Stephanie0829 · 2025-02-11T21:32:55Z

pkg/process/procutil/gpu_detector.go

+
+func (g *GPUDetector) Run() {
+	// TODO: ensure this is correct
+	filter := workloadmeta.NewFilterBuilder().


Ensured the filter was configured correctly by matching it against the config for events sent by the collector to the WLM store: https://github.com/DataDog/datadog-agent/pull/32109/files#diff-739a3320df37987b0114fdf0c00e0776dc531aff6cf2160a5c00685218e943b6R105

wiyu · 2025-02-11T22:02:16Z

pkg/process/procutil/gpu_detector.go

+				}
+				g.mu.Lock()
+				// TODO: change into a map storing GPU info
+				g.DetectedGPU = true


Instead of using a mutex, you can simply use an atomic. You also need to guard the reads, perhaps through a getter.
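
A minimal sketch of the atomic-plus-getter suggestion, assuming Go 1.19 or later for `atomic.Bool`. The `GPUDetector` below is a stand-in that models only the flag; `markDetected` is a hypothetical helper name, and the rest of the PR's type is not reproduced.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// GPUDetector is a simplified stand-in for the type in the PR.
type GPUDetector struct {
	detectedGPU atomic.Bool // unexported, so all reads go through the getter
}

// markDetected records that a GPU was seen; safe to call from any goroutine.
func (g *GPUDetector) markDetected() {
	g.detectedGPU.Store(true)
}

// DetectedGPU is the getter that guards reads of the flag.
func (g *GPUDetector) DetectedGPU() bool {
	return g.detectedGPU.Load()
}

func main() {
	g := &GPUDetector{}

	// Several writers set the flag concurrently without a mutex.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.markDetected()
		}()
	}
	wg.Wait()

	fmt.Println(g.DetectedGPU())
}
```

Keeping the field unexported means callers cannot bypass the atomic load by reading the flag directly.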

Stephanie0829 · Feb 12, 2025

I'm going to transition from the boolean to a map (to hold pid : gpu_tags) and check whether the map is empty to tell whether a GPU was detected. That way I don't have to fetch each GPU's information separately later on, which would be duplicate logic (each event here includes GPU metadata).

For the map, instead of an atomic, I'll shard the map (synchronizing on the map key, using sync.Map, etc.) to guard the reads and writes (Go doesn't provide atomic operations for maps).
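
A rough sketch of the `sync.Map` variant described above, mapping a pid to its GPU tags and treating a non-empty map as "GPU detected". The `gpuTagStore` type, its method names, and the tag string are all made up for illustration; none of them come from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// gpuTagStore is a hypothetical pid -> GPU tags store backed by sync.Map.
type gpuTagStore struct {
	byPID sync.Map // keys are int32 pids, values are []string tags
}

// Set stores the tags for a pid, replacing any previous value.
func (s *gpuTagStore) Set(pid int32, tags []string) {
	s.byPID.Store(pid, tags)
}

// Unset removes a pid, for example when the process exits.
func (s *gpuTagStore) Unset(pid int32) {
	s.byPID.Delete(pid)
}

// Tags returns the tags for a pid and whether the pid was present.
func (s *gpuTagStore) Tags(pid int32) ([]string, bool) {
	v, ok := s.byPID.Load(pid)
	if !ok {
		return nil, false
	}
	return v.([]string), true
}

// Any reports whether at least one pid is stored.
// sync.Map has no length method, so Range stops at the first entry.
func (s *gpuTagStore) Any() bool {
	found := false
	s.byPID.Range(func(_, _ any) bool {
		found = true
		return false
	})
	return found
}

func main() {
	s := &gpuTagStore{}
	fmt.Println(s.Any())

	s.Set(42, []string{"gpu_vendor:nvidia"}) // illustrative tag value
	tags, ok := s.Tags(42)
	fmt.Println(s.Any(), ok, tags)
}
```

`sync.Map` has no `Len`, so the emptiness check relies on `Range` returning early.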

wiyu · Feb 12, 2025

We can just query WMS directly from the process checks. We don't need to store yet another map, as this data should be available from WMS.

wiyu · 2025-02-11T22:03:01Z

pkg/process/procutil/gpu_detector.go

+// This product includes software developed at Datadog (https://www.datadoghq.com/).
+// Copyright 2016-present Datadog, Inc.
+
+package procutil


We may want to put this in a different package or in a new package.

wiyu · 2025-02-11T22:03:46Z

pkg/process/checks/process.go

@@ -285,7 +295,13 @@ func (p *ProcessCheck) run(groupID int32, collectRealTime bool) (RunResult, erro
 	collectorProcHints := p.generateHints()
 	p.checkCount++

-	procsByCtr := fmtProcesses(p.scrubber, p.disallowList, procs, p.lastProcs, pidToCid, cpuTimes[0], p.lastCPUTime, p.lastRun, p.lookupIdProbe, p.ignoreZombieProcesses, p.serviceExtractor)
+	detectedGPU := false


You can just rely on the default (zero) value being false:
var detectedGPU bool
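
For reference, a tiny standalone illustration of the zero-value behavior this comment relies on. `zeroBool` is a helper name invented for this example.

```go
package main

import "fmt"

// zeroBool returns a bool that was declared but never assigned.
func zeroBool() bool {
	var detectedGPU bool // no initializer: Go sets it to the zero value, false
	return detectedGPU
}

func main() {
	explicit := false
	fmt.Println(zeroBool() == explicit)
}
```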

wiyu · 2025-02-11T22:04:57Z

pkg/process/checks/process.go

@@ -137,6 +139,10 @@ func (p *ProcessCheck) Init(syscfg *SysProbeConfig, info *HostInfo, oneShot bool
 	}
 	p.containerProvider = sharedContainerProvider

+	log.Info("Initializing gpu detector from process check")
+	p.gpuDetector = procutil.NewGPUDetector(p.wmeta)


For now this is OK, but we'll want to use components and FX to inject this instead of initializing it directly.
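
To illustrate the direction of this comment, here is a sketch of plain constructor injection, where the check receives its detector rather than building it in `Init`. It deliberately does not use the fx library: in an fx application a container would supply the argument (for example from a constructor registered with `fx.Provide`), but that wiring is not shown. The interface, `staticDetector`, and `NewProcessCheck` are hypothetical names, not the agent's actual components.

```go
package main

import "fmt"

// GPUDetector is a hypothetical interface for whatever reports GPU presence.
type GPUDetector interface {
	DetectedGPU() bool
}

// staticDetector is a fake implementation, handy for tests.
type staticDetector struct{ detected bool }

func (s staticDetector) DetectedGPU() bool { return s.detected }

// ProcessCheck is a stripped-down stand-in for the real check.
type ProcessCheck struct {
	gpuDetector GPUDetector
}

// NewProcessCheck takes its dependency as a parameter, so the caller
// (or a DI container) decides which implementation to supply.
func NewProcessCheck(d GPUDetector) *ProcessCheck {
	return &ProcessCheck{gpuDetector: d}
}

// HasGPU delegates to the injected detector.
func (p *ProcessCheck) HasGPU() bool {
	return p.gpuDetector.DetectedGPU()
}

func main() {
	check := NewProcessCheck(staticDetector{detected: true})
	fmt.Println(check.HasGPU())
}
```

One benefit of this shape is that tests can pass a fake detector without starting a workloadmeta store.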

Update agent-payload version

762bed5

github-actions bot added the team/container-intake (fka Processes) and medium review (PR review might take time) labels on Feb 11, 2025

Stephanie0829 changed the title ~~Stephanie/gpu tagging~~ [agent] Detect available GPU devices with WLM Feb 11, 2025

Stephanie0829 commented Feb 11, 2025

wiyu reviewed Feb 11, 2025

Stephanie0829 mentioned this pull request Feb 12, 2025

Bump agent-payload version to v5.0.144 #33951

Open

Add gpu tagging logic

9938b37

Stephanie0829 force-pushed the stephanie/gpu-tagging branch from 5f527e0 to 9938b37 Compare February 13, 2025 04:40

Remove duplicate tags

011c8a5
