feat(nvidia): configurable nvidia-smi binary, ibstat binary, infiniband class dir paths for mock testing #310

Merged
gyuho merged 4 commits into main from test-binary-for-ibstat-nvidia-smi on Jan 21, 2025


gyuho (Collaborator) commented Jan 20, 2025

/tmp/gpud scan \
--nvidia-smi-command 'curl -fsSL https://pkg.gpud.dev/test-nvidia-smi-h100-no-issue' \
--nvidia-smi-query-command 'curl -fsSL https://pkg.gpud.dev/test-nvidia-smi-query-rtx4090-no-issue' \
--ibstat-command 'curl -fsSL https://pkg.gpud.dev/test-ibstat-no-issue' \
--infiniband-class-directory '/tmp/gpud-test-ib-class'

/tmp/gpud scan \
--nvidia-smi-command 'curl -fsSL https://pkg.gpud.dev/test-nvidia-smi-rtx4090-error' \
--nvidia-smi-query-command 'curl -fsSL https://pkg.gpud.dev/test-nvidia-smi-query-hw-slowdown' \
--ibstat-command 'curl -fsSL https://pkg.gpud.dev/test-ibstat-some-down' \
--infiniband-class-directory '/tmp/gpud-test-ib-class'
NVML GPU-e610277c-a7ac-52b5-0d0a-326b15d9ae86

✘ NVML GSP firmware mode is disabled (supported: false)
✔ NVML persistence mode is enabled
✔ NVML found no hw slowdown error
✔ NVML found no ecc volatile uncorrected error
✔ NVML found no running process
✘ scanned nvidia-smi -- found 1 error(s)
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                    0 |
|ERR!   38C    P5    49W / 450W |   2021MiB / 23028MiB |      0%   E. Process |
✘ scanned nvidia-smi -- found 8 hardware slowdown error(s)
GPU 00000000:19:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:3B:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:4C:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:5D:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:9B:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:BB:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:CB:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
GPU 00000000:DB:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active
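
The two failing checks in the output above can be approximated with plain string matching. The following is a minimal sketch, not gpud's actual parser: the function names `filterLines`, `findErrLines`, and `findHWSlowdownLines` are hypothetical, and the matching rules (an `ERR!` marker in the summary table, a `HWSlowdown` reason ending in `Active` in the query output) are assumptions inferred only from the sample lines shown in this comment.

```go
package main

import (
	"fmt"
	"strings"
)

// filterLines returns the lines of output for which keep reports true.
func filterLines(output string, keep func(line string) bool) []string {
	var found []string
	for _, line := range strings.Split(output, "\n") {
		if keep(line) {
			found = append(found, line)
		}
	}
	return found
}

// findErrLines returns summary-table lines that contain the "ERR!" marker.
// Assumption: the marker alone is enough to flag a line as an error.
func findErrLines(output string) []string {
	return filterLines(output, func(line string) bool {
		return strings.Contains(line, "ERR!")
	})
}

// findHWSlowdownLines returns lines that name a HWSlowdown reason and end in "Active".
// Assumption: this mirrors the "GPU <bus id>: ...HWSlowdown...: Active" shape above.
func findHWSlowdownLines(output string) []string {
	return filterLines(output, func(line string) bool {
		return strings.Contains(line, "HWSlowdown") &&
			strings.HasSuffix(strings.TrimSpace(line), "Active")
	})
}

func main() {
	// Sample lines copied from the scan output in this comment.
	smi := "|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                    0 |\n" +
		"|ERR!   38C    P5    49W / 450W |   2021MiB / 23028MiB |      0%   E. Process |"
	query := "GPU 00000000:19:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active\n" +
		"GPU 00000000:3B:00.0: ClockEventReasons.HWSlowdown.PowerBrakeSlowdown Active"

	fmt.Printf("found %d error line(s)\n", len(findErrLines(smi)))
	fmt.Printf("found %d hardware slowdown line(s)\n", len(findHWSlowdownLines(query)))
}
```

Keeping the matchers as pure functions over a string is what makes the mock approach workable: the same code runs whether the text came from a real `nvidia-smi` binary or from a canned file fetched by `curl`.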

@gyuho gyuho added the wip - do not merge working in progress label Jan 20, 2025
@gyuho gyuho self-assigned this Jan 20, 2025
@gyuho gyuho changed the title feat(nvidia): configurable nvidia-smi, ibstat binary paths feat(nvidia): configurable nvidia-smi, ibstat binary paths for mock testing Jan 20, 2025
@gyuho gyuho changed the title feat(nvidia): configurable nvidia-smi, ibstat binary paths for mock testing feat(nvidia): configurable nvidia-smi binary, ibstat binary, infiniband class dir paths for mock testing Jan 20, 2025
@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch from 868722c to ef503cd Compare January 20, 2025 13:20
@gyuho gyuho removed the wip - do not merge working in progress label Jan 20, 2025
@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch 3 times, most recently from e077620 to c123863 Compare January 20, 2025 13:46
@leptonai leptonai deleted a comment from codecov-commenter Jan 20, 2025
@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch from c123863 to 37b3f64 Compare January 20, 2025 13:52
codecov-commenter commented Jan 20, 2025

Codecov Report

Attention: Patch coverage is 10.62124% with 446 lines in your changes missing coverage. Please review.

Project coverage is 21.01%. Comparing base (8e9d625) to head (3d02e9a).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| cmd/gpud/command/command.go | 0.00% | 51 Missing ⚠️ |
| internal/server/server.go | 0.00% | 48 Missing ⚠️ |
| components/accelerator/nvidia/query/options.go | 0.00% | 45 Missing ⚠️ |
| components/accelerator/nvidia/query/query.go | 0.00% | 44 Missing ⚠️ |
| components/diagnose/options.go | 42.85% | 16 Missing ⚠️ |
| config/op_options.go | 0.00% | 16 Missing ⚠️ |
| ...ents/accelerator/nvidia/query/infiniband/ibstat.go | 59.25% | 8 Missing and 3 partials ⚠️ |
| components/diagnose/scan.go | 0.00% | 10 Missing ⚠️ |
| ...omponents/accelerator/nvidia/bad-envs/component.go | 0.00% | 9 Missing ⚠️ |
| ...onents/accelerator/nvidia/clock-speed/component.go | 0.00% | 9 Missing ⚠️ |
... and 30 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #310      +/-   ##
==========================================
+ Coverage   20.71%   21.01%   +0.29%     
==========================================
  Files         318      300      -18     
  Lines       26900    26926      +26     
==========================================
+ Hits         5572     5658      +86     
+ Misses      20677    20604      -73     
- Partials      651      664      +13     

@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch 3 times, most recently from 6da0c79 to 783c5be Compare January 20, 2025 14:21
NvidiaSMICommand string `json:"nvidia_smi_command"`
NvidiaSMIQueryCommand string `json:"nvidia_smi_query_command"`
IbstatCommand string `json:"ibstat_command"`
InfinibandClassDirectory string `json:"infiniband_class_directory"`
We are passing these everywhere. Can we use an embedded struct?

gyuho (Collaborator, Author) replied:

@ccding Addressed, PTAL. Thanks.

@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch 2 times, most recently from a3a0e61 to ec3b046 Compare January 21, 2025 10:10
@gyuho gyuho added this to the v0.4.0 milestone Jan 21, 2025
@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch 2 times, most recently from b05dcda to 61f548c Compare January 21, 2025 12:47
…nd class dir paths for mock testing

Signed-off-by: Gyuho Lee <[email protected]>
@gyuho gyuho force-pushed the test-binary-for-ibstat-nvidia-smi branch from 61f548c to 634c0b9 Compare January 21, 2025 13:10
Signed-off-by: Gyuho Lee <[email protected]>
gyuho (Collaborator, Author) commented Jan 21, 2025

Will merge after end-to-end tests

gyuho added 2 commits January 21, 2025 21:56
Signed-off-by: Gyuho Lee <[email protected]>
Signed-off-by: Gyuho Lee <[email protected]>
gyuho (Collaborator, Author) commented Jan 21, 2025

Screenshot 2025-01-21 at 10 01 39 PM

Tested with mock.

@gyuho gyuho merged commit 821faba into main Jan 21, 2025
4 checks passed
@gyuho gyuho deleted the test-binary-for-ibstat-nvidia-smi branch January 21, 2025 14:28
3 participants