DCGM keeps returning DCGM_FP64_BLANK for some fields after an XID occurs. #202

Open
CormickKneey opened this issue Nov 21, 2024 · 4 comments

@CormickKneey

I use dynolog to monitor GPU metrics; it relies on dcgmProfWatchFields and dcgmWatchFields. I found the following:

After Xid 31 occurs, gpu_device_utilization and gpu_memory_utilization return DCGM_FP64_BLANK.

maxKeepAge and maxKeepSamples are both set to 2.

Even after the Xid clears, the DCGM API keeps returning DCGM_FP64_BLANK until I restart dynolog or call dcgmProfResume manually.

dynolog has no caching mechanism, so I think this behavior comes from DCGM.
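For anyone hitting the same symptom, here is a minimal sketch of the manual workaround described above: count consecutive blank samples and trigger a recovery action once a threshold is reached. It is not dynolog or DCGM code. `BlankWatchdog`, `isBlankFp64`, and `kFp64Blank` are hypothetical names; the constant is assumed to mirror `DCGM_FP64_BLANK` from `dcgm_structs.h` and should be checked against your headers. The `recover` callback is a placeholder where a real client might call `dcgmProfResume` on its DCGM handle.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Assumption: mirrors DCGM_FP64_BLANK in dcgm_structs.h (2^47).
// In real code, include the DCGM header and use its macro instead.
constexpr double kFp64Blank = 140737488355328.0;

// Assumed to match the DCGM_FP64_IS_BLANK convention: every value at or
// above the blank sentinel (blank, not found, not supported, ...) is blank.
inline bool isBlankFp64(double value) {
  return value >= kFp64Blank;
}

// Hypothetical helper: invokes `recover` after `threshold` blank samples in a row.
class BlankWatchdog {
 public:
  BlankWatchdog(std::size_t threshold, std::function<void()> recover)
      : threshold_(threshold), recover_(std::move(recover)) {}

  // Feed one sample. Returns true if this sample triggered the recovery callback.
  bool observe(double value) {
    if (!isBlankFp64(value)) {
      consecutiveBlanks_ = 0;  // a valid sample resets the streak
      return false;
    }
    if (++consecutiveBlanks_ < threshold_) {
      return false;
    }
    consecutiveBlanks_ = 0;  // start a fresh streak after recovering
    recover_();              // placeholder: e.g. resume profiling here
    return true;
  }

 private:
  std::size_t threshold_;
  std::function<void()> recover_;
  std::size_t consecutiveBlanks_ = 0;
};
```

A monitoring loop would call `observe()` once per sampled value of an affected field. The threshold avoids reacting to a single blank sample, which can also appear briefly while watches are still warming up.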

ENV:

NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2
nv-hostengine --version
Version : 3.3.8
Build ID : 43
Build Date : 2024-09-03
Build Type : Release
Commit ID : be8d66b4318e1d5d6e31b67759dc924d1bc18681
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c32a73e1865ecdfa6990a80f79a6dea9

@nikkon-dev
Collaborator

@CormickKneey,

Could you provide information about the hardware you are using?

Also, could you verify that the dcgmi dmon -e 203,204,1001,1005 behaves similarly?

It would be helpful to understand which operations caused XID 31 if you could provide an example of the CUDA kernel.

@CormickKneey
Author

Hi @nikkon-dev, thanks for your response.

The device I am using is a Tesla T4.

Running dcgmi dmon -e 203,204,1001,1005 fails with an error like:

Error setting watches. Result: -37: The third-party Profiling module returned an unrecoverable error

The cause could be the custom GPU virtualization techniques I am using; the application may have hit memory-related errors during access. I am not sure, but the XID always appears when certain specific applications are using the GPU.

@nikkon-dev
Collaborator

@CormickKneey,

Are you using the official DCGM build or a custom OSS build from GitHub?

@CormickKneey
Author

@nikkon-dev The official one.
