I use dynolog to monitor GPU metrics; it relies on dcgmProfWatchFields and dcgmWatchFields. I found the following:
After Xid 31 happens, gpu_device_utilization and gpu_memory_utilization return DCGM_FP64_BLANK.
maxKeepAge and maxKeepSamples are both set to 2.
Even after the Xid disappears, the DCGM API still returns DCGM_FP64_BLANK until I restart dynolog or call dcgmProfResume manually.
dynolog has no caching mechanism, so I think this is the behaviour of DCGM itself.
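Until the root cause is clear, a client-side workaround could look something like the sketch below: count consecutive blank samples and trigger a resume once a threshold is reached. This is only an illustration of the idea, not how dynolog or DCGM work today. The `BlankWatchdog` class, its `threshold` parameter, and the `resume` callback are hypothetical names; the callback is where a real monitor would call dcgmProfResume. The sentinel value is assumed to match DCGM_FP64_BLANK in dcgm_structs.h and should be checked against the installed headers.

```python
from typing import Callable

# Assumption: mirrors DCGM_FP64_BLANK from dcgm_structs.h (2**47).
# Verify against the headers shipped with your DCGM version.
DCGM_FP64_BLANK = 140737488355328.0


def is_blank(value: float) -> bool:
    """Treat any FP64 sample at or above the sentinel as 'no data'."""
    return value >= DCGM_FP64_BLANK


class BlankWatchdog:
    """Hypothetical helper: calls `resume` after `threshold` consecutive blanks."""

    def __init__(self, resume: Callable[[], None], threshold: int = 3) -> None:
        self._resume = resume          # e.g. a wrapper around dcgmProfResume
        self._threshold = threshold
        self._blank_streak = 0

    def observe(self, value: float) -> bool:
        """Feed one sample; return True if a resume was triggered."""
        if not is_blank(value):
            self._blank_streak = 0     # a valid sample resets the streak
            return False
        self._blank_streak += 1
        if self._blank_streak >= self._threshold:
            self._resume()
            self._blank_streak = 0     # start counting again after resuming
            return True
        return False
```

The threshold keeps a single blank sample (which can also occur during normal startup) from triggering a resume on every poll.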
ENV:
NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2
nv-hostengine --version
Version : 3.3.8
Build ID : 43
Build Date : 2024-09-03
Build Type : Release
Commit ID : be8d66b4318e1d5d6e31b67759dc924d1bc18681
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : c32a73e1865ecdfa6990a80f79a6dea9
The device I am using is a Tesla T4. Running dcgmi dmon -e 203,204,1001,1005 fails with an error like:
Error setting watches. Result: -37: The third-party Profiling module returned an unrecoverable error
The cause could be the custom GPU virtualization techniques I have employed; the application may have hit memory-related errors during access. I am not sure, but the Xid always appears when certain applications are using the GPU.