GPUd release notes (2025-02-24T12:36:24Z)
Welcome to this new release!
What's Changed
- feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
- fix(os): log which command fails on context timeout in get by @gyuho in #398
- project(*): move non-component code to /pkg by @gyuho in #368
- feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
- nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
- feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
- nits(components): close dmesg watcher without sync.Once by @gyuho in #404
- feat(dmesg): log line processor for dmesg by @gyuho in #406
- fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
- fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
- fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
- feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
- test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
- fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
- test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
- feat(systemd): do not return error on uptime check failures by @gyuho in #427
- test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
- go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
- feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
- feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
- feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
- feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
- test(dmesg): output more details for flaky tests by @gyuho in #430
- feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
- feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
- feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
- feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
- feat(dmesg, log): support match func in log scanner by @gyuho in #436
- feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
- feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
- feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
- feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
- cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
- feat(xid/sxid): optimize state reason and error by @cardyok in #438
- charts/gpud: remove the chart (it does not follow the "best security practice"; we will work on a newer version) by @gyuho in #444
- feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
- feat(kubelet): rename component name by @cardyok in #445
- feat(session): do parallel component collection by @cardyok in #441
- fix(ibstat): add healthy state by @cardyok in #446
- feat(install): support installing a specific version by @cardyok in #447
- feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
- feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
- feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
- tests(error/xid): add more nvrm dmesg regex unit tests for Xid 119 by @gyuho in #452
- feat(pkg/process): support custom bash script file name and directory, change default tmp file name pattern by @gyuho in #451
- fix(fd): take file_max into limit setting by @cardyok in #453
- fix(fd): set healthy to false for degraded state by @cardyok in #454
- fix(fd): fix unit test for degraded condition by @cardyok in #455
Full Changelog: v0.4.3...v0.4.4