gpud-v0.4.4

@github-actions github-actions released this 24 Feb 12:35
218e1f6

GPUd release notes (2025-02-24T12:36:24Z)

Welcome to this new release!

What's Changed

  • feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
  • fix(os): log which command fails when context timeout in get by @gyuho in #398
  • project(*): move non-component code to /pkg by @gyuho in #368
  • feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
  • nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
  • feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
  • nits(components): close dmesg watcher without sync.Once by @gyuho in #404
  • feat(dmesg): log line processor for dmesg by @gyuho in #406
  • fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
  • fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
  • fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
  • feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
  • test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
  • fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
  • test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
  • feat(systemd): do not return error on uptime check failures by @gyuho in #427
  • test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
  • go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
  • feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
  • feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
  • feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
  • feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
  • test(dmesg): output more details for flaky tests by @gyuho in #430
  • feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
  • feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
  • feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
  • feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
  • feat(dmesg, log): support match func in log scanner by @gyuho in #436
  • feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
  • feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
  • feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
  • feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
  • cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
  • feat(xid/sxid): optimize state reason and error by @cardyok in #438
  • charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
  • feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
  • feat(kubelet): rename component name by @cardyok in #445
  • feat(session): do parallel component collection by @cardyok in #441
  • fix(ibstat): add healthy state by @cardyok in #446
  • feat(install): support install specific version by @cardyok in #447
  • feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
  • feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
  • feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
  • tests(error/xid): add more nvrm dmesg regex unit tests for Xid 119 by @gyuho in #452
  • feat(pkg/process): support custom bash script file name and directory, change default tmp file name pattern by @gyuho in #451
  • fix(fd): take file_max into limit setting by @cardyok in #453
  • fix(fd): make degraded state healthy false by @cardyok in #454
  • fix(fd): fix ut for degraded condition by @cardyok in #455
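
Several entries above concern dmesg log deduplication (#399, "dedup logs by seconds if same content"). As a rough illustration of that idea only, here is a minimal Go sketch that drops a line when an earlier line with identical content fell in the same Unix second. The names `logLine`, `newLogLine`, and `dedupBySecond` are hypothetical and invented for this sketch; they are not taken from the GPUd code base, and the actual implementation in #399 may differ (for example, it may work on a stream rather than a slice).

```go
package main

import "time"

// logLine is a hypothetical type for this sketch, not GPUd's actual type.
type logLine struct {
	Time    time.Time
	Content string
}

// newLogLine is a small hypothetical helper that builds a logLine from a
// Unix timestamp (seconds and nanoseconds) and the message content.
func newLogLine(sec, nsec int64, content string) logLine {
	return logLine{Time: time.Unix(sec, nsec), Content: content}
}

// dedupBySecond returns the input lines with duplicates removed, where two
// lines count as duplicates if they have the same content and the same
// Unix second. The first occurrence is kept and input order is preserved.
func dedupBySecond(lines []logLine) []logLine {
	type key struct {
		sec     int64
		content string
	}
	seen := make(map[key]struct{}, len(lines))
	out := make([]logLine, 0, len(lines))
	for _, l := range lines {
		k := key{sec: l.Time.Unix(), content: l.Content}
		if _, dup := seen[k]; dup {
			continue // same content already seen within this second
		}
		seen[k] = struct{}{}
		out = append(out, l)
	}
	return out
}
```

The block has no `main` function, so call `dedupBySecond` from your own `main` or from a test. In this sketch, identical content that recurs in a later second is kept, so repeated kernel messages still show up once per second instead of being suppressed entirely.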

Full Changelog: v0.4.3...v0.4.4