Releases: leptonai/gpud

gpud-v0.4.4

24 Feb 12:35
218e1f6

GPUd release notes (2025-02-24T12:36:24Z)

Welcome to this new release!

What's Changed

  • feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
  • fix(os): log which command fails when context timeout in get by @gyuho in #398
  • project(*): move non-component code to /pkg by @gyuho in #368
  • feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
  • nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
  • feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
  • nits(components): close dmesg watcher without sync.Once by @gyuho in #404
  • feat(dmesg): log line processor for dmesg by @gyuho in #406
  • fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
  • fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
  • fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
  • feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
  • test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
  • fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
  • test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
  • feat(systemd): do not return error on uptime check failures by @gyuho in #427
  • test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
  • go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
  • feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
  • feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
  • feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
  • feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
  • test(dmesg): output more details for flaky tests by @gyuho in #430
  • feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
  • feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
  • feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
  • feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
  • feat(dmesg, log): support match func in log scanner by @gyuho in #436
  • feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
  • feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
  • feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
  • feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
  • cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
  • feat(xid/sxid): optimize state reason and error by @cardyok in #438
  • charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
  • feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
  • feat(kubelet): rename component name by @cardyok in #445
  • feat(session): do parallel component collection by @cardyok in #441
  • fix(ibstat): add healthy state by @cardyok in #446
  • feat(install): support install specific version by @cardyok in #447
  • feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
  • feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
  • feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
  • tests(error/xid): add more nvrm dmesg regex unit tests for Xid 119 by @gyuho in #452
  • feat(pkg/process): support custom bash script file name and directory, change default tmp file name pattern by @gyuho in #451
  • fix(fd): take file_max into limit setting by @cardyok in #453
  • fix(fd): make degraded state healthy false by @cardyok in #454
  • fix(fd): fix ut for degraded condition by @cardyok in #455

Full Changelog: v0.4.3...v0.4.4

gpud-v0.4.4-rc-3

24 Feb 03:30
0ac5777

GPUd release notes (2025-02-24T03:50:05Z)

Welcome to this new release!

What's Changed

  • feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
  • feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450

Full Changelog: v0.4.4-rc-2...v0.4.4-rc-3

gpud-v0.4.4-rc-2

22 Feb 02:13
6762927

GPUd release notes (2025-02-22T02:14:23Z)

Welcome to this new release!

What's Changed

  • feat(install): support install specific version by @cardyok in #447
  • feat(infiniband): use events store to track historical ibstat status by @gyuho in #448

Full Changelog: v0.4.4-rc-1...v0.4.4-rc-2

gpud-v0.4.4-rc-1

21 Feb 05:16
c32af55

GPUd release notes (2025-02-21T05:17:39Z)

Welcome to this new release!

What's Changed

Full Changelog: v0.4.4-rc-0...v0.4.4-rc-1

gpud-v0.4.4-rc-0

20 Feb 14:58
aa9e17e

GPUd release notes (2025-02-20T14:59:05Z)

Welcome to this new release!

What's Changed

  • feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
  • fix(os): log which command fails when context timeout in get by @gyuho in #398
  • project(*): move non-component code to /pkg by @gyuho in #368
  • feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
  • nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
  • feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
  • nits(components): close dmesg watcher without sync.Once by @gyuho in #404
  • feat(dmesg): log line processor for dmesg by @gyuho in #406
  • fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
  • fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
  • fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
  • feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
  • test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
  • fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
  • test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
  • feat(systemd): do not return error on uptime check failures by @gyuho in #427
  • test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
  • go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
  • feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
  • feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
  • feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
  • feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
  • test(dmesg): output more details for flaky tests by @gyuho in #430
  • feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
  • feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
  • feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
  • feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
  • feat(dmesg, log): support match func in log scanner by @gyuho in #436
  • feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
  • feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
  • feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
  • feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
  • cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
  • feat(xid/sxid): optimize state reason and error by @cardyok in #438
  • charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
  • feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
  • feat(kubelet): rename component name by @cardyok in #445
  • feat(session): do parallel component collection by @cardyok in #441

Full Changelog: v0.4.3...v0.4.4-rc-0

gpud-v0.4.3

12 Feb 08:40
57fa27b

GPUd release notes (2025-02-12T08:40:40Z)

Welcome to this new release!

What's Changed

  • test(db/events): make retention purge unit tests less flaky by @gyuho in #389
  • test(nvidia/query/nvml): add clock speed, errors unit tests using mock/nvml by @gyuho in #360
  • test(errdefs): use more helpers, increase test coverage by @gyuho in #358
  • tests(internal/session): add more unit tests with smaller functions by @gyuho in #359
  • debug(infiniband): output raw ibstat when issue found by @gyuho in #392
  • fix(process): do not exit read before reading all buffer, support larger initial buffer size for scanner by @gyuho in #393
  • fix(xid/sxid): rely on last reboot first by @cardyok in #373
  • test(rootkeys): add unit tests by @gyuho in #357
  • test(client/v1): increase unit test coverage by @gyuho in #353
  • feat(ib, disk): use combined output for ibstat, lsblk by @gyuho in #395

Full Changelog: v0.4.2...v0.4.3

gpud-v0.4.2

10 Feb 18:34
e0b05fa

GPUd release notes (2025-02-10T18:35:30Z)

Welcome to this new release!

What's Changed

  • fix(xid/sxid): only consider 3 day events and do not rely on purge by @cardyok in #390

Full Changelog: v0.4.1...v0.4.2

gpud-v0.4.1

10 Feb 11:43
f909795

GPUd release notes (2025-02-10T11:57:30Z)

Welcome to this new release!

What's Changed

  • fix(nvml): parse driver version to handle the one without patch version by @gyuho in #374
  • test(pkg/process): add more unit test cases by @gyuho in #356
  • test(manager/packages): add unit test cases by @gyuho in #362
  • test(pkg/systemd): add more unit tests to increase coverage by @gyuho in #355
  • feat(process): support "ExitCode" method by @gyuho in #376
  • fix(infiniband): surface ibstat parse failures, remove "down" match fallback for healthiness by @gyuho in #377
  • feat(infiniband): directly call ibstat for ib states by @gyuho in #381
  • fix(infiniband): return explicit errors for command not found by @gyuho in #384
  • fix(infiniband): unit tests with specified ib threshold (do not use global threshold) by @gyuho in #386
  • feat(nvidia): remove ibstat call from shared poller by @gyuho in #385
  • feat(gpud): "scan --check-ib", move port/rates to "infiniband" package by @gyuho in #383

Full Changelog: v0.4.0...v0.4.1

gpud-v0.4.0

07 Feb 10:10
ab3785a

GPUd release notes (2025-02-07T10:34:19Z)

Welcome to this new release!

What's Changed

  • feat(server): vacuum sqlite only once a week by @gyuho in #301
  • feat(nvidia/sxid-xid-state): dedup in memory by minute level to minimize contending db inserts by @gyuho in #302
  • chore(build): support arm64 build for linux by @photoszzt in #303
  • feat(info): track gpud process self resource usage (file descriptors, RSS, start time, db size) by @gyuho in #296
  • test(nvidia/xid): add more unit tests for extracting device uuid by @gyuho in #305
  • fix(*): add missing close on process that runs commands as bash, rename abort to close by @gyuho in #294
  • fix(join): use local context for join bash by @cardyok in #309
  • feat(operation): add codecov coverage workflow by @leoshi01 in #311
  • nit(gitignore): add .DS_Store, remove test binary by @gyuho in #316
  • fix(diagnose): fix "gpud scan" (add missing temp db creation) by @gyuho in #314
  • chore(ci): merge codecov workflow files by @leoshi01 in #313
  • feat(sqlite): disable vacuum for now, bump up retention period, track latency, qps, expose via /info component by @gyuho in #307
  • feat(pkg/dmesg): simpler watcher by @gyuho in #319
  • fix(dmesg): add fallback "dmesg" command if lower-case "-w" fails by @gyuho in #320
  • feat(nvidia): configurable nvidia-smi binary, ibstat binary, infiniband class dir paths for mock testing by @gyuho in #310
  • fix(nvidia/hw-slowdown): evaluate state based on per-minute frequency of clock events over the last 10 minutes by @gyuho in #304
  • fix(tailscale): remove unused version checks by @gyuho in #299
  • feat(nvidia): move event type to common, define xid/sxid event level in the list by @gyuho in #297
  • feat(nvidia/ib): remove "gpud run --expected-port-states-nvidia-infiniband" flag, only keep the default detection for backward compatibility by @gyuho in #308
  • feat(*): remove redundant utc call by @gyuho in #325
  • chore(e2e): refactor E2E and mock nvml/nvidia-smi/lspci by @FillZpp in #326
  • feat(os/component): indicate os component is healthy via /states by @gyuho in #329
  • feat(hw_slowdown): update suggested action by @cardyok in #330
  • fix(hw_slowdown): fix description by @cardyok in #331
  • feat(internal/session): set ib ports/rates dynamically by @gyuho in #328
  • feat(components/db): add common events store (similar to components.Event) by @gyuho in #332
  • feat(os/events): use read-only for reboot events, use common db pkg by @gyuho in #322
  • feat(nvidia/xid,sxid/dmesg): add dmesg log line matcher, xid/sxid extractor by @gyuho in #333
  • feat(components/memory): use common db + dmesg poller for events, move out of "dmesg" component by @gyuho in #324
  • feat(nvidia/infiniband): log dynamic ports/rates config updates by @gyuho in #335
  • feat(xid): simplify xid component by @cardyok in #321
  • fix(diagnose): scan to use "dmesg" with no time limit by @gyuho in #338
  • feat(fd): improve reasons, human-readable by @gyuho in #344
  • feat(nvidia): set shared nvidia poller once in server.go by @gyuho in #345
  • test(client/v1, nvidia/query): increase unit test coverage by @gyuho in #347
  • feat(nvidia/query): pass events store to shared poller by @gyuho in #346
  • feat(sxid): simplify sxid component by @cardyok in #337
  • feat(pci): use common db pkg for events by @gyuho in #340
  • feat(fuse): use common events store pkg by @gyuho in #339
  • fix(info): correctly compute GPUd sqlite metrics delta by @gyuho in #349
  • test(config): increase unit test coverage by @gyuho in #354
  • test(pkg/sqlite): increase test coverage by @gyuho in #352
  • feat(nvidia/xid): use common events DB for NVML-based xid watcher, disable NVML Xid event watcher in favor of "dmesg" watcher, deprecate redundant "error-xid-sxid" component by @gyuho in #343
  • feat(nvidia/hw-slowdown): use common db pkg for events by @gyuho in #342
  • fix(components/state): clean up update queries, increase unit test coverage, read last API version using SQLite rowid column by @gyuho in #361
  • feat(file-descriptor): clean up fd check, only keep proc-based fd checking by @cardyok in #364
  • fix(nvidia/remapped-rows): do not check row remapping for 4090 and other unsupported GPUs by @gyuho in #351
  • releaser(ci, go): use go 1.23.6, disable arm releaser for now by @gyuho in #366
  • fix(build): fix arm64 build by @photoszzt in #367
  • fix(infiniband): make default value -1, allow future override by @cardyok in #369

New Contributors

Full Changelog: v0.3.9...v0.4.0

gpud-v0.3.9

10 Jan 14:51
b794199

GPUd release notes (2025-01-10T14:50:02Z)

Welcome to this new release!

What's Changed

  • fix(ci): bump up linux header deps by @gyuho in #292
  • fix(nvml): handle "not supported" error to not fail-fast for NVML get calls by @gyuho in #291

Full Changelog: v0.3.8...v0.3.9