Releases: leptonai/gpud
gpud-v0.4.4
GPUd release notes (2025-02-24T12:36:24Z)
Welcome to this new release!
What's Changed
- feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
- fix(os): log which command fails when context timeout in get by @gyuho in #398
- project(*): move non-component code to /pkg by @gyuho in #368
- feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
- nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
- feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
- nits(components): close dmesg watcher without sync.Once by @gyuho in #404
- feat(dmesg): log line processor for dmesg by @gyuho in #406
- fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
- fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
- fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
- feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
- test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
- fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
- test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
- feat(systemd): do not return error on uptime check failures by @gyuho in #427
- test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
- go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
- feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
- feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
- feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
- feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
- test(dmesg): output more details for flaky tests by @gyuho in #430
- feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
- feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
- feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
- feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
- feat(dmesg, log): support match func in log scanner by @gyuho in #436
- feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
- feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
- feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
- feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
- cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
- feat(xid/sxid): optimize state reason and error by @cardyok in #438
- charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
- feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
- feat(kubelet): rename component name by @cardyok in #445
- feat(session): do parallel component collection by @cardyok in #441
- fix(ibstat): add healthy state by @cardyok in #446
- feat(install): support install specific version by @cardyok in #447
- feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
- feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
- feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
- tests(error/xid): add more nvrm dmesg regex unit tests for Xid 119 by @gyuho in #452
- feat(pkg/process): support custom bash script file name and directory, change default tmp file name pattern by @gyuho in #451
- fix(fd): take file_max into limit setting by @cardyok in #453
- fix(fd): make degraded state healthy false by @cardyok in #454
- fix(fd): fix ut for degraded condition by @cardyok in #455
Full Changelog: v0.4.3...v0.4.4
gpud-v0.4.4-rc-3
GPUd release notes (2025-02-24T03:50:05Z)
What's Changed
- feat(nvidia-query/server): remove redundant channel select, log more by @gyuho in #449
- feat(nvidia-query): use "ERROR_ARGUMENT_VERSION_MISMATCH" to decide whether GPM is supported or not by @gyuho in #450
Full Changelog: v0.4.4-rc-2...v0.4.4-rc-3
gpud-v0.4.4-rc-2
GPUd release notes (2025-02-22T02:14:23Z)
What's Changed
- feat(install): support install specific version by @cardyok in #447
- feat(infiniband): use events store to track historical ibstat status by @gyuho in #448
Full Changelog: v0.4.4-rc-1...v0.4.4-rc-2
gpud-v0.4.4-rc-1
GPUd release notes (2025-02-21T05:17:39Z)
What's Changed
Full Changelog: v0.4.4-rc-0...v0.4.4-rc-1
gpud-v0.4.4-rc-0
GPUd release notes (2025-02-20T14:59:05Z)
What's Changed
- feat(dmesg/watcher): set default stream buffer limit to 16KB by @gyuho in #397
- fix(os): log which command fails when context timeout in get by @gyuho in #398
- project(*): move non-component code to /pkg by @gyuho in #368
- feat(dmesg): dedup logs by seconds if same content, bump up log channel buffer by @gyuho in #399
- nits(components/memory): set missing message field in events, update "New" by @gyuho in #402
- feat(nvidia/nccl): use new dmesg poller (simplified) by @gyuho in #401
- nits(components): close dmesg watcher without sync.Once by @gyuho in #404
- feat(dmesg): log line processor for dmesg by @gyuho in #406
- fix(infiniband): do not call "ibstat" on empty thresholds by @gyuho in #414
- fix(library): locate "libnvidia-ml.so.1" for later nvidia drivers (>=565.57.01) for library checks, error if not found by @gyuho in #415
- fix(infiniband): set /states unhealthy when thresholds set but ibstat not found by @gyuho in #420
- feat(network/edge/derpmap): add Ashburn (Virginia), Nuremberg (Germany) by @gyuho in #421
- test(pkg/dmesg): make "TestDedupLogLines" deterministic in slow CI by @gyuho in #422
- fix(gpud-state): do not rollback if apiversion update committed successfully by @gyuho in #426
- test(dmesg): make peer mem logs watch tests less flaky by @gyuho in #417
- feat(systemd): do not return error on uptime check failures by @gyuho in #427
- test(pkg/dmesg): make dedup log line tests more deterministic with waits by @gyuho in #425
- go module: upgrade nvlib, gopsutil, add missing go-cache by @gyuho in #429
- feat(dmesg): set default dmesg watcher in log line processor if not specified by @gyuho in #428
- feat(memory): use dmesg log line processor (shared code) by @gyuho in #418
- feat(cpu): use new dmesg poller (simplified) by @gyuho in #405
- feat(nvidia/nccl): use new dmesg log line processor (simplified) by @gyuho in #419
- test(dmesg): output more details for flaky tests by @gyuho in #430
- feat(components/fd): use new dmesg poller (simplified) by @gyuho in #400
- feat(nvidia/peermem): use new dmesg poller (simplified) by @gyuho in #403
- feat(gpud update): trim space when fetching version from "_latest.txt" by @gyuho in #424
- feat(accelerator/nvidia): move nvidia-query dmesg helper for xid/sxid by @gyuho in #433
- feat(dmesg, log): support match func in log scanner by @gyuho in #436
- feat(dmesg): remove component, use match func diagnose/scan by @gyuho in #432
- feat(nvidia): only use NVML for GPU memory usage tracking by @gyuho in #416
- feat(nvidia): remove nvidia-smi fallback for row remapping issues by @gyuho in #407
- feat(disk/lsblk): more debug info when "lsblk --version" parse fails by @gyuho in #443
- cmd(gpud): support "run --log-file" with built-in log rotation by @gyuho in #437
- feat(xid/sxid): optimize state reason and error by @cardyok in #438
- charts/gpud: remove (that does not follow the "best security practice" -- we will work on newer version) by @gyuho in #444
- feat(reboot): introduce initializing state for critical components on reboot by @cardyok in #439
- feat(kubelet): rename component name by @cardyok in #445
- feat(session): do parallel component collection by @cardyok in #441
Full Changelog: v0.4.3...v0.4.4-rc-0
gpud-v0.4.3
GPUd release notes (2025-02-12T08:40:40Z)
What's Changed
- test(db/events): make retention purge unit tests less flaky by @gyuho in #389
- test(nvidia/query/nvml): add clock speed, errors unit tests using mock/nvml by @gyuho in #360
- test(errdefs): use more helpers, increase test coverage by @gyuho in #358
- tests(internal/session): add more unit tests with smaller functions by @gyuho in #359
- debug(infiniband): output raw ibstat when issue found by @gyuho in #392
- fix(process): do not exit read before reading all buffer, support larger initial buffer size for scanner by @gyuho in #393
- fix(xid/sxid): rely on last reboot first by @cardyok in #373
- test(rootkeys): add unit tests by @gyuho in #357
- test(client/v1): increase unit test coverage by @gyuho in #353
- feat(ib, disk): use combined output for ibstat, lsblk by @gyuho in #395
Full Changelog: v0.4.2...v0.4.3
gpud-v0.4.2
GPUd release notes (2025-02-10T18:35:30Z)
What's Changed
Full Changelog: v0.4.1...v0.4.2
gpud-v0.4.1
GPUd release notes (2025-02-10T11:57:30Z)
What's Changed
- fix(nvml): parse driver version to handle the one without patch version by @gyuho in #374
- test(pkg/process): add more unit test cases by @gyuho in #356
- test(manager/packages): add unit test cases by @gyuho in #362
- test(pkg/systemd): add more unit tests to increase coverage by @gyuho in #355
- feat(process): support "ExitCode" method by @gyuho in #376
- fix(infiniband): surface ibstat parse failures, remove "down" match fallback for healthiness by @gyuho in #377
- feat(infiniband): directly call ibstat for ib states by @gyuho in #381
- fix(infiniband): return explicit errors for command not found by @gyuho in #384
- fix(infiniband): unit tests with specified ib threshold (do not use global threshold) by @gyuho in #386
- feat(nvidia): remove ibstat call from shared poller by @gyuho in #385
- feat(gpud): "scan --check-ib", move port/rates to "infiniband" package by @gyuho in #383
Full Changelog: v0.4.0...v0.4.1
gpud-v0.4.0
GPUd release notes (2025-02-07T10:34:19Z)
What's Changed
- feat(server): vacuum sqlite only once a week by @gyuho in #301
- feat(nvidia/sxid-xid-state): dedup in memory by minute level to minimize contending db inserts by @gyuho in #302
- chore(build): support arm64 build for linux by @photoszzt in #303
- feat(info): track gpud process self resource usage (file descriptors, RSS, start time, db size) by @gyuho in #296
- test(nvidia/xid): add more unit tests for extracting device uuid by @gyuho in #305
- fix(*): add missing close on process that runs commands as bash, rename abort to close by @gyuho in #294
- fix(join): use local context for join bash by @cardyok in #309
- feat(operation): add codecov coverage workflow by @leoshi01 in #311
- nit(gitignore): add .DS_Store, remove test binary by @gyuho in #316
- fix(diagnose): fix "gpud scan" (add missing temp db creation) by @gyuho in #314
- chore(ci): merge codecov workflow files by @leoshi01 in #313
- feat(sqlite): disable vacuum for now, bump up retention period, track latency, qps, expose via /info component by @gyuho in #307
- feat(pkg/dmesg): simpler watcher by @gyuho in #319
- fix(dmesg): add fallback "dmesg" command if lower-case "-w" fails by @gyuho in #320
- feat(nvidia): configurable nvidia-smi binary, ibstat binary, infiniband class dir paths for mock testing by @gyuho in #310
- fix(nvidia/hw-slowdown): evaluate state based on per-minute frequency of clock events over the last 10 minutes by @gyuho in #304
- fix(tailscale): remove unused version checks by @gyuho in #299
- feat(nvidia): move event type to common, define xid/sxid event level in the list by @gyuho in #297
- feat(nvidia/ib): remove "gpud run --expected-port-states-nvidia-infiniband" flag, only keep the default detection for backward compatibility by @gyuho in #308
- feat(*): remove redundant utc call by @gyuho in #325
- chore(e2e): refactor E2E and mock nvml/nvidia-smi/lspci by @FillZpp in #326
- feat(os/component): indicate os component is healthy via /states by @gyuho in #329
- feat(hw_slowdown): update suggested action by @cardyok in #330
- fix(hw_slowdown): fix description by @cardyok in #331
- feat(internal/session): set ib ports/rates dynamically by @gyuho in #328
- feat(components/db): add common events store (similar to components.Event) by @gyuho in #332
- feat(os/events): use read-only for reboot events, use common db pkg by @gyuho in #322
- feat(nvidia/xid,sxid/dmesg): add dmesg log line matcher, xid/sxid extractor by @gyuho in #333
- feat(components/memory): use common db + dmesg poller for events, move out of "dmesg" component by @gyuho in #324
- feat(nvidia/infiniband): log dynamic ports/rates config updates by @gyuho in #335
- feat(xid): simplify xid component by @cardyok in #321
- fix(diagnose): scan to use "dmesg" with no time limit by @gyuho in #338
- feat(fd): improve reasons, human-readable by @gyuho in #344
- feat(nvidia): set shared nvidia poller once in server.go by @gyuho in #345
- test(client/v1, nvidia/query): increase unit test coverage by @gyuho in #347
- feat(nvidia/query): pass events store to shared poller by @gyuho in #346
- feat(sxid): simplify sxid component by @cardyok in #337
- feat(pci): use common db pkg for events by @gyuho in #340
- feat(fuse): use common events store pkg by @gyuho in #339
- fix(info): correctly compute GPUd sqlite metrics delta by @gyuho in #349
- test(config): increase unit test coverage by @gyuho in #354
- test(pkg/sqlite): increase test coverage by @gyuho in #352
- feat(nvidia/xid): use common events DB for NVML-based xid watcher, disable NVML Xid event watcher in favor of "dmesg" watcher, deprecate redundant "error-xid-sxid" component by @gyuho in #343
- feat(nvidia/hw-slowdown): use common db pkg for events by @gyuho in #342
- fix(components/state): clean up update queries, increase unit test coverage, read last API version using SQLite rowid column by @gyuho in #361
- feat(file-descriptor): clean up fd check, only keep proc-based fd checking by @cardyok in #364
- fix(nvidia/remapped-rows): do not check row remapping for 4090 and other unsupported GPUs by @gyuho in #351
- releaser(ci, go): use go 1.23.6, disable arm releaser for now by @gyuho in #366
- fix(build): fix arm64 build by @photoszzt in #367
- fix(infiniband): make default value -1, allow future override by @cardyok in #369
New Contributors
Full Changelog: v0.3.9...v0.4.0
gpud-v0.3.9
GPUd release notes (2025-01-10T14:50:02Z)
What's Changed
- fix(ci): bump up linux header deps by @gyuho in #292
- fix(nvml): handle "not supported" error to not fail-fast for NVML get calls by @gyuho in #291
Full Changelog: v0.3.8...v0.3.9