Releases: leptonai/gpud

gpud-v0.1.2

30 Oct 16:27
b8550d7

GPUd release notes (2024-10-30T16:26:40Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/error/xid): correctly describe reason for non-empty dmesg errors by @gyuho in #141
  • feat(nvidia/error/xid): provide reason in JSON format with detailed error information by @gyuho in #142
  • feat(nvidia/infiniband): use GPU product name to decide ib support by @gyuho in #143
  • fix(components/*): do not mark unhealthy if no data yet by @gyuho in #144

Full Changelog: v0.1.1...v0.1.2

gpud-v0.1.1

29 Oct 13:36
52da171

GPUd release notes (2024-10-29T13:35:25Z)

Welcome to this new release!

What's Changed

  • nits(internal/server): clean up xid/dmesg component dependency logic by @gyuho in #139
  • fix(nvidia/infiniband): do not set unhealthy if infiniband is not supported by @gyuho in #140

Full Changelog: v0.1.0...v0.1.1

gpud-v0.1.0

27 Oct 09:39
a9d8b90

GPUd release notes (2024-10-27T09:38:10Z)

Welcome to this new release!

What's Changed

  • nits(server): debug level log for redundant register attempts by @gyuho in #126
  • fix(nvidia-smi/parse): do not parse remapped rows N/A by @gyuho in #128
  • feat(component/network): latency checks to global edge/DERP servers (using tailscale) by @gyuho in #125
  • fix(containerd): readable query failure error message (When CRI is not set up) by @gyuho in #129
  • fix(components): do not panic when there's no data collected yet by @gyuho in #130
  • feat(nvidia): exposing SM core and tensor core metrics in GPUd by @photoszzt in #132
  • fix(nvidia/query/metrics): remove duplicate metric register call by @gyuho in #133
  • feat(charts): add gpud run helm chart by @gyuho in #123
  • fix(infiniband): simplify ibstat existence when evaluating healthy by @gyuho in #124
  • feat(network/latency): track latency in metrics per region by @gyuho in #134
  • Update mothership endpoint by @cardyok in #82
  • fix(nvidia): use NVML + lspci to detect NVIDIA GPUs (without running nvidia-smi) by @gyuho in #127
  • fix(server): handle "components" URL query, return 404 not found on unknown component queries by @gyuho in #131
  • nits(nvidia/query): make detect logs debug level by @gyuho in #135
  • fix(status): fix divide by zero by @cardyok in #136
  • fix(nvidia/xid): do not error log when no xid happened yet by @gyuho in #138
  • fix(nvidia): persistence mode check based on NVML, do not rely on "nvidia-persistenced" binary by @gyuho in #137

Full Changelog: v0.0.5...v0.1.0

gpud-v0.0.5

16 Oct 02:57
e63d5c7

GPUd release notes (2024-10-16T02:56:33Z)

Welcome to this new release!

What's Changed

  • feat(nvidia/xid): add check user app and GPU action type, apply "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning by DeepSeek AI" by @gyuho in #100
  • fix(reboot): sudo typo by @cardyok in #107
  • fix(components/accelerator-nvidia-ecc): do not mark unhealthy when driver recovers uncorrectable ecc errors by @gyuho in #108
  • feat(nvidia/infiniband): suggest repair hardware for infiniband switch down by @gyuho in #112
  • fix(systemd, nvidia): mark dbus connection not available if not initialized, to avoid nil pointer panic by @gyuho in #104
  • fix(install.sh): fix install doc links by @gyuho in #105
  • fix(install.sh): print download failure debugging info by @gyuho in #106
  • feat(nvidia/ibstat): check "Physical state" as fallback by @gyuho in #99
  • feat(components): add accelerator detect func, "gpud accelerator" subcommand by @gyuho in #95
  • fix(fabric manager, nccl): fix fabric manager regex, add NCCL monitoring using dmesg by @gyuho in #113
  • fix(session): close writer goroutine by @cardyok in #116
  • nits(nvidia): fix Xid comments in error descriptions by @gyuho in #118
  • feat(reboot): support optional delay reboot (reboot immediately by default) by @gyuho in #115
  • feat(nvidia/nvml): update nvlib to 0.7.0, rename device ID fields by @gyuho in #114
  • feat(nvidia/info): report GPU device count from "/dev" (DCGM_FR_DEVICE_COUNT_MISMATCH DCGM) by @gyuho in #111
  • feat(nvidia): inspect process zombie status, bad env vars for CUDA per process (DCGM_FR_BAD_CUDA_ENV) by @gyuho in #119
  • fix(nvidia): suggest reboot for Xid 45, async nvidia-smi checks to not be stuck by @gyuho in #117
  • fix(k8s/pod): "gpud run --kubelet-ignore-connection-errors" to not mark unhealthy when read only port is not open (does not ignore by default) by @gyuho in #109
  • feat(nvidia): add bad-envs component for DCGM_FR_BAD_CUDA_ENV logic in DCGM by @gyuho in #121
  • feat(components/library): periodically check libnvidia/libcuda* (experimental) by @gyuho in #101
  • fix(docker-container): "gpud run --docker-ignore-connection-errors" to ignore docker daemon connection errors (do not ignore by default) by @gyuho in #110
  • feat(nvidia): add persistence-mode (both legacy, persistenced daemon checks), implements DCGM_FR_PERSISTENCE_MODE in DCGM by @gyuho in #120
  • feat(gpud): "gpud run --auto-update-exit-code" for daemon set auto update use case (optional) by @gyuho in #122

Full Changelog: v0.0.4...v0.0.5

gpud-v0.0.4

03 Oct 15:11
e1e0893

GPUd release notes (2024-10-03T15:10:02Z)

Welcome to this new release!

What's Changed

  • feat(pkg/process): change "New" function signature with op options, add more examples by @gyuho in #73
  • feat(nvidia/query): shorter timeouts for "nvidia-smi" calls by @gyuho in #88
  • fix(nvidia): return empty output object if smi/nvml is nil by @gyuho in #83
  • doc(nvidia/sxid): README to explain xid 79, sxid 20034 as an example by @gyuho in #85
  • nits(nvidia/query/nvml): remove unused GPUID fields by @gyuho in #79
  • feat(nvidia): add non-fatal sxid "20012" code, rename Detail.ID to SXID by @gyuho in #84
  • feat(nvidia): track row remapping, RMA/GPU reset status by @gyuho in #80
  • feat(nvidia/xid,sxid): rename Detail.ID to XID, add required actions for XID/SXID events by @gyuho in #81
  • doc(sxid): add more example events for gpu-operator by @gyuho in #91
  • feat(gpud): add "file" component that returns healthy when all specified files exist by @gyuho in #92
  • fix(components/fd): rename "fd_max_file_exists" to "fd_limit_supported", fix get limit on darwin by @gyuho in #93
  • feat(nvidia): track "ECC mode" (enabled/disabled) using nvidia-smi and NVML by @gyuho in #86
  • feat(nvidia/ecc): rename state name key to "ecc" (from ecc_errors) by @gyuho in #87
  • feat(server): allow custom uid with cli by @cardyok in #94
  • feat(nvidia/xid,sxid,remapped rows): add required actions field to /states, /events by @gyuho in #89
  • feat(internal/server): dynamically refresh containerd, docker, kubelet components by @gyuho in #78
  • feat(build, release): support Amazon Linux 2 and 2023 (experimental) by @gyuho in #97
  • feat(pkg/reboot): initial commit by @gyuho in #96
  • feat(session): support reboot method by @cardyok in #98

Full Changelog: v0.0.3...v0.0.4

gpud-v0.0.3

25 Sep 03:41
286b28e

GPUd release notes (2024-09-25T03:40:47Z)

Welcome to this new release!

What's Changed

  • feat(nvidia/peermem): track dmesg events for invalid context errors by @gyuho in #74
  • fix(power): fix power segfault by @cardyok in #76
  • fix(nvidia/peermem): do not decide health based on ibcore peermem module by @gyuho in #77

Full Changelog: v0.0.2...v0.0.3

gpud-v0.0.2

18 Sep 12:04
3a2d60a

GPUd release notes (2024-09-18T12:03:11Z)

Welcome to this new release!

What's Changed

  • feat(docker): list all containers in docker by @cardyok in #64
  • feat(pkg/systemd): remove redundant utils, move "pkg/update" by @gyuho in #67
  • feat(nvidia/fabric-manager): alert on nvlink multicast failures by @gyuho in #71
  • client(v1): move examples, add info by component by @gyuho in #65
  • feat(pkg/process): rename stop to abort, add systemd/journal utils by @gyuho in #68
  • feat(dmesg): add oom-kill:constraint regex for cri-containerd events by @gyuho in #70
  • feat(nvidia/query): fabric manager debugging info from journalctl by @gyuho in #69
  • feat(internal/session): add missing writer close for session writer by @gyuho in #66
  • fix(pkg/process): panic on wait before process initialization by @gyuho in #72

Full Changelog: v0.0.1...v0.0.2

gpud-v0.0.1

10 Sep 13:28
401c62b

GPUd release notes (2024-09-10T13:27:05Z)

Welcome to this new release!

What's Changed

  • doc(README): add badges, official links by @gyuho in #57
  • feat(systemd): enable gpud service by @cardyok in #59
  • fix(nvidia/query): handle error when lsmod reader is already closed for peermem checker by @gyuho in #60
  • fix(components/docker): do not set not healthy if docker client version incompatible by @gyuho in #62
  • fix(update): check update version in "gpud update" command by @hm2501 in #61
  • feat(client/v1): add basic get/read v1 API calls by @gyuho in #58
  • feat(goreleaser): use ubuntu 20.04 build as default linux artifact by @gyuho in #63

Full Changelog: v0.0.1-alpha9...v0.0.1

gpud-v0.0.1-alpha9

09 Sep 06:46
96f1684

GPUd release notes (2024-09-09T06:45:50Z)

Welcome to this new release!

What's Changed

  • doc(nvidia/error/xid): document how xid error is detected using dmesg by @gyuho in #46
  • fix(docs): pkg.go.dev links, add Makefile CGO_ENABLED=1 by @flyer103 in #47
  • feat(nvidia/nvml): include device uuid for xid event by @gyuho in #50
  • fix(nvidia/nvml): remove xid event polling gaps, log when event happens by @gyuho in #49
  • fix(nvidia/nvml): mark xid 68 as user app error, document by @gyuho in #51
  • fix(nvidia): skip clock events NVML check if not supported by old drivers by @gyuho in #48
  • fix(components/fd): use system-wide file descriptor limit, add default 1-million threshold limit, remove "_avg" metrics in fd component by @gyuho in #52
  • fix(accelerator/nvidia): panic when ibstat command fails, when recording errors by @gyuho in #53
  • fix(event): add timestamp for xid/sxid error event by @cardyok in #55
  • fix(session): handle io closed on write failure by @cardyok in #54
  • fix(accelerator/nvidia/gpm): add missing Healthy: true field by @gyuho in #56

Full Changelog: v0.0.1-alpha8...v0.0.1-alpha9

gpud-v0.0.1-alpha8

02 Sep 02:41
0eb10d6

GPUd release notes (2024-09-02T02:41:14Z)

Welcome to this new release!

What's Changed

  • fix(dmesg): fallback in case "dmesg --since" flag doesn't exist in older versions by @gyuho in #45

Full Changelog: v0.0.1-alpha7...v0.0.1-alpha8