Releases: leptonai/gpud
gpud-v0.1.2
GPUd release notes (2024-10-30T16:26:40Z)
Welcome to this new release!
What's Changed
- fix(nvidia/error/xid): correctly describe reason for non-empty dmesg errors by @gyuho in #141
- feat(nvidia/error/xid): provide reason in JSON format with detailed error information by @gyuho in #142
- feat(nvidia/infiniband): use GPU product name to decide ib support by @gyuho in #143
- fix(components/*): do not mark unhealthy if no data yet by @gyuho in #144
Full Changelog: v0.1.1...v0.1.2
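The "no data yet" fix in #144 can be pictured with a minimal sketch. It is written in Python for brevity and is not GPUd's actual code. The function name evaluate_health and the list of per-sample error counts are hypothetical, chosen only to illustrate treating an empty sample set as healthy rather than as a failure.

```python
from typing import Optional, Sequence, Tuple


def evaluate_health(error_counts: Optional[Sequence[int]]) -> Tuple[bool, str]:
    """Return (healthy, reason) for a component's collected samples.

    Hypothetical helper: a missing or empty sample set is assumed to mean
    the poller has not produced data yet, so it is reported as healthy.
    """
    if not error_counts:
        return True, "no data collected yet"
    total = sum(error_counts)
    if total > 0:
        return False, f"{total} error(s) observed"
    return True, "no errors observed"
```

With this shape, a freshly started daemon reports healthy with an explicit reason until its first poll completes, instead of raising a false alarm.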
gpud-v0.1.1
GPUd release notes (2024-10-29T13:35:25Z)
Welcome to this new release!
What's Changed
- nits(internal/server): clean up xid/dmesg component dependency logic by @gyuho in #139
- fix(nvidia/infiniband): do not set unhealthy if infiniband is not supported by @gyuho in #140
Full Changelog: v0.1.0...v0.1.1
gpud-v0.1.0
GPUd release notes (2024-10-27T09:38:10Z)
Welcome to this new release!
What's Changed
- nits(server): debug level log for redundant register attempts by @gyuho in #126
- fix(nvidia-smi/parse): do not parse remapped rows N/A by @gyuho in #128
- feat(component/network): latency checks to global edge/DERP servers (using tailscale) by @gyuho in #125
- fix(containerd): readable query failure error message (When CRI is not set up) by @gyuho in #129
- fix(components): do not panic when there's no data collected yet by @gyuho in #130
- feat(nvidia): exposing SM core and tensor core metrics in GPUd by @photoszzt in #132
- fix(nvidia/query/metrics): remove duplicate metric register call by @gyuho in #133
- feat(charts): add gpud run helm chart by @gyuho in #123
- fix(infiniband): simplify ibstat existence when evaluating healthy by @gyuho in #124
- feat(network/latency): track latency in metrics per region by @gyuho in #134
- Update mothership endpoint by @cardyok in #82
- fix(nvidia): use NVML + lspci to detect NVIDIA GPUs (without running nvidia-smi) by @gyuho in #127
- fix(server): handle "components" URL query, return 404 not found on unknown component queries by @gyuho in #131
- nits(nvidia/query): make detect logs debug level by @gyuho in #135
- fix(status): fix divide by zero by @cardyok in #136
- fix(nvidia/xid): do not error log when no xid happened yet by @gyuho in #138
- fix(nvidia): persistence mode check based on NVML, do not rely on "nvidia-persistenced" binary by @gyuho in #137
New Contributors
- @photoszzt made their first contribution in #132
Full Changelog: v0.0.5...v0.1.0
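The "components" URL query handling from #131 could look roughly like the sketch below. It is an illustration in Python, not GPUd's server code. The comma-separated query format, the function name resolve_components, and the choice to return every component for an empty query are all assumptions; only the 404 on unknown component names comes from the release note.

```python
from typing import Iterable, List, Tuple


def resolve_components(query: str, registered: Iterable[str]) -> Tuple[int, List[str]]:
    """Map a "components" query value to (HTTP status, component names).

    Hypothetical helper: the query is assumed to be a comma-separated list.
    On 404 the returned list holds the unknown names.
    """
    known = set(registered)
    # Assumed default: an empty query selects every registered component.
    if not query.strip():
        return 200, sorted(known)
    requested = [name.strip() for name in query.split(",") if name.strip()]
    unknown = [name for name in requested if name not in known]
    if unknown:
        return 404, unknown
    return 200, requested
```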
gpud-v0.0.5
GPUd release notes (2024-10-16T02:56:33Z)
Welcome to this new release!
What's Changed
- feat(nvidia/xid): add check user app and GPU action type, apply "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning by DeepSeek AI" by @gyuho in #100
- fix(reboot): sudo typo by @cardyok in #107
- fix(components/accelerator-nvidia-ecc): do not unhealthy when driver recovers uncorrectable ecc errors by @gyuho in #108
- feat(nvidia/infiniband): suggest repair hardware for infiniband switch down by @gyuho in #112
- fix(systemd, nvidia): mark dbus connection not available if not initialized, to avoid nil pointer panic by @gyuho in #104
- fix(install.sh): fix install doc links by @gyuho in #105
- fix(install.sh): print download failure debugging info by @gyuho in #106
- feat(nvidia/ibstat): check "Physical state" as fallback by @gyuho in #99
- feat(components): add accelerator detect func, "gpud accelerator" subcommand by @gyuho in #95
- fix(fabric manager, nccl): fix fabric manager regex, add NCCL monitoring using dmesg by @gyuho in #113
- fix(session): close writer goroutine by @cardyok in #116
- nits(nvidia): fix Xid comments in error descriptions by @gyuho in #118
- feat(reboot): support optional delay reboot (reboot immediately by default) by @gyuho in #115
- feat(nvidia/nvml): update nvlib to 0.7.0, rename device ID fields by @gyuho in #114
- feat(nvidia/info): report GPU device count from "/dev" (DCGM_FR_DEVICE_COUNT_MISMATCH DCGM) by @gyuho in #111
- feat(nvidia): inspect process zombie status, bad env vars for CUDA per process (DCGM_FR_BAD_CUDA_ENV) by @gyuho in #119
- fix(nvidia): suggest reboot for Xid 45, async nvidia-smi checks to not be stuck by @gyuho in #117
- fix(k8s/pod): "gpud run --kubelet-ignore-connection-errors" to not mark unhealthy when read only port is not open (does not ignore by default) by @gyuho in #109
- feat(nvidia): add bad-envs component for DCGM_FR_BAD_CUDA_ENV logic in DCGM by @gyuho in #121
- feat(components/library): periodically check libnvidia/libcuda* (experimental) by @gyuho in #101
- fix(docker-container): "gpud run --docker-ignore-connection-errors" to ignore docker daemon connection errors (do not ignore by default) by @gyuho in #110
- feat(nvidia): add persistence-mode (both legacy, persistenced daemon checks), implements DCGM_FR_PERSISTENCE_MODE in DCGM by @gyuho in #120
- feat(gpud): "gpud run --auto-update-exit-code" for daemon set auto update use case (optional) by @gyuho in #122
Full Changelog: v0.0.4...v0.0.5
gpud-v0.0.4
GPUd release notes (2024-10-03T15:10:02Z)
Welcome to this new release!
What's Changed
- feat(pkg/process): change "New" function signature with op options, add more examples by @gyuho in #73
- feat(nvidia/query): shorter timeouts for "nvidia-smi" calls by @gyuho in #88
- fix(nvidia): return empty output object if smi/nvml is nil by @gyuho in #83
- doc(nvidia/sxid): README to explain xid 79, sxid 20034 as an example by @gyuho in #85
- nits(nvidia/query/nvml): remove unused GPUID fields by @gyuho in #79
- feat(nvidia): add non-fatal sxid "20012" code, rename Detail.ID to SXID by @gyuho in #84
- feat(nvidia): track row remapping, RMA/GPU reset status by @gyuho in #80
- feat(nvidia/xid,sxid): rename Detail.ID to XID, add required actions for XID/SXID events by @gyuho in #81
- doc(sxid): add more example events for gpu-operator by @gyuho in #91
- feat(gpud): add "file" component that returns healthy when all specified files exist by @gyuho in #92
- fix(components/fd): rename "fd_max_file_exists" to "fd_limit_supported", fix get limit on darwin by @gyuho in #93
- feat(nvidia): track "ECC mode" (enabled/disabled) using nvidia-smi and NVML by @gyuho in #86
- feat(nvidia/ecc): rename state name key to "ecc" (from ecc_errors) by @gyuho in #87
- feat(server): allow custom uid with cli by @cardyok in #94
- feat(nvidia/xid,sxid,remapped rows): add required actions field to /states, /events by @gyuho in #89
- feat(internal/server): dynamically refresh containerd, docker, kubelet components by @gyuho in #78
- feat(build, release): support Amazon Linux 2 and 2023 (experimental) by @gyuho in #97
- feat(pkg/reboot): initial commit by @gyuho in #96
- feat(session): support reboot method by @cardyok in #98
Full Changelog: v0.0.3...v0.0.4
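The "file" component added in #92 is described as healthy when all specified files exist. A minimal sketch of that rule follows, in Python for brevity. It is not GPUd's implementation: the function name check_files and the returned list of missing paths are hypothetical additions for illustration.

```python
import os
from typing import Iterable, List, Tuple


def check_files(paths: Iterable[str]) -> Tuple[bool, List[str]]:
    """Return (healthy, missing): healthy only if every path exists.

    Hypothetical helper; the list of missing paths is included so a
    caller can report which files caused the unhealthy state.
    """
    missing = [p for p in paths if not os.path.exists(p)]
    return len(missing) == 0, missing
```

An empty path list yields a healthy result here, which is a design choice of this sketch rather than documented GPUd behavior.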
gpud-v0.0.3
GPUd release notes (2024-09-25T03:40:47Z)
Welcome to this new release!
What's Changed
- feat(nvidia/peermem): track dmesg events for invalid context errors by @gyuho in #74
- fix(power): fix power segfault by @cardyok in #76
- fix(nvidia/peermem): do not decide health based on ibcore peermem module by @gyuho in #77
Full Changelog: v0.0.2...v0.0.3
gpud-v0.0.2
GPUd release notes (2024-09-18T12:03:11Z)
Welcome to this new release!
What's Changed
- feat(docker): list all containers in docker by @cardyok in #64
- feat(pkg/systemd): remove redundant utils, move "pkg/update" by @gyuho in #67
- feat(nvidia/fabric-manager): alert on nvlink multicast failures by @gyuho in #71
- client(v1): move examples, add info by component by @gyuho in #65
- feat(pkg/process): rename stop to abort, add systemd/journal utils by @gyuho in #68
- feat(dmesg): add oom-kill:constraint regex for cri-containerd events by @gyuho in #70
- feat(nvidia/query): fabric manager debugging info from journalctl by @gyuho in #69
- feat(internal/session): add missing writer close for session writer by @gyuho in #66
- fix(pkg/process): panic on wait before process initialization by @gyuho in #72
Full Changelog: v0.0.1...v0.0.2
gpud-v0.0.1
GPUd release notes (2024-09-10T13:27:05Z)
Welcome to this new release!
What's Changed
- doc(README): add badges, official links by @gyuho in #57
- feat(systemd): enable gpud service by @cardyok in #59
- fix(nvidia/query): handle error when lsmod reader is already closed for peermem checker by @gyuho in #60
- fix(components/docker): do not set not healthy if docker client version incompatible by @gyuho in #62
- fix(update): check update version in "gpud update" command by @hm2501 in #61
- feat(client/v1): add basic get/read v1 API calls by @gyuho in #58
- feat(goreleaser): use ubuntu 20.04 build as default linux artifact by @gyuho in #63
Full Changelog: v0.0.1-alpha9...v0.0.1
gpud-v0.0.1-alpha9
GPUd release notes (2024-09-09T06:45:50Z)
Welcome to this new release!
What's Changed
- doc(nvidia/error/xid): document how xid error is detected using dmesg by @gyuho in #46
- fix(docs): pkg.go.dev links, add Makefile CGO_ENABLED=1 by @flyer103 in #47
- feat(nvidia/nvml): include device uuid for xid event by @gyuho in #50
- fix(nvidia/nvml): remove xid event polling gaps, log when event happens by @gyuho in #49
- fix(nvidia/nvml): mark xid 68 as user app error, document by @gyuho in #51
- fix(nvidia): skip clock events NVML check if not supported by old drivers by @gyuho in #48
- fix(components/fd): use system-wide file descriptor limit, add default 1-million threshold limit, remove "_avg" metrics in fd component by @gyuho in #52
- fix(accelerator/nvidia): panic when ibstat command fails, when recording errors by @gyuho in #53
- fix(event): add timestamp for xid/sxid error event by @cardyok in #55
- fix(session): handle io closed on write failure by @cardyok in #54
- fix(accelerator/nvidia/gpm): add missing Healthy: true field by @gyuho in #56
New Contributors
Full Changelog: v0.0.1-alpha8...v0.0.1-alpha9
gpud-v0.0.1-alpha8
GPUd release notes (2024-09-02T02:41:14Z)
Welcome to this new release!
What's Changed
Full Changelog: v0.0.1-alpha7...v0.0.1-alpha8