Releases: leptonai/gpud
gpud-v0.2.5
GPUd release notes (2024-11-27T11:59:45Z)
Welcome to this new release!
What's Changed
Full Changelog: v0.2.4...v0.2.5
gpud-v0.2.4
GPUd release notes (2024-11-27T05:33:14Z)
Welcome to this new release!
What's Changed
- feat(nvidia/query): helpful debugging lines for nvml device list call failures by @gyuho in #201
- fix(nvidia/infiniband): match mellanox to count PCI devices by @gyuho in #204
- feat(nvidia/temperature): port DCGM_FR_TEMP_VIOLATION logic for high temperature alerts by @gyuho in #208
- fix(cmd/gpud): add --log-level flag to "scan", fix flag parsing for "run" commands, remove "scan --debug" flag by @gyuho in #206
- fix(pkg/systemd): handle "n/a" in uptime with trailing characters by @gyuho in #207
- feat(nvidia): re-order nvidia-smi collect after NVML calls by @gyuho in #202
- fix(session): disable http keep alive by @cardyok in #205
- nit(k8s/pod): quote string node name in case it's empty by @gyuho in #209
Full Changelog: v0.2.3...v0.2.4
gpud-v0.2.3
GPUd release notes (2024-11-21T18:04:41Z)
Welcome to this new release!
What's Changed
- fix(nvidia/infiniband): use sysclass ib directory count as default port state checks, use Infiniband PCI bus count to decide whether Infiniband is enabled or not by @gyuho in #200
Full Changelog: v0.2.2...v0.2.3
gpud-v0.2.2
GPUd release notes (2024-11-21T16:30:17Z)
Welcome to this new release!
What's Changed
Full Changelog: v0.2.1...v0.2.2
gpud-v0.2.1
GPUd release notes (2024-11-21T15:27:43Z)
Welcome to this new release!
What's Changed
Full Changelog: v0.2.0...v0.2.1
gpud-v0.2.0
GPUd release notes (2024-11-21T13:01:44Z)
Welcome to this new release!
What's Changed
- nit(nvidia/xid): add more Xid 119 test cases, simpler detection logging by @gyuho in #190
- feat(nvidia): parse infiniband ibstat for error checking based on GPU card counts by @gyuho in #189
- feat(components/os): use 20% of system descriptor limit for zombie process alerts by @gyuho in #193
- feat(session): add idle session timeout by @cardyok in #194
- fix(log/tail): correctly collect xid/sxid events from log scanner by @gyuho in #192
- fix(config/default): the flag "kubelet-ignore-connection-errors" is n… by @popsiclexu in #195
- feat(component/kernel-module): initial commit (track /etc/modules) by @gyuho in #191
- feat(nvidia/infiniband): make port states configurable by @gyuho in #196
- fix(join): remove space in provider by @cardyok in #197
New Contributors
- @popsiclexu made their first contribution in #195
Full Changelog: v0.1.9...v0.2.0
gpud-v0.1.9
GPUd release notes (2024-11-15T17:17:22Z)
Welcome to this new release!
What's Changed
- feat(internal/server): periodic status check logs in debug level by @gyuho in #186
- fix(internal/server): handle poller events no data error (don't error level log) by @gyuho in #185
- fix(nvidia/xid-sxid-state): persist xid/sxid in tail scan, better logging by @gyuho in #187
- fix(accelerator/nvidia): add missing poller initialization by @gyuho in #184
- feat(nvidia, dmesg): use dmesg iso for millisecond level, merge peermem events by minute level by @gyuho in #188
Full Changelog: v0.1.8...v0.1.9
gpud-v0.1.8
GPUd release notes (2024-11-14T16:34:51Z)
Welcome to this new release!
What's Changed
- fix(components/dmesg): do not read raw dmesg file with unix time by @gyuho in #182
- feat(query/log/tail): log stream with deduper by @gyuho in #183
- fix(nvidia/query): quote unusual process name for nvidia-smi parsing by @gyuho in #181
- feat(nvidia/error-xid-sxid): new component based on persistent xid, sxid event history by @gyuho in #157
Full Changelog: v0.1.7...v0.1.8
gpud-v0.1.7
GPUd release notes (2024-11-13T15:19:25Z)
Welcome to this new release!
What's Changed
- fix(nvidia/query): skip xid=0 event by @gyuho in #164
- fix(join): use cli flag as default value if skip interactive by @cardyok in #177
- ci(github): run uber-go/nilaway by @gyuho in #173
- feat(go.mod): bump up mattn/go-sqlite3 to @82bc911 by @gyuho in #167
- fix(pkg/process): skip without log for no such file by @gyuho in #172
- nits(components/nvidia): rename dmesg tail scan operations to be clearer by @gyuho in #176
- fix(components/systemd): fix non-existing service uptime checks by @gyuho in #178
- feat(components/query): move parse time func, filter to common packages by @gyuho in #175
- feat(pkg/sqlite): add Open function, remove unused table name by @gyuho in #169
- nit(components/nvidia/query): pass options to start default instance by @gyuho in #171
- feat(components/nvidia): separate id packages for xid, sxid by @gyuho in #170
- feat(pkg/dmesg): use common ctime parser by @gyuho in #174
- feat(tail/log): dedup same log string in scanner by @gyuho in #165
- feat(join): support specifying private ip by @cardyok in #179
- nit(components): clarify create table functions by @gyuho in #168
- feat(component/events): return query.ErrNoData if no event is found by @gyuho in #166
- fix(nvidia/query): correctly initialize default poller instance by @gyuho in #180
Full Changelog: v0.1.5...v0.1.7
gpud-v0.1.5
GPUd release notes (2024-11-08T02:20:32Z)
Welcome to this new release!
What's Changed
- fix(join): use detected provider by default if not provided by @cardyok in #146
- fix(components): do not attempt register if already registered by @gyuho in #148
- fix(nvidia): remove init func for nvidia packages (do not check nvidia if not needed) by @gyuho in #150
- nit(nvidia/nvml): log nvml call failures in critical paths, add xid 119 test cases by @gyuho in #149
- nits(accelerator/nvidia): rename reason fields, Xid/SXid detail fields for clarification by @gyuho in #151
- nits(accelerator): remove redundant criticality, suggested action fields, define sxid reason struct only by @gyuho in #152
- feat(components/os): count process counts per status (e.g., zombie, detached) by @gyuho in #147
- feat(nvidia/xid,sxid): more accurate criticality, suggested actions by GPUd, catch all Xids by @gyuho in #145
- feat(components): rename action name REPAIR_HARDWARE to HARDWARE_INSPECTION by @gyuho in #153
- feat(nvidia/xid, sxid): catch all events, return critical for /states, non-critical ones for /events by @gyuho in #155
- chore(fix): fix typo corrent -> correct by @Yangqing in #158
- feat(package): support deleting packages on session delete by @cardyok in #154
- nits(pkg): move "go-pkg" ones to "pkg" by @gyuho in #156
- feat(nvidia/xid, sxid): return all via /events by @gyuho in #159
- feat(nvidia): add/document xid 94, sxid 20009 by @gyuho in #160
- feat(nvidia/remapped-rows): suggest reboot/inspection on row remapping by @gyuho in #161
- feat(notify): support sending notification to control plane by @cardyok in #163
- feat(nvidia/gsp-firmware-mode): initial commit to track GSP modes by @gyuho in #162
Full Changelog: v0.1.2...v0.1.5