Releases: leptonai/gpud

gpud-v0.2.5

27 Nov 12:00
44f485d

GPUd release notes (2024-11-27T11:59:45Z)

Welcome to this new release!

What's Changed

  • feat(session): optimize default transport config by @cardyok in #210

Full Changelog: v0.2.4...v0.2.5

gpud-v0.2.4

27 Nov 05:34
cc844d9

GPUd release notes (2024-11-27T05:33:14Z)

Welcome to this new release!

What's Changed

  • feat(nvidia/query): helpful debugging lines for nvml device list call failures by @gyuho in #201
  • fix(nvidia/infiniband): match mellanox to count PCI devices by @gyuho in #204
  • feat(nvidia/temperature): port DCGM_FR_TEMP_VIOLATION logic for high temperature alerts by @gyuho in #208
  • fix(cmd/gpud): add --log-level flag to "scan", fix flag parsing for "run" commands, remove "scan --debug" flag by @gyuho in #206
  • fix(pkg/systemd): handle "n/a" in uptime with trailing characters by @gyuho in #207
  • feat(nvidia): re-order nvidia-smi collect after NVML calls by @gyuho in #202
  • fix(session): disable http keep alive by @cardyok in #205
  • nit(k8s/pod): quote string node name in case it's empty by @gyuho in #209

Full Changelog: v0.2.3...v0.2.4

gpud-v0.2.3

21 Nov 18:05
112f28b

GPUd release notes (2024-11-21T18:04:41Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/infiniband): use sysclass ib directory count as default port state checks, use Infiniband PCI bus count to decide whether Infiniband is enabled or not by @gyuho in #200

Full Changelog: v0.2.2...v0.2.3

gpud-v0.2.2

21 Nov 16:31
8bea6e9

GPUd release notes (2024-11-21T16:30:17Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/infiniband): use "<" to evaluate ip port rates by @gyuho in #199

Full Changelog: v0.2.1...v0.2.2

gpud-v0.2.1

21 Nov 15:28
e43da55

GPUd release notes (2024-11-21T15:27:43Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/infiniband): adjust default port rate based on GPU product by @gyuho in #198

Full Changelog: v0.2.0...v0.2.1

gpud-v0.2.0

21 Nov 13:03
b0f3451

GPUd release notes (2024-11-21T13:01:44Z)

Welcome to this new release!

What's Changed

  • nit(nvidia/xid): add more Xid 119 test case, simpler detection logging by @gyuho in #190
  • feat(nvidia): parse infiniband ibstat for error checking based on GPU card counts by @gyuho in #189
  • feat(components/os): use 20% of system descriptor limit for zombie process alerts by @gyuho in #193
  • feat(session): add idle session timeout by @cardyok in #194
  • fix(log/tail): correctly collect xid/sxid events from log scanner by @gyuho in #192
  • fix(config/default): the flag "kubelet-ignore-connection-errors" is n… by @popsiclexu in #195
  • feat(component/kernel-module): initial commit (track /etc/modules) by @gyuho in #191
  • feat(nvidia/infiniband): make port states configurable by @gyuho in #196
  • fix(join): remove space in provider by @cardyok in #197

Full Changelog: v0.1.9...v0.2.0

gpud-v0.1.9

15 Nov 17:18
f2e7792

GPUd release notes (2024-11-15T17:17:22Z)

Welcome to this new release!

What's Changed

  • feat(internal/server): periodic status check logs in debug level by @gyuho in #186
  • fix(internal/server): handle poller events no data error (don't error level log) by @gyuho in #185
  • fix(nvidia/xid-sxid-state): persist xid/sxid in tail scan, better logging by @gyuho in #187
  • fix(accelerator/nvidia): add missing poller initialization by @gyuho in #184
  • feat(nvidia, dmesg): use dmesg iso for millisecond level, merge peermem events by minute level by @gyuho in #188

Full Changelog: v0.1.8...v0.1.9

gpud-v0.1.8

14 Nov 16:35
9cc8610

GPUd release notes (2024-11-14T16:34:51Z)

Welcome to this new release!

What's Changed

  • fix(components/dmesg): do not read raw dmesg file with unix time by @gyuho in #182
  • feat(query/log/tail): log stream with deduper by @gyuho in #183
  • fix(nvidia/query): quote unusual process name for nvidia-smi parsing by @gyuho in #181
  • feat(nvidia/error-xid-sxid): new component based on persistent xid, sxid event history by @gyuho in #157

Full Changelog: v0.1.7...v0.1.8

gpud-v0.1.7

13 Nov 15:15
2e00123

GPUd release notes (2024-11-13T15:19:25Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/query): skip xid=0 event by @gyuho in #164
  • fix(join): use cli flag as default value if skip interactive by @cardyok in #177
  • ci(github): run uber-go/nilaway by @gyuho in #173
  • feat(go.mod): bump up mattn/go-sqlite3 to @82bc911 by @gyuho in #167
  • fix(pkg/process): skip without log for no such file by @gyuho in #172
  • nits(components/nvidia): rename dmesg tail scan operations to be clearer by @gyuho in #176
  • fix(components/systemd): fix non-existing service uptime checks by @gyuho in #178
  • feat(components/query): move parse time func, filter to common packages by @gyuho in #175
  • feat(pkg/sqlite): add Open function, remove unused table name by @gyuho in #169
  • nit(components/nvidia/query): pass options to start default instance by @gyuho in #171
  • feat(components/nvidia): separate id packages for xid, sxid by @gyuho in #170
  • feat(pkg/dmesg): use common ctime parser by @gyuho in #174
  • feat(tail/log): dedup same log string in scanner by @gyuho in #165
  • feat(join): support specifying private ip by @cardyok in #179
  • nit(components): clarify create table functions by @gyuho in #168
  • feat(component/events): return query.ErrNoData if no event is found by @gyuho in #166
  • fix(nvidia/query): correctly initialize default poller instance by @gyuho in #180

Full Changelog: v0.1.5...v0.1.7

gpud-v0.1.5

08 Nov 02:21
c9fba83

GPUd release notes (2024-11-08T02:20:32Z)

Welcome to this new release!

What's Changed

  • fix(join): use detected provider by default if not provided by @cardyok in #146
  • fix(components): do not attempt register if already registered by @gyuho in #148
  • fix(nvidia): remove init func for nvidia packages (do not check nvidia if not needed) by @gyuho in #150
  • nit(nvidia/nvml): log nvml call failures in critical paths, add xid 119 test cases by @gyuho in #149
  • nits(accelerator/nvidia): rename reason fields, Xid/SXid detail fields for clarification by @gyuho in #151
  • nits(accelerator): remove redundant criticality, suggested action fields, define sxid reason struct only by @gyuho in #152
  • feat(components/os): count process counts per status (e.g., zombie, detached) by @gyuho in #147
  • feat(nvidia/xid,sxid): more accurate criticality, suggested actions by GPUd, catch all Xids by @gyuho in #145
  • feat(components): rename action name REPAIR_HARDWARE to HARDWARE_INSPECTION by @gyuho in #153
  • feat(nvidia/xid, sxid): catch all events, return critical for /states, non-critical ones for /events by @gyuho in #155
  • chore(fix): fix typo corrent -> correct by @Yangqing in #158
  • feat(package): support deleting packages on session delete by @cardyok in #154
  • nits(pkg): move "go-pkg" ones to "pkg" by @gyuho in #156
  • feat(nvidia/xid, sxid): return all via /events by @gyuho in #159
  • feat(nvidia): add/document xid 94, sxid 20009 by @gyuho in #160
  • feat(nvidia/remapped-rows): suggest reboot/inspection on row remapping by @gyuho in #161
  • feat(notify): support sending notification to control plane by @cardyok in #163
  • feat(nvidia/gsp-firmware-mode): initial commit to track GSP modes by @gyuho in #162

Full Changelog: v0.1.2...v0.1.5