Right now, zeusd assumes NVML operations will mostly succeed. To make it more robust, we want to handle more failure cases. NVML might hang for some unknown reason, and we don't want the management task in zeusd (and thus a blocking request) to also hang forever. Or a GPU might become lost, which will raise a specific error from NVML.
We want some timeout, a cancellation mechanism, and a way to mark the GPU as dead so that subsequent requests don't wait the full timeout. The failure will be reported, but we don't want zeusd threads to panic and burn and die.
Timeouts for sync code cannot be done with tokio::time::timeout, since it only checks for a missed deadline at yield points, and a blocking NVML call never yields. Instead, perhaps it's a better idea to have a dedicated thread for each GPU manager (instead of the current Tokio task implementation) and use tokio::time::timeout on channel.recv, assuming channel.recv has multiple yield points internally while awaiting. Tokio channels can do sync-async communication (e.g., Receiver.blocking_recv).
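A rough sketch of the dedicated-thread idea, to make the shape concrete. To keep it dependency-free, it swaps Tokio channels for std::sync::mpsc and uses recv_timeout on the reply channel in place of tokio::time::timeout. In zeusd the caller would be async, so the waiting side would look different. GpuHandle, GpuCommand, GpuError, and the `work` field are hypothetical names made up for illustration, and the NVML call is simulated with a sleep.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical command type; a real one would wrap actual NVML calls.
enum GpuCommand {
    /// Stand-in for an NVML call that takes `work` to return.
    SetPowerLimit { work: Duration },
}

#[derive(Debug, PartialEq)]
enum GpuError {
    /// The worker did not reply within the deadline.
    Timeout,
    /// The GPU was already marked dead, or its worker thread is gone.
    Dead,
}

/// A command plus the channel on which the worker sends its reply.
type Request = (GpuCommand, Sender<Result<(), GpuError>>);

/// Handle to one GPU's dedicated worker thread.
struct GpuHandle {
    tx: Sender<Request>,
    dead: AtomicBool,
    timeout: Duration,
}

impl GpuHandle {
    fn spawn(timeout: Duration) -> Self {
        let (tx, rx) = mpsc::channel::<Request>();
        // The worker owns all (simulated) NVML access for this GPU.
        // It exits when every sender has been dropped.
        thread::spawn(move || {
            for (cmd, reply) in rx {
                match cmd {
                    GpuCommand::SetPowerLimit { work } => thread::sleep(work),
                }
                // The requester may have given up already; ignore send errors.
                let _ = reply.send(Ok(()));
            }
        });
        GpuHandle { tx, dead: AtomicBool::new(false), timeout }
    }

    fn send(&self, cmd: GpuCommand) -> Result<(), GpuError> {
        // Fail fast so later requests do not wait the full timeout.
        if self.dead.load(Ordering::Acquire) {
            return Err(GpuError::Dead);
        }
        let (reply_tx, reply_rx) = mpsc::channel();
        if self.tx.send((cmd, reply_tx)).is_err() {
            self.dead.store(true, Ordering::Release);
            return Err(GpuError::Dead);
        }
        match reply_rx.recv_timeout(self.timeout) {
            Ok(result) => result,
            Err(RecvTimeoutError::Timeout) => {
                self.dead.store(true, Ordering::Release);
                Err(GpuError::Timeout)
            }
            Err(RecvTimeoutError::Disconnected) => {
                self.dead.store(true, Ordering::Release);
                Err(GpuError::Dead)
            }
        }
    }
}
```

With a handle created by GpuHandle::spawn, a command that outlives the timeout returns GpuError::Timeout to the caller, and every later send returns GpuError::Dead immediately. Note that this does not cancel the hung call: the worker thread is simply abandoned, so a real cancellation mechanism would still be needed on top of this.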