[Zeusd] Better failure handling and testing #88

Open
jaywonchung opened this issue May 30, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@jaywonchung
Member

Right now, zeusd assumes that NVML operations will mostly succeed. To make it more robust, we want to handle more failure cases. NVML might hang for some unknown reason, and we don't want the management task in zeusd (and thus the request blocked on it) to hang forever as well. Or a GPU might become lost, in which case NVML returns a specific error.

We want a timeout, a cancellation mechanism, and a way to mark a GPU as dead so that subsequent requests fail fast instead of waiting out the full timeout. The failure should be reported to the caller, but we don't want zeusd threads to panic and burn and die.
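
To make the "timeout plus dead flag" idea concrete, here is a minimal sketch using only the standard library. It is not zeusd's code: `GpuError`, `GpuHealth`, and `run_with_deadline` are hypothetical names, and the closure `op` stands in for a blocking NVML call. The first caller to hit the deadline marks the GPU dead, and later callers get an error immediately without waiting.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Errors reported to the caller (hypothetical, not zeusd's actual types).
#[derive(Debug, PartialEq)]
pub enum GpuError {
    /// The operation did not finish within the deadline.
    Timeout,
    /// The GPU was marked dead by an earlier failure; no waiting was done.
    Dead,
    /// The worker thread ended without producing a result (e.g., it panicked).
    WorkerFailed,
}

/// Shared health flag for one GPU.
#[derive(Clone, Default)]
pub struct GpuHealth {
    dead: Arc<AtomicBool>,
}

impl GpuHealth {
    pub fn is_dead(&self) -> bool {
        self.dead.load(Ordering::SeqCst)
    }

    fn mark_dead(&self) {
        self.dead.store(true, Ordering::SeqCst);
    }
}

/// Run `op` (a stand-in for a blocking NVML call) on a detached thread and
/// wait at most `deadline` for its result.
pub fn run_with_deadline<T, F>(
    health: &GpuHealth,
    deadline: Duration,
    op: F,
) -> Result<T, GpuError>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    // Fail fast: once a GPU is marked dead, nobody waits for it again.
    if health.is_dead() {
        return Err(GpuError::Dead);
    }

    let (tx, rx) = mpsc::channel();
    // If `op` hangs, this thread is leaked, but the caller is not blocked.
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out.
        let _ = tx.send(op());
    });

    match rx.recv_timeout(deadline) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => {
            health.mark_dead();
            Err(GpuError::Timeout)
        }
        Err(RecvTimeoutError::Disconnected) => {
            health.mark_dead();
            Err(GpuError::WorkerFailed)
        }
    }
}
```

Note that this does not cancel the hung call; it only stops callers from waiting on it. Spawning a thread per operation is also just for brevity here; a long-lived per-GPU thread would avoid that cost.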

jaywonchung added the enhancement label on May 30, 2024
@jaywonchung
Member Author

Timeouts for sync code cannot be implemented with tokio::time::timeout, since it only checks for deadline misses at yield points, and a blocking call never yields. Instead, perhaps it's a better idea to have a dedicated thread for each GPU manager (instead of the current Tokio task implementation) and use tokio::time::timeout on channel.recv. This should work because channel.recv yields back to the runtime whenever no message is ready. Tokio channels support sync-async communication (e.g., Receiver::blocking_recv).
