You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
It turns out that the user program has an 0 return code even if a failure happens, which causes our driver process to wait for any in-progress processes to return with 0 return code.
A user reported that although their distributed training code failed due to NCCL error on worker node, SkyPilot does not set the job in FAILED state.
Version & Commit info:
sky -v
: PLEASE_FILL_INsky -c
: PLEASE_FILL_INThe text was updated successfully, but these errors were encountered: