The worker doesn't exit with an error when the primary goes down abnormally, because the StatusMessage is not checked. Would it be possible to make workers exit with an error when the primary goes down abnormally?
What impact does the worker not exiting with an error have? Presumably if the primary goes down prematurely it will exit with error and cause the job to fail. Is that not the case?
I think it results in false success messages in the worker algos, which can cause confusion. For the primary, yes, it will exit with an error as you say. I think we should also fail the worker containers, though.
Understood, and agreed this causes confusion. We'll have to update our shutdown logic to ensure the driver node communicates to the workers to exit successfully.
Thanks for the feedback, we're working this into our roadmap internally.
See flow here:
https://github.com/aws/sagemaker-spark-container/blob/master/src/smspark/job.py#L185
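To make the request concrete, here is a minimal sketch of the behavior being asked for: the worker picks its exit code from the last status it saw from the primary, rather than always exiting with 0. The `PrimaryStatus` enum and the `worker_exit_code` and `exit_worker` helpers are hypothetical names invented for illustration. They are not part of smspark, and this is not how the container currently works.

```python
import sys
from enum import Enum
from typing import Optional


class PrimaryStatus(Enum):
    """Hypothetical states a worker might record for the primary (assumption)."""

    FINISHED = "finished"        # primary reported a clean shutdown
    FAILED = "failed"            # primary reported an error
    UNREACHABLE = "unreachable"  # primary stopped responding without reporting


def worker_exit_code(last_status: Optional[PrimaryStatus]) -> int:
    """Return 0 only when the primary is known to have shut down cleanly.

    Any other outcome, including never having received a status (None),
    is treated as an abnormal shutdown and maps to a non-zero exit code.
    """
    if last_status is PrimaryStatus.FINISHED:
        return 0
    return 1


def exit_worker(last_status: Optional[PrimaryStatus]) -> None:
    """Terminate the worker process with the exit code chosen above."""
    sys.exit(worker_exit_code(last_status))
```

The design choice here is to fail closed: the worker reports success only on an explicit clean-shutdown signal, so a primary that crashes or disappears would no longer produce a false success on the worker side.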