Initial task manager resiliency and error handling #789

Merged: 3 commits, Mar 26, 2024

@jtronge (Collaborator) commented Feb 28, 2024

These are some initial changes to improve resiliency and error handling in the task manager. This doesn't completely resolve issues #675 and #676, but I thought I'd open it now since the changes are relatively self-contained. Rusty and I discussed those issues and think it would be best to have a longer discussion about them and about the interaction between the task manager and the workflow manager.

Also, I think this should resolve #550. The builder should now throw an exception when it fails to build or pull a container. I added an integration test case for an invalid container build.

The Task Manager will now resubmit jobs that failed due to error statuses indicating failures not caused by the job itself, such as NODE_FAIL. These are all Slurm state codes right now and may not be applicable to other schedulers like Flux.

Adds a ContainerBuildError exception to be thrown on builder failures. The Task Manager catches this and sets task states to BUILD_FAIL.
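
The exception flow described above could look roughly like the following sketch. `ContainerBuildError` and `BUILD_FAIL` come from the description; `build_container`, `prepare_task`, the dict-based task, and the `READY` state are hypothetical stand-ins, not the project's actual API.

```python
class ContainerBuildError(Exception):
    """Raised when a container cannot be built or pulled."""


def build_container(task):
    """Stand-in builder (hypothetical): fails when the task has no image."""
    if not task.get("image"):
        raise ContainerBuildError(
            f"cannot build container for task {task['name']!r}"
        )
    return task["image"]


def prepare_task(task):
    """Build the task's container and return the resulting task state."""
    try:
        build_container(task)
    except ContainerBuildError:
        # A builder failure becomes a task state instead of an uncaught crash.
        return "BUILD_FAIL"
    # "READY" is an assumed placeholder for whatever state follows a good build.
    return "READY"
```

Raising a dedicated exception type lets the caller distinguish build failures from unrelated errors, which are left to propagate.
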

@aquan9 (Collaborator) left a comment

This all looks good to me. Good job writing a new test for your changes.

@pagrubel self-assigned this Mar 26, 2024
@pagrubel merged commit c907ff0 into develop Mar 26, 2024
4 checks passed
@pagrubel deleted the task-manager-resiliency branch March 26, 2024 23:11
@pagrubel mentioned this pull request Mar 28, 2024
Successfully merging this pull request may close these issues.

Builder Errors