status-report: better/more specific summary for stuck workflows #342

Open
diegodelemos opened this issue Mar 2, 2021 · 0 comments

Background

As per #341, we no longer use the Kubernetes API to determine whether the maximum number of concurrent workflows has been reached; we now use the database value instead.

Problem

Since the database can become desynchronized from the live system (Kubernetes), we may run into problems when workflows get stuck in the running status (as has happened in the past, see reanahub/reana#478). These workflows count toward the maximum number of concurrent running workflows, potentially blocking the system and leaving all new workflows queued forever (as happened in reanahub/reana-commons#250).
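
To illustrate the failure mode, here is a minimal sketch. It assumes a hypothetical `can_start_workflow` check that simply counts the entries marked `running` in the database; the function name, the status strings, and the limit are illustrative and not taken from the REANA code base.

```python
def can_start_workflow(db_statuses, max_concurrent):
    """Return True if another workflow may start (hypothetical check)."""
    running = sum(1 for status in db_statuses if status == "running")
    return running < max_concurrent


# Two workflows are recorded as running in the database but, by assumption,
# their Kubernetes jobs no longer exist, so these entries never change.
stale_db_statuses = ["running", "running", "queued"]

# With a limit of 2, the queued workflow can never be started.
blocked = not can_start_workflow(stale_db_statuses, max_concurrent=2)
```

Because nothing ever moves the stale entries out of `running`, the check keeps failing until someone corrects the database by hand.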

Status-report oriented solution

Improve the status report summary to take into account situations in which workflows are stuck in a running state.

  • Better and more specific detection of stuck workflows: the current naive implementation doesn't capture the specific nature of stuck workflows.
  • Include quick actions/commands to fix issues: when the specific nature of a stuck workflow is detected, the summary could include quick actions that admins can run to clean up the cluster (note that this could later be used for automatic garbage collection of workflows).
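
One possible shape for the more specific detection is sketched below. Everything here is an assumption made for illustration: the `WorkflowRecord` type, the `find_stuck_workflows` helper, the reason strings, and the idea that the set of workflow IDs with a live Kubernetes job has already been collected elsewhere. It is not the existing status-report implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkflowRecord:
    """Hypothetical view of a workflow row in the database."""
    workflow_id: str
    status: str
    started_at: datetime


def find_stuck_workflows(records, live_workflow_ids, now, max_runtime):
    """Classify workflows marked as running in the DB that look stuck.

    Returns a list of (workflow_id, reason) tuples.
    """
    stuck = []
    for record in records:
        if record.status != "running":
            continue
        if record.workflow_id not in live_workflow_ids:
            # The DB says running, but the cluster has no job for it.
            stuck.append((record.workflow_id, "no matching job in cluster"))
        elif now - record.started_at > max_runtime:
            # A job exists but has exceeded the (assumed) runtime threshold.
            stuck.append((record.workflow_id, "running longer than threshold"))
    return stuck


now = datetime(2021, 3, 2, 12, 0)
records = [
    WorkflowRecord("wf-a", "running", now - timedelta(hours=1)),
    WorkflowRecord("wf-b", "running", now - timedelta(days=10)),
    WorkflowRecord("wf-c", "running", now - timedelta(hours=2)),
    WorkflowRecord("wf-d", "finished", now - timedelta(days=30)),
]
report = find_stuck_workflows(
    records,
    live_workflow_ids={"wf-a", "wf-b"},
    now=now,
    max_runtime=timedelta(days=7),
)
```

Attaching a reason to each workflow is what would let the summary suggest a matching quick action per case, and the same classification could later feed automatic garbage collection.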