Problem: transfers are abandoned if the workflow is interrupted #36

Closed
scollazo opened this issue Jan 24, 2020 · 6 comments

Comments

@scollazo
Contributor

There is no way to stop enduro without leaving transfers in an unknown state.

There should be a way to stop enduro gracefully, allowing all in progress tasks to finish without starting new ones.

@sevein
Member

sevein commented Jan 24, 2020

Thanks @scollazo. It'd be nice to fix this, and I think I know what's happening. Part of our workflow runs within a session (the download, bundle, transfer, poll-transfer, poll-ingest and hari/prod activities). When the worker running the session dies, Cadence detects it and terminates the session automatically.

I think that we have to do a couple of things:

  • Retry the session when workflow.ErrSessionFailed is received. The complication is that Cadence will likely schedule the session on a new worker (once #37, "Problem: workers can't be deployed separately", is fixed), so our workflow would need to learn how to handle that.
  • Reduce the session window as much as possible (it currently spans multiple activities, including poll-transfer and poll-ingest).

I've also filed #37 to describe the lack of the ability to deploy workers separately. I'm mentioning this here because once you have separate workers, you can stop some processes without affecting others.

sevein changed the title from "Problem: enduro can't be gracefully stopped" to "Problem: transfers are abandoned if the workflow is interrupted" on Jan 24, 2020
sevein added this to the Backlog milestone on Jan 24, 2020
@scollazo
Contributor Author

scollazo commented Jan 28, 2020

While workers seem like a nice thing to have, I don't see how they'll solve the problem. Are you thinking about running a worker for each pipeline and stopping it when needed, without stopping enduro itself?

@scollazo
Contributor Author

scollazo commented Jan 28, 2020

What I had in mind when I filed the issue was that enduro should be able to handle the SIGHUP signal and stop itself once all "IN PROGRESS" transfers finish.

There can be corner cases where stuck tasks don't allow enduro to finish, and the process might need to be sent the SIGKILL signal, but that could be handled by systemd.

@scollazo
Contributor Author

For the record, this Cadence issue might be of interest: cadence-workflow/cadence-go-client#775

@sevein
Member

sevein commented Jan 29, 2020

Discussed offline. Tentative fix in 486be45 (v0.21.0). Enduro already had signal handling, which results in the cancelation of activities. In v0.21.0, the processing session is retried after cancelation. Activities within the session are executed only when needed, e.g. transfer-activity won't be executed if we already have a TransferID, and so on. For example, if Archivematica was busy at Transfer when the worker died, then poll-transfer-activity will eventually run (unless we have a SIPID) and go back to waiting as expected. So the idea is that transfers won't be abandoned.

sevein closed this as completed on Jan 29, 2020
@sevein
Member

sevein commented Jan 29, 2020

Reminder: the semaphore that protects pipelines is still local, and it's flushed when you kill Enduro, so expect more transfers to enter the critical section. That's not something we can control well until we solve #43.
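
To illustrate the caveat, here is a minimal sketch of an in-memory semaphore. Using a buffered channel is an assumption made for the example, and `localSemaphore` is a hypothetical name; Enduro's actual semaphore may be built differently.

```go
package main

// localSemaphore limits concurrent transfers for a pipeline within one process.
// Its state lives only in memory, so it is lost when the process exits.
type localSemaphore chan struct{}

// newLocalSemaphore returns a semaphore with the given number of free slots.
func newLocalSemaphore(capacity int) localSemaphore {
	return make(localSemaphore, capacity)
}

// TryAcquire takes a slot if one is free and reports whether it succeeded.
func (s localSemaphore) TryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot; it does nothing if no slot is currently held.
func (s localSemaphore) Release() {
	select {
	case <-s:
	default:
	}
}
```

A freshly created semaphore always starts with every slot free, so a restarted process can admit a new transfer even if the pipeline is still working on one that was admitted before the restart.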
