Figure out a better alternative for the Celery subprocess in subprocess problem #54

Closed
lfse-slafleur opened this issue Jun 5, 2024 · 1 comment · Fixed by #58

Comments

@lfse-slafleur
Member

lfse-slafleur commented Jun 5, 2024

Celery unfortunately starts its worker subprocesses as 'daemons', which prevents the worker process of 'optimizer-worker' from creating any subprocesses of its own. Therefore, we acknowledge that this process is a daemon and turn off the protection that prevents new subprocesses from being created. This does introduce the issue that, if the task is cancelled/revoked, the subprocess created by the worker subprocess keeps running as an orphaned ('zombie') process until it completes.

This is necessary because Casadi does not release the GIL, which starves all other threads. We need to isolate mesido/Casadi in its own process.
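For reference, a minimal sketch of what "turning off the protection" can look like. It assumes the check in question is the daemon flag that `multiprocessing` inspects before starting a child process, and it relies on the private `_config` attribute of CPython's process object, which is an implementation detail and not a public API. The helper name `allow_child_processes` is made up for this example.

```python
import multiprocessing


def allow_child_processes() -> bool:
    """Clear the daemon flag of the current process.

    Returns the previous value of the flag so the caller can log it.
    """
    proc = multiprocessing.current_process()
    was_daemon = proc.daemon
    # The public `daemon` setter refuses to run once a process has started,
    # so this writes to the private `_config` dict instead (assumption:
    # CPython's multiprocessing implementation; may change between versions).
    proc._config["daemon"] = False
    return was_daemon
```

Calling this at the start of the task, before creating the process or pool that isolates Casadi, avoids the "daemonic processes are not allowed to have children" assertion. It does nothing to clean up the child when the task is revoked.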

Many alternatives were tried:

  • Use threading as a task pool instead of subprocess in Celery worker. Threading does not support 'terminate_job' and therefore we cannot cancel a task. Other task pool types also do not support 'terminate_job' nor provide the required isolation.
  • Use subprocess.run(). While this does not throw an error if run from a daemon process, it still causes the same issue. Also, exceptions are no longer propagated, so we need to ensure that a non-zero exit code throws an error. This is quite convoluted.
  • Use python -O. This turns off asserts (including the ones in Mesido!!) and therefore ignores the error. However, the issue remains so not a real solution.
  • Investigated alternatives to Celery. Dramatiq does not support cancellation of tasks. Huey does not support AMQP.
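To illustrate why the subprocess.run() route gets convoluted, here is a rough sketch of turning a failed child process back into an exception in the parent. The helper name `run_isolated` and the use of `python -c` are placeholders for this example, not how the worker actually launches Mesido.

```python
import subprocess
import sys


def run_isolated(code: str) -> None:
    """Run a Python snippet in a separate interpreter and surface failures."""
    try:
        # check=True makes subprocess.run() raise CalledProcessError
        # when the child exits with a non-zero exit code.
        subprocess.run(
            [sys.executable, "-c", code],
            check=True,
            capture_output=True,
            text=True,
        )
    except subprocess.CalledProcessError as exc:
        # The original exception type is lost; only the exit code and
        # whatever the child wrote to stderr are available here.
        raise RuntimeError(
            f"child exited with code {exc.returncode}: {exc.stderr.strip()}"
        ) from exc
```

Note that the parent only sees an exit code and a stderr string, so the original exception object from the child cannot be re-raised as-is.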

Remaining alternatives:

  • Wait for Python 3.13, which introduces an optional, experimental free-threaded build that runs without the GIL.
  • Ask Casadi to fix the issue where they never release the GIL. --> No more subprocesses in subprocesses needed. See also casadi/casadi#2955 ("Do not keep python GIL").
  • Extend Celery ThreadingTaskPool to support terminate_job by asking the underlying thread nicely to stop as soon as it can. Also, we would switch over to 'threading' as the task pool. Will require some work as Celery is quite complex.
  • Find an alternative for Celery or build our own. --> Will not resolve issue of allowing subprocesses in subprocesses, but may give the freedom to design the worker in such a way that only a single subprocess (to isolate Casadi) is needed.
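As a sketch of what "asking the thread nicely to stop" could mean for the threading task pool option: the task polls a `threading.Event` between units of work and returns early once it is set. The function name and the sleep standing in for real work are placeholders. This only works if the task regularly gets back to Python code, which is exactly what a long Casadi call holding the GIL would not do.

```python
import threading
import time


def cancellable_task(stop_requested: threading.Event, steps: int = 1000) -> int:
    """Do up to `steps` units of work, stopping early if asked to.

    Returns the number of units that were completed.
    """
    done = 0
    for _ in range(steps):
        # Cooperative cancellation: the task itself checks the flag.
        if stop_requested.is_set():
            break
        time.sleep(0.001)  # placeholder for one unit of real work
        done += 1
    return done
```

A hypothetical `terminate_job` for a thread-based pool would then boil down to calling `stop_requested.set()` and joining the thread.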
@lfse-slafleur
Member Author

We landed on hooking into the way Celery cancels the forked worker process: it raises a SystemExit with code -241. By listening for the SystemExit, we can terminate the multiprocessing.Pool that runs the Mesido worker before the Celery worker waits for the forked worker process to terminate. This is still a workaround, as Casadi still does not release the GIL. This is now captured in: #57
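A simplified sketch of that idea, not the actual code from the fix. The helper name `run_in_pool` is made up, and it assumes the SystemExit surfaces in the forked worker process while it is blocked waiting for the pool result.

```python
import multiprocessing.pool
from typing import Any, Callable


def run_in_pool(
    pool: multiprocessing.pool.Pool, func: Callable[..., Any], *args: Any
) -> Any:
    """Run func(*args) in the pool and block until the result is available.

    If SystemExit is raised while waiting (e.g. because the task was
    cancelled), the pool is terminated before the exception propagates.
    """
    try:
        return pool.apply_async(func, args).get()
    except SystemExit:
        pool.terminate()  # stop the pool's child processes immediately
        pool.join()  # wait for them to exit so nothing lingers
        raise  # let the cancellation continue as usual
```

Here `pool` would be something like `multiprocessing.Pool(processes=1)` created inside the task, with `func` being the function that calls into Mesido.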
