Architecture: Make services restartable #5614

Open
4 tasks
SCA-ZMT opened this issue Apr 4, 2024 Discussed in #5560 · 1 comment


SCA-ZMT commented Apr 4, 2024

Originally discussed in #5560

Glossary

Definitions

High availability service

  • never crashes (sic),
  • when a service restarts on a single machine (e.g. the master or the last machine standing), the replacement instance first starts and becomes ready, and only then does the replaced instance shut down (~= continuous deployment),
  • redundant (# service instances > 1) and distributed across different machines (the failure of 1 instance does not fail the system),
  • in case of unexpected downtime, Ops shall detect it before users do

Scalable service

  • able to run in multiple service instances without breaking functionality,
  • ideally shares the load between the service instances

Restartable service

  • the new service can start without breaking itself or the service it replaces (it runs in parallel with the to-be-replaced service for a short time)

Resumable service

  • the replacing service restarts all the tasks the replaced service was doing (this implies saving the current state before crashing or switching off),
  • or the service client identifies that the service restarted and restarts the lost tasks (this implies detecting that the service was removed, and retrying)
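The second option (the client notices the restart and resubmits) can be sketched as follows. This is a minimal illustration, not oSparc code: `Service`, `Client`, `instance_id` and `reconcile` are hypothetical names, and the sketch assumes each service instance exposes a unique identifier that changes on every restart.

```python
import uuid


class Service:
    """Hypothetical service instance: tasks live in memory and are lost on restart."""

    def __init__(self):
        self.instance_id = uuid.uuid4().hex  # assumption: changes on every restart
        self.tasks = {}

    def submit(self, name):
        self.tasks[name] = "running"
        return self.instance_id


class Client:
    """Hypothetical client that remembers which instance accepted each task."""

    def __init__(self):
        self.pending = {}  # task name -> id of the instance that accepted it

    def submit(self, service, name):
        self.pending[name] = service.submit(name)

    def reconcile(self, service):
        """Resubmit every task that was accepted by an instance that is gone."""
        lost = [n for n, iid in self.pending.items() if iid != service.instance_id]
        for name in lost:
            self.submit(service, name)
        return lost


old = Service()
client = Client()
client.submit(old, "copy-project")

new = Service()  # the replacement instance starts with no tasks
resubmitted = client.reconcile(new)
```

After `reconcile`, the replacement instance holds the task again. Resubmitting like this is only safe when the task can be run twice without harm.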

Communication among services in oSparc

REST API requests

A service calls a REST API entrypoint of another service, which returns a direct response.

  • a REST call has a timeout of X seconds, anything longer fails
  • no option to get the progress of a request
  • a request can fail due to network failure
  • if the server restarts or crashes, the request fails

oSparc REST Long running task

A long running task is a task that typically takes a long time to complete (for example copying S3 files - project copy) and follows the procedure:

  1. POST /tasks --> starts a long task, returns the task ID
  2. GET /tasks/{id}/status --> gets the status of a task, returns whether the task is running and its progress.
  3. GET /tasks/{id}/result --> gets the result of a task
  • all requests are short (a few ms)
  • the status request returns the task progress
  • a request can fail due to network failure
  • if the server restarts or crashes, the request fails
  • if the server has multiple instances, the client must always talk to the same instance in order to get the status
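The three-step procedure above can be sketched with an in-memory stand-in for one service instance. `TaskServer`, `run_to_completion` and the way progress advances on each poll are assumptions made for illustration; a real client would issue HTTP requests and sleep between polls.

```python
import itertools


class TaskServer:
    """In-memory stand-in for ONE service instance exposing the three endpoints."""

    def __init__(self, steps=3):
        self._ids = itertools.count(1)
        self._tasks = {}  # task id -> completed steps, kept in this instance only
        self._steps = steps

    def post_task(self):  # POST /tasks
        task_id = next(self._ids)
        self._tasks[task_id] = 0
        return task_id

    def get_status(self, task_id):  # GET /tasks/{id}/status
        done = self._tasks[task_id]  # raises KeyError for an unknown task
        if done < self._steps:
            done += 1  # simulate work happening between two polls
            self._tasks[task_id] = done
        return {"running": done < self._steps, "progress": done / self._steps}

    def get_result(self, task_id):  # GET /tasks/{id}/result
        if self._tasks[task_id] < self._steps:
            raise RuntimeError("task is still running")
        return f"task {task_id} finished"


def run_to_completion(server, max_polls=100):
    """Start a task, poll its status until it stops running, then fetch the result."""
    task_id = server.post_task()
    for _ in range(max_polls):
        if not server.get_status(task_id)["running"]:
            return server.get_result(task_id)
    raise TimeoutError("task did not finish in time")


result = run_to_completion(TaskServer())
```

Because the task table lives inside one instance, a second `TaskServer()` raises `KeyError` when asked for the status of a task created on the first one, which is the instance-affinity limitation listed above.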

RPC call through Message broker (for example RabbitMQ)

A service calls a function signature through the message broker, which delivers the message to another service, which will eventually respond.

  • the caller is agnostic to the callee, it only needs to know the function signature
  • the broker can be configured to retry distributing the task if the service is unavailable, restarted or crashed
  • issue if the broker is overloaded
  • no option to get the progress of a request
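The two key properties (the caller only knows a function name, and the broker retries when an instance is unavailable) can be illustrated with a toy in-process broker. This is not how RabbitMQ works internally and not the oSparc implementation; `Broker`, `register`, `call` and `max_attempts` are hypothetical names, and a crashed instance is simulated by raising `ConnectionError`.

```python
class Broker:
    """Toy in-process broker: maps a function name to the handlers registered for it."""

    def __init__(self):
        self._handlers = {}

    def register(self, name, handler):
        self._handlers.setdefault(name, []).append(handler)

    def call(self, name, *args, max_attempts=3):
        """Deliver the call, moving on to the next handler when one is unreachable."""
        handlers = self._handlers.get(name, [])
        if not handlers:
            raise LookupError(f"no service registered for {name!r}")
        last_error = None
        for attempt in range(max_attempts):
            handler = handlers[attempt % len(handlers)]
            try:
                return handler(*args)
            except ConnectionError as err:  # simulated crash or restart
                last_error = err
        raise last_error


def crashed_instance(x):
    raise ConnectionError("instance down")


def healthy_instance(x):
    return x * 2


broker = Broker()
broker.register("double", crashed_instance)
broker.register("double", healthy_instance)
answer = broker.call("double", 21)  # first attempt fails, second succeeds
```

The caller never references `healthy_instance` directly; it only knows the name `"double"`.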

RPC long running task (not yet implemented)

A long running task would be something along these lines:

  1. RPC: create task --> start a long running task, returns the task ID
  2. RPC: get task status(ID) --> returns the task status, its progress
  3. RPC: get task result(ID) --> returns the task result
    (Note that this is not the real implementation, it could be a python generator, a celery task or anything else)
  • the caller is agnostic to the callee, it only needs to know the function signature
  • the broker can be configured to retry distributing the task if the service is not available or restarted or crashed
  • issue if the broker is overloaded
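Since the note above mentions that the task could be a Python generator, here is one possible shape, purely as an assumption about what such a task might look like: the generator yields its progress after each unit of work and returns the final result. `copy_files` and `drive` are hypothetical names and no real copying takes place.

```python
def copy_files(files):
    """Hypothetical long running task: yields progress, returns the result."""
    copied = []
    for count, name in enumerate(files, start=1):
        copied.append(name)  # placeholder for the real work on one file
        yield count / len(files)
    return copied


def drive(task):
    """Run a generator task to the end, collecting progress values and the result."""
    progress = []
    while True:
        try:
            progress.append(next(task))
        except StopIteration as stop:
            return progress, stop.value


progress, result = drive(copy_files(["a.bin", "b.bin"]))
```

In an RPC setting, each `next()` call would correspond to the work done between two "get task status" requests, and `stop.value` to the payload of "get task result".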

Current oSparc issues

non-scalable services

storage

long running tasks:

  • POST /v0/simcore-s3/folders - copy folders of a project

background task:

  • multipart upload garbage collection (cleans DB and/or S3) - does not prevent restarting/scaling as it relies on DB but generates unwanted traffic

director-v2

background tasks:

  • dynamic scheduler - prevents restarting:
    • bugs (some of them unknown) and no tests that guarantee it can restart
    • the steps to start/stop a service are: 1) too many, and each does too much, 2) not designed to be restarted
    • the internal state of a service is "saved" only once all steps are done processing (which makes it useless in most cases) -> director-v2 can be restarted ONLY once no more services are starting or stopping!
    • the current code that starts and stops services is hard to change: modifications are very time consuming and very likely to break something -> it requires a rewrite to become usable.
  • computational scheduler - unsure, might work but will generate unwanted additional traffic
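One way to address the "state is saved only when all steps are done" problem is to record progress after every step, so that a restarted scheduler picks up where the previous one stopped. The sketch below is an assumption about a possible design, not the director-v2 code: `run_steps`, the `store` dictionary (standing in for a DB or other persistent store) and the step names are all hypothetical.

```python
def run_steps(steps, store, key, fail_at=None):
    """Run `steps` in order, recording in `store[key]` how many have completed.

    `fail_at` simulates a crash right before the step with that index.
    """
    start = store.get(key, 0)  # resume after the last recorded step
    for index in range(start, len(steps)):
        if index == fail_at:
            raise RuntimeError("simulated crash")
        steps[index]()
        store[key] = index + 1  # checkpoint after EVERY step, not only at the end
    return store.get(key, start)


log = []
steps = [
    lambda: log.append("step-a"),
    lambda: log.append("step-b"),
    lambda: log.append("step-c"),
]
store = {}  # stand-in for persistent storage that survives a restart

try:
    run_steps(steps, store, "service-1", fail_at=2)  # crashes before step-c
except RuntimeError:
    pass

completed = run_steps(steps, store, "service-1")  # the "restarted" scheduler
```

The second call runs only `step-c`. A crash between running a step and writing the checkpoint would still re-run that step, so each step needs to be safe to repeat.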

Tasks

  1. 1 of 7
    a:storage
    GitHK matusdrobuliak66
    sanderegg
  2. bisgaard-itis
  3. 2 of 3
    a:director-v2 t:maintenance
    GitHK sanderegg