Architecture: Make services restartable #5614

SCA-ZMT · 2024-04-04T14:03:08Z

Originally discussed in #5560

Glossary

Definitions

High availability service

never crashes (sic),
if the service is restarted on a single machine (e.g. the master or the last machine standing), the replacement instance first starts until it is ready, and only then does the replaced instance shut down (~= continuous deployment),
redundant (# service instances > 1) and distributed over different machines (one failing instance does not fail the system),
in case of unexpected downtime, OPs shall detect it before users do

Scalable service

able to run in multiple service instances without breaking functionality,
ideally shares the load between the service instances

Restartable service

the new service can start without breaking itself or the service it replaces (it is running in parallel with the to-be-replaced service for a short amount of time)

Resumable service

the replacing service restarts all the tasks the replaced service was doing (this implies saving its current state before crashing or switching off),
or the service client identifies that the service restarted and restarts the lost tasks (this implies detecting that the service was removed, and retrying)
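
The second option (the client notices the restart and resubmits) can be sketched as follows. This is a minimal sketch, not oSparc code: `ForgetfulService`, `TaskLost` and `run_with_resubmit` are hypothetical names, and the service is an in-memory stand-in that loses all of its tasks on restart.

```python
import uuid
from typing import Callable, Optional


class TaskLost(Exception):
    """Hypothetical error: the service does not know the task ID (it restarted)."""


class ForgetfulService:
    """Hypothetical in-memory service that keeps tasks only in process memory."""

    def __init__(self) -> None:
        self._tasks: dict[str, str] = {}

    def start(self, payload: str) -> str:
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = payload
        return task_id

    def restart(self) -> None:
        # A restart without saved state drops every running task.
        self._tasks.clear()

    def result(self, task_id: str) -> str:
        if task_id not in self._tasks:
            raise TaskLost(task_id)
        return self._tasks[task_id].upper()


def run_with_resubmit(
    service: ForgetfulService,
    payload: str,
    max_attempts: int = 3,
    after_submit: Optional[Callable[[int], None]] = None,
) -> tuple[str, int]:
    """Submit a task and resubmit it whenever the service reports it as lost.

    `after_submit` is a hook used only to simulate a restart in examples.
    Returns the result and the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        task_id = service.start(payload)
        if after_submit is not None:
            after_submit(attempt)
        try:
            return service.result(task_id), attempt
        except TaskLost:
            continue  # the task vanished with the old instance: start it again
    raise RuntimeError(f"task still lost after {max_attempts} attempts")
```

Resubmitting like this is only safe when the task can be run twice without harm, so the sketch assumes idempotent tasks.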

Communication among services in oSparc

REST API requests

a service calls a REST API entrypoint of another service which returns a direct response

a REST call has a timeout of X seconds; anything longer fails
there is no way to get the progress of a request
a request can fail due to a network failure
if the server restarts or crashes, the request fails

oSparc REST Long running task

A long running task is a task that typically takes a long time to complete (for example copying S3 files - project copy) and follows the procedure:

POST /tasks --> starts a long task, returns the task ID
GET /tasks/{id}/status --> gets the status of a task, returns whether the task is running and its progress.
GET /tasks/{id}/result --> gets the result of a task

all requests are short (a few ms)
the status request returns the task progress
a request can fail due to a network failure
if the server restarts or crashes, the request fails
if the server has multiple instances, the client must always talk to the same instance in order to get the status
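
The procedure above can be sketched from the client side. This is a minimal sketch under stated assumptions: `FakeTaskServer` is a hypothetical in-memory stand-in for one server instance (no HTTP involved), and the method names only mirror the three endpoints; none of this is the actual oSparc implementation.

```python
import itertools
import time


class FakeTaskServer:
    """Hypothetical stand-in for ONE server instance; task state lives in memory."""

    def __init__(self, steps: int = 3) -> None:
        self._ids = itertools.count(1)
        self._steps = steps
        self._done: dict[int, int] = {}

    def post_task(self) -> int:
        # POST /tasks: start the task and return its ID
        task_id = next(self._ids)
        self._done[task_id] = 0
        return task_id

    def get_status(self, task_id: int) -> dict:
        # GET /tasks/{id}/status: in this fake, each poll advances the work by one step
        done = min(self._done[task_id] + 1, self._steps)
        self._done[task_id] = done
        return {"running": done < self._steps, "progress": done / self._steps}

    def get_result(self, task_id: int) -> str:
        # GET /tasks/{id}/result
        if self._done[task_id] < self._steps:
            raise RuntimeError("task not finished")
        return f"result-of-{task_id}"


def run_long_task(server: FakeTaskServer, poll_interval: float = 0.0, max_polls: int = 100) -> str:
    """Start a task, poll its status until it stops running, then fetch the result."""
    task_id = server.post_task()
    for _ in range(max_polls):
        if not server.get_status(task_id)["running"]:
            return server.get_result(task_id)
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```

Because the task state lives inside one instance, asking a second `FakeTaskServer()` for the status of a task created on the first raises `KeyError`, which is the multi-instance limitation listed above.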

RPC call through Message broker (for example RabbitMQ)

a service calls a function signature in the message broker, which delivers the message to another service which will eventually respond

the caller is agnostic to the callee; it only needs to know the function signature
the broker can be configured to retry delivering the task if the service is unavailable, restarted or crashed
problematic if the broker is overloaded
there is no way to get the progress of a request
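
The first two properties can be illustrated without a real broker. The sketch below is an assumption-laden toy: `InMemoryBroker` is a hypothetical class, not RabbitMQ and not oSparc code, and a crashed instance is simulated by a handler that raises `ConnectionError`.

```python
from collections import defaultdict
from typing import Any, Callable


class InMemoryBroker:
    """Hypothetical broker: routes a call by signature name to any live instance."""

    def __init__(self) -> None:
        self._instances: dict[str, list[Callable[..., Any]]] = defaultdict(list)

    def register(self, signature: str, handler: Callable[..., Any]) -> None:
        # Several service instances may register under the same signature.
        self._instances[signature].append(handler)

    def call(self, signature: str, *args: Any, **kwargs: Any) -> Any:
        last_error = None
        for handler in list(self._instances[signature]):
            try:
                return handler(*args, **kwargs)
            except ConnectionError as err:
                # This instance is gone: redeliver the message to the next one.
                last_error = err
        raise RuntimeError(f"no instance available for {signature!r}") from last_error


def crashed_instance(a: int, b: int) -> int:
    raise ConnectionError("instance crashed")


def healthy_instance(a: int, b: int) -> int:
    return a + b
```

The caller only uses the name `"add"` and the arguments; it never learns which instance answered or that one delivery failed.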

RPC long running task (not yet implemented)

A long running task would be something along these lines:

RPC: create task --> start a long running task, returns the task ID
RPC: get task status(ID) --> returns the task status, its progress
RPC: get task result(ID) --> returns the task result
(Note that this is not the real implementation; it could be a Python generator, a Celery task or anything else)

the caller is agnostic to the callee; it only needs to know the function signature
the broker can be configured to retry delivering the task if the service is unavailable, restarted or crashed
problematic if the broker is overloaded
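
Since this is not implemented yet, the following is only one possible shape, using the Python generator option mentioned above. `TaskRegistry`, its method names and `copy_files` are hypothetical; the generator yields its progress and returns the result, and for simplicity each status call advances the work by one step instead of running it in the background.

```python
import uuid
from typing import Any, Callable, Generator

ProgressGenerator = Generator[float, None, Any]


class TaskRegistry:
    """Hypothetical callee side of the three RPC functions."""

    def __init__(self) -> None:
        self._running: dict[str, ProgressGenerator] = {}
        self._progress: dict[str, float] = {}
        self._results: dict[str, Any] = {}

    def create_task(self, factory: Callable[..., ProgressGenerator], *args: Any) -> str:
        task_id = uuid.uuid4().hex
        self._running[task_id] = factory(*args)
        self._progress[task_id] = 0.0
        return task_id

    def get_task_status(self, task_id: str) -> dict:
        generator = self._running.get(task_id)
        if generator is not None:
            try:
                # Simplification: polling drives the generator one step forward.
                self._progress[task_id] = next(generator)
            except StopIteration as stop:
                # The generator's return value is the task result.
                self._results[task_id] = stop.value
                self._progress[task_id] = 1.0
                del self._running[task_id]
        return {"done": task_id in self._results, "progress": self._progress[task_id]}

    def get_task_result(self, task_id: str) -> Any:
        return self._results[task_id]


def copy_files(names: list[str]) -> ProgressGenerator:
    """Toy long running task: 'copies' one file per step and reports progress."""
    copied = []
    for index, name in enumerate(names, start=1):
        copied.append(name)
        yield index / len(names)
    return copied
```

A caller would invoke `create_task`, poll `get_task_status` until `done` is true, then call `get_task_result`. The registry still keeps its state in memory, so it would need shared storage to survive a restart.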

Current oSparc issues

non-scalable services

storage

long running tasks:

POST /v0/simcore-s3/folders - copy folders of a project

background task:

multipart upload garbage collection (cleans DB and/or S3) - does not prevent restarting/scaling as it relies on DB but generates unwanted traffic

director-v2

background tasks:

dynamic scheduler - prevents restarting:
- bugs (some of them unknown) and no tests that guarantee it can restart
- the list of steps to start/stop a service is: 1) too long, and the steps do too much 2) not designed to be restarted
- the internal state of a service is "saved" only when all steps are done processing (which makes it useless in most cases) -> director-v2 can be restarted ONLY ONCE no more services are starting or stopping!
- the current code that starts and stops services is written so poorly that changing anything is very time consuming and very likely to fail -> requires a rewrite to make it usable.
computational scheduler - unsure, might work but will generate unwanted additional traffic
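
To make the state-saving point concrete, here is a minimal sketch of persisting progress after every step rather than only at the end. It is not the director-v2 scheduler: `run_steps` is a hypothetical helper, and the `store` mapping stands in for whatever shared storage (e.g. a database) would hold the state.

```python
from typing import Callable, MutableMapping, Sequence

Step = Callable[[], None]


def run_steps(service_id: str, steps: Sequence[Step], store: MutableMapping[str, int]) -> int:
    """Run the remaining steps for a service, checkpointing after each one.

    Returns the index of the first step executed by this call, so a caller
    can see whether it started from scratch (0) or resumed.
    """
    first = store.get(service_id, 0)
    for index in range(first, len(steps)):
        steps[index]()
        # Checkpoint immediately: a replacement instance resumes from here.
        store[service_id] = index + 1
    return first
```

If the process dies in the middle of a step, that step runs again after the restart, so this approach assumes small, idempotent steps.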

Tasks

Make storage scalable/restartable #5621 (1 of 7 tasks, a:storage)
Create RPC interface for long running tasks #5634
Make director-v2 scalable/restartable #4524 (2 of 3 tasks, a:director-v2 t:maintenance)
REST client shall identify when a service disappeared and restart tasks

mrnicegyu11 · 2024-09-23T08:17:47Z

ITISFoundation/osparc-issues#1638

SCA-ZMT mentioned this issue Apr 4, 2024

Maintenance / Dev Issues ITISFoundation/osparc-issues#1328

Open

SCA-ZMT assigned GitHK, sanderegg and matusdrobuliak66 Apr 4, 2024

SCA-ZMT mentioned this issue Apr 4, 2024

Performance Improvements for Large Projects ITISFoundation/osparc-issues#1327

Open

sanderegg assigned bisgaard-itis and YuryHrytsuk Apr 5, 2024

sanderegg added this to the Enchanted Odyssey milestone Apr 8, 2024

matusdrobuliak66 mentioned this issue Apr 10, 2024

♻️ refactoring dsm cleaner storage background task #5653

Merged

1 task

YuryHrytsuk removed their assignment Apr 25, 2024

sanderegg modified the milestones: Enchanted Odyssey, The Next One May 6, 2024

sanderegg modified the milestones: Leeroy Jenkins, South Island Iced Tea Jun 7, 2024

sanderegg modified the milestones: South Island Iced Tea, Tom Bombadil Jul 8, 2024

sanderegg modified the milestones: Tom Bombadil, Eisbock Aug 13, 2024

This was referenced Aug 20, 2024

Maintenance: All simcore service should be scalable #4452

Closed

Zero downtime: Connected users should be able to continue working with osparc when osparc micro-services are restarted #2212

Closed

sanderegg modified the milestones: Eisbock, Doppelbock Sep 13, 2024

sanderegg modified the milestones: MartinKippenberger, Event Horizon Nov 29, 2024

sanderegg modified the milestones: Event Horizon, Singularity Jan 17, 2025
