High availability service
when a service restarts on a single machine (e.g. the master or the last machine standing), the replacement instance first starts and becomes ready, and only then does the replaced instance shut down (~= continuous deployment),
redundant (# service instances > 1), with instances distributed across different machines (the failure of one instance does not bring down the system),
in case of unexpected downtime, OPs shall detect it before users do
Scalable service
able to run in multiple service instances without breaking functionality,
ideally shares the load between the service instances
Restartable service
the new service can start without breaking itself or the service it replaces (it runs in parallel with the to-be-replaced service for a short time)
Resumable service
the replacing service restarts all the tasks the replaced service was doing (this implies that the replaced service saves its current state before crashing or switching off),
or the service client detects that the service restarted and restarts the lost tasks (this implies detecting that the service was removed, and retrying)
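The first option above (saving state so that a replacement can pick up the work) can be sketched as follows. This is only an illustration, not oSparc code: `ResumableWorker`, the JSON state file and the `crash_after` switch are all hypothetical, and a real service would more likely persist its state in a database.

```python
import json
from pathlib import Path
from typing import List, Optional


class ResumableWorker:
    """Hypothetical worker that persists its progress after every item."""

    def __init__(self, state_file: Path):
        self.state_file = state_file

    def _load_done(self) -> List[str]:
        # State saved by a previous (possibly crashed) instance, if any.
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return []

    def run(self, items: List[str], crash_after: Optional[int] = None) -> List[str]:
        done = self._load_done()
        processed_now = 0
        for item in items:
            if item in done:
                continue  # already handled before the restart
            if crash_after is not None and processed_now >= crash_after:
                raise RuntimeError("simulated crash")
            done.append(item)  # the actual work would happen here
            # Save state after each item, so a crash loses at most one item.
            self.state_file.write_text(json.dumps(done))
            processed_now += 1
        return done
```

A second `ResumableWorker` pointed at the same state file skips the items the first one completed before it crashed. The state is saved after every step rather than at the end, which is what makes a restart in the middle of the work possible.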
Communication among services in oSparc
REST API requests
a service calls a REST API entrypoint of another service which returns a direct response
a REST call has a timeout of X seconds, anything longer fails
no option to get a request progress
a request can fail due to network failure
if the server is restarted or crashes, the request fails
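Because a plain REST request can fail on a timeout, a network failure or a server restart, the caller usually has to retry. A minimal sketch of such a retry wrapper, assuming the request is idempotent (retrying a non-idempotent request could apply it twice); `call_with_retry` and its parameters are hypothetical names, and `request` stands in for whatever HTTP client call is actually used:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(request: Callable[[], T], attempts: int = 3, backoff_s: float = 0.1) -> T:
    """Call `request`, retrying on connection errors and timeouts."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return request()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Exponential backoff before the next attempt.
                time.sleep(backoff_s * (2 ** attempt))
    assert last_error is not None
    raise last_error
```

The wrapper only hides short outages. It still gives no progress information, and a request that needs longer than the timeout fails on every attempt.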
oSparc REST Long running task
A long running task is a task that typically takes a long time to complete (for example copying S3 files - project copy) and follows the procedure:
POST /tasks --> starts a long task, returns the task ID
GET /tasks/{id}/status --> gets the status of a task, returns whether the task is running and its progress.
GET /tasks/{id}/result --> gets the result of a task
all requests are short (a few ms)
returns request progress
a request can fail due to network failure
if the server is restarted or crashes, the request fails
if the server has multiple instances, the client must always talk to the same instance in order to get the status
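The start/status/result procedure, and the reason the client is tied to one instance, can be illustrated with an in-memory sketch. `TaskServer` and its method names are hypothetical and no HTTP is involved: each method stands in for one of the three endpoints, and `work` stands in for the background processing.

```python
import uuid


class TaskServer:
    """One service instance; task state lives only in this instance's memory."""

    def __init__(self):
        self._tasks = {}

    def post_task(self, total_steps: int) -> str:
        # POST /tasks: register the task and return its ID immediately.
        if total_steps < 1:
            raise ValueError("total_steps must be >= 1")
        task_id = str(uuid.uuid4())
        self._tasks[task_id] = {"done": 0, "total": total_steps}
        return task_id

    def work(self, task_id: str) -> None:
        # Stand-in for the background work: advance the task by one step.
        task = self._tasks[task_id]
        task["done"] = min(task["done"] + 1, task["total"])

    def get_status(self, task_id: str) -> dict:
        # GET /tasks/{id}/status: a KeyError here plays the role of a 404.
        task = self._tasks[task_id]
        return {
            "running": task["done"] < task["total"],
            "progress": task["done"] / task["total"],
        }

    def get_result(self, task_id: str) -> str:
        # GET /tasks/{id}/result: only available once the task has finished.
        task = self._tasks[task_id]
        if task["done"] < task["total"]:
            raise RuntimeError("task not finished")
        return f"finished {task['total']} steps"
```

With two instances `a = TaskServer()` and `b = TaskServer()`, a task created through `a.post_task(...)` is unknown to `b`, so `b.get_status(task_id)` raises `KeyError`. The same happens after a restart of `a`, since the dictionary is lost with the process.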
RPC call through Message broker (for example RabbitMQ)
a service calls a function signature in the message broker, which delivers the message to another service which will eventually respond
the caller is agnostic to the callee, it only needs to know the function signature
the broker can be configured to retry distributing the task if the service is not available or restarted or crashed
issue if the broker is overloaded
no option to get a request progress
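The properties listed above can be sketched with a toy in-memory broker. This is not how RabbitMQ is used or configured: `InMemoryBroker`, `register`, `call` and `max_deliveries` are hypothetical names, and a `ConnectionError` raised by a handler stands in for an instance that is unavailable, restarted or crashed.

```python
from typing import Callable, Dict, List, Optional


class InMemoryBroker:
    """Toy broker: routes a call by function name and redelivers on failure."""

    def __init__(self, max_deliveries: int = 3):
        if max_deliveries < 1:
            raise ValueError("max_deliveries must be >= 1")
        self.max_deliveries = max_deliveries
        self._handlers: Dict[str, List[Callable]] = {}

    def register(self, name: str, handler: Callable) -> None:
        # A service instance announces that it can serve `name`.
        self._handlers.setdefault(name, []).append(handler)

    def call(self, name: str, *args):
        # The caller only knows the function name, not which instance answers.
        handlers = self._handlers.get(name, [])
        if not handlers:
            raise LookupError(f"no service registered for {name!r}")
        last_error: Optional[Exception] = None
        for delivery in range(self.max_deliveries):
            handler = handlers[delivery % len(handlers)]
            try:
                return handler(*args)
            except ConnectionError as err:
                last_error = err  # instance unavailable: try the next delivery
        assert last_error is not None
        raise last_error
```

If one registered handler raises `ConnectionError` and another works, `call` returns the working handler's response without the caller noticing the failure. As in the list above, the caller gets a response or an error, but no progress in between.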
RPC long running task (not yet implemented)
A long running task would be something along these lines:
RPC: create task --> start a long running task, returns the task ID
RPC: get task status(ID) --> returns the task status, its progress
RPC: get task result(ID) --> returns the task result
(Note that this is not the real implementation; it could be a Python generator, a Celery task or anything else)
the caller is agnostic to the callee, it only needs to know the function signature
the broker can be configured to retry distributing the task if the service is not available or restarted or crashed
issue if the broker is overloaded
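Since this variant is not implemented yet, the following is only one possible shape, using the Python generator option mentioned in the note. All names (`copy_files`, `GeneratorTasks`, `step`) are hypothetical, the three public methods mirror the three RPC calls, and `step` stands in for the worker loop that would drive the task in the background.

```python
import itertools


def copy_files(files):
    """Hypothetical long-running job: yields progress in (0, 1], returns a summary."""
    for index, _name in enumerate(files, start=1):
        yield index / len(files)  # the real copy of one file would happen here
    return f"copied {len(files)} files"


class GeneratorTasks:
    def __init__(self):
        self._ids = itertools.count(1)
        self._tasks = {}

    def create_task(self, generator) -> int:
        # RPC: create task. Stores the generator and returns the task ID.
        task_id = next(self._ids)
        self._tasks[task_id] = {"gen": generator, "progress": 0.0, "done": False, "result": None}
        return task_id

    def step(self, task_id: int) -> None:
        # Stand-in for the worker loop: run the task up to its next progress report.
        task = self._tasks[task_id]
        if task["done"]:
            return
        try:
            task["progress"] = next(task["gen"])
        except StopIteration as stop:
            task["done"] = True
            task["result"] = stop.value  # the generator's return value

    def get_task_status(self, task_id: int) -> dict:
        # RPC: get task status(ID).
        task = self._tasks[task_id]
        return {"done": task["done"], "progress": task["progress"]}

    def get_task_result(self, task_id: int):
        # RPC: get task result(ID).
        task = self._tasks[task_id]
        if not task["done"]:
            raise RuntimeError("task not finished")
        return task["result"]
```

In this sketch the task state still lives in the memory of one process, so it would not survive a restart of that process. Keeping the state outside the service instance would be needed for the restart and retry properties listed above to hold.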
Current oSparc issues
non-scalable services
storage
long running tasks:
POST /v0/simcore-s3/folders - copy folders of a project
background task:
multipart upload garbage collection (cleans DB and/or S3) - does not prevent restarting/scaling as it relies on DB but generates unwanted traffic
director-v2
background tasks:
dynamic scheduler - prevents restarting:
bugs (some of them unknown) and no tests that guarantee it can restart
the steps to start/stop a service: 1) are too large and do too much, 2) were not designed to be restarted
the internal state of a service is "saved" only once all steps are done processing (which makes it useless in most cases) -> director-v2 can be restarted ONLY when no services are starting or stopping!
the current code that starts and stops services is so poorly written that any change is very time-consuming and very likely to fail -> it requires a rewrite to become usable.
computational scheduler - unsure, might work but will generate unwanted additional traffic
Originally discussed in #5560