S-D25.4 Simulation framework SCHEDULER Y4M05 #349

Open
4 of 8 tasks
KZzizzle opened this issue Oct 6, 2020 · 12 comments

KZzizzle (Contributor) commented Oct 6, 2020

  • Intelligent scheduler and queueing, including for HPC

keep in mind

Definition of Done:

  • A service can provide information about the resources it requires (estimated RAM, need for a certain kind of GPU, number of cores, perhaps even estimates of runtime; think s4l EM simulations). The service is then dispatched accordingly.
  • Dependencies are automatically converted into optimized execution orders, with parallel execution where feasible.
  • Services that declare (see above) that they support MPI can be executed across multiple docker instances with a shared MPI framework. For that, they are dispatched to suitably ‘close’ machines.
  • Progress of queued/completed jobs can be inspected (queue manager) and tasks can be removed from the execution queue, or their relative priority adapted (less urgent).
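
As an illustration of the second point above, here is a minimal sketch of turning declared dependencies into an execution order with parallel stages. It uses Python's standard `graphlib` module and is not the oSparc implementation; the function name `parallel_stages` and the task names in `pipeline` are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group tasks into ordered stages.

    `dependencies` maps each task to the set of tasks it depends on.
    All tasks in one stage are independent of each other and could
    run in parallel once the previous stage has finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the pipeline has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)  # mark the whole stage as finished
    return stages


# Hypothetical pipeline: the solver needs both the mesh and the material map.
pipeline = {
    "mesh": set(),
    "material_map": set(),
    "em_solver": {"mesh", "material_map"},
    "postprocess": {"em_solver"},
}
stages = parallel_stages(pipeline)
```

Here `stages` holds three stages: `mesh` and `material_map` can run side by side, then `em_solver`, then `postprocess`. A real scheduler would release tasks as soon as their own dependencies complete rather than waiting for a whole stage.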

User story:

Beyond done:

  • The necessary HW requirement information is part of a service’s meta-data, but can also be inspected by users (e.g., in the info section of the service)
  • Users with the necessary permissions can edit some of these settings (e.g., number of parallel instances in an MPI-parallelized service)
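
To make the idea of hardware requirements as service meta-data more concrete, here is a small sketch of how declared requirements could be matched against available machines. All names (`Resources`, `pick_machine`, the machine labels) are assumptions for illustration and do not reflect the actual oSparc meta-data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource description, used for both services and machines."""
    cpus: int
    ram_gb: float
    gpus: int = 0

    def fits_on(self, machine: "Resources") -> bool:
        # A service fits if the machine offers at least what it asks for.
        return (
            self.cpus <= machine.cpus
            and self.ram_gb <= machine.ram_gb
            and self.gpus <= machine.gpus
        )


def pick_machine(required, machines):
    """Return the name of the smallest machine that satisfies `required`, or None."""
    candidates = [
        (spec.cpus, spec.ram_gb, name)
        for name, spec in machines.items()
        if required.fits_on(spec)
    ]
    # Smallest by CPU count, then RAM, so large machines stay free for large jobs.
    return min(candidates)[2] if candidates else None


machines = {
    "small": Resources(cpus=4, ram_gb=16),
    "gpu": Resources(cpus=16, ram_gb=64, gpus=1),
    "big": Resources(cpus=32, ram_gb=128),
}
```

With these example machines, a service asking for 2 cores and 8 GB would land on `small`, while one that needs a GPU can only go to `gpu`. Picking the smallest fitting machine is just one possible policy.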

MVP for closing (25.4 and 25.5)

  • finish computational and dynamic sidecars
  • dask gateways (incl. for AWS)
  • ability of adding additional HW (incl. AWS; incl. frontend)
KZzizzle added the PO issue Created by Product owners label on Oct 6, 2020
KZzizzle changed the title from "S-D25 .4 Simulation Framework Y4M05" to "S-D25 .4 Simulation Framework scheduler Y4M05" on Oct 6, 2020
pcrespov changed the title from "S-D25 .4 Simulation Framework scheduler Y4M05" to "S-D25 .4 Simulation framework **SCHEDULER** Y4M05" on Dec 1, 2020
pcrespov changed the title from "S-D25 .4 Simulation framework **SCHEDULER** Y4M05" to "S-D25 .4 Simulation framework SCHEDULER Y4M05" on Dec 1, 2020
pcrespov added the Epic Zenhub label (Pleas do not modify) label on Dec 1, 2020
sanderegg (Member) commented Feb 28, 2021

status as of 03.01.2021

Done:

  • Computational states (e.g. modified, waiting for dependencies) are now retrieved per node, in both backend and frontend

Ongoing:

  • Bugfixing on concurrent updates of nodes after running in a pipeline (when several sidecars are used on one pipeline)
  • Moving scheduling, currently done by the sidecar itself, to director-v2 --> simplifies the architecture and reduces possible issues, making room for the open enhancements/issues
  • Creation of a dynamic-service sidecar, to allow for improved security and better handling of dynamic services

Open:

  • ensure that sidecars do not access the DB directly, to remove the DB max-connection problem
  • investigate using Airflow/Prefect.io for finer-grained dispatching of computational pipelines (would also bring a UI to better follow individual tasks)
  • create a Computational Resource Manager micro-service (CRMs) that would monitor the computational queues (load) and start/stop machines and sidecars to cope with variable computational load.
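
The scaling decision such a resource manager would have to make can be sketched roughly as follows. This is only an assumption about how queue load might map to a worker count; the function names, the `tasks_per_worker` parameter, and the default limits are hypothetical and not part of any existing service.

```python
import math


def desired_workers(queued_tasks, running_tasks, tasks_per_worker,
                    min_workers=0, max_workers=10):
    """Estimate how many workers (sidecars/machines) the current load calls for."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    needed = math.ceil((queued_tasks + running_tasks) / tasks_per_worker)
    # Clamp to the allowed range so the cluster never scales without bound.
    return max(min_workers, min(max_workers, needed))


def scaling_action(current_workers, target_workers):
    """Translate the difference between current and target into an action."""
    delta = target_workers - current_workers
    if delta > 0:
        return ("start", delta)
    if delta < 0:
        return ("stop", -delta)
    return ("hold", 0)
```

For example, with 7 queued and 1 running task at 2 tasks per worker, `desired_workers(7, 1, 2)` asks for 4 workers, and `scaling_action(2, 4)` then suggests starting 2 more. A production version would also need cool-down periods to avoid starting and stopping machines in quick succession.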

sanderegg (Member) commented Mar 25, 2021

Update on sprint Red Panda

Done:

  • Bugfixing on concurrent updates of nodes after running in a pipeline (when several sidecars are used on one pipeline) #2174 --> will be deployed to production next sprint
  • Planning of scheduler improvements/requirements
  • Moving scheduling, currently done by the sidecar itself, to director-v2 --> simplifies the architecture and reduces possible issues, making room for the open enhancements/issues #2164 --> will be deployed to master next sprint (was waiting for maintenance cases to go in first)

Ongoing:

  • Evaluation of Apache Airflow / Prefect.io as workflow management tools #1992
  • Creation of a dynamic-service sidecar, to allow for improved security and better handling of dynamic services #1887

Open:

  • ensure that sidecars do not access the DB directly, to remove the DB max-connection problem #2239
  • create a Computational Resource Manager micro-service (CRMs) that would monitor the computational queues (load) and start/stop machines and sidecars to cope with variable computational load. #2240

pcrespov (Member) commented Jul 4, 2021

Update on sprint Marmoset

Done:

Ongoing:

Open:

  • ensure that sidecars do not access the DB directly, to remove the DB max-connection problem #2239
  • create a Computational Resource Manager micro-service (CRMs) that would monitor the computational queues (load) and start/stop machines and sidecars to cope with variable computational load. #2240

pcrespov changed the title from "S-D25 .4 Simulation framework SCHEDULER Y4M05" to "S-D25.4 Simulation framework SCHEDULER Y4M05" on Aug 12, 2021
sanderegg (Member) commented Sep 10, 2021

Update on sprint Chevrotain

Done

  • Connect director-v2 to dask-based backend #2418
  • Allow dask-sidecar to run multiple services in separate threads #2486, #2487
  • Added facilities to create, modify, remove clusters and share them among groups of users #2517, #2502, #2499
  • Dask-sidecar to advertise clusterID if available on start #2505
  • Improving resiliency of Dask-based backend #2514
  • Bugfixes on dask-based backend #2512, #2508, #2503
  • Tested connection with AWS machines through docker swarm
  • Remove celery sidecars #2528

Ongoing

  • Refactor computational sidecar to remove dependencies on RabbitMQ/PostgresDB #2530
  • Testing with Dask gateway on AWS cluster
  • Adding nodeports to dynamic-sidecar #2509, #2516; extending existing validation image for nodeports support #138

sanderegg (Member) commented Oct 6, 2021

Update on sprint Capra delle nevi

Done

  • Adding nodeports to dynamic-sidecar #2509, #2516; extending existing validation image for nodeports support #138
  • Running sidecar through Dask gateway POC

Ongoing

  • Refactor computational sidecar to remove dependencies on RabbitMQ/PostgresDB #2530
  • Connect director-v2 with dask-gateway(s) #2576

Open

  • Setup Dask gateway on AWS cluster (easy setup, autoscale)
  • Reliability, UI, register clusters

sanderegg (Member) commented Nov 4, 2021

Update on sprint Anti-PER

Done

  • Switching the dynamic-sidecar proxy from Traefik to Caddy for faster startup #2597
  • Allow node_ports package to create and get presigned links in storage for delayed upload and downloads by the dask-sidecar micro-service #2605

Ongoing

  • Refactor computational sidecar to remove dependencies on RabbitMQ/PostgresDB #2530 (99% completed)
  • Connect director-v2 with dask-gateway(s) #2576
  • Setup Dask gateway on AWS cluster (easy setup, autoscale)
  • Reliability, UI, register clusters

Open

sanderegg (Member) commented Jan 11, 2022

Update on sprint Meerkat

Done

  • new repository osparc-dask-gateway: a Dask-gateway with an oSparc backend that creates a dask-sidecar for each worker (a user owns a cluster, deploys the osparc-dask-gateway, and registers the gateway in oSparc)
  • oSparc:
    • a user can register a cluster by providing an entrypoint and user/password
    • a user may share the registered cluster with other users/groups
    • keeps a pool of connections to the internal cluster and any number of external osparc-dask-gateways
    • sends pipeline of computational tasks to any cluster
    • receives logs, results from any cluster
  • scaling:
    • a user can manually add/remove machines to/from an already running cluster
    • dask-gateway will use the machines accordingly

Ongoing

  • osparc-dask-gateway: SSL entrypoint, authentications, simplify deployment
  • oSparc:
    • improve UI
    • test connection with cluster (provides feedback in case of issue/loss)
    • show cluster capabilities, running tasks
    • handle closed-source services
  • scaling:
    • provide auto-scaling on AWS

sanderegg (Member) commented Feb 23, 2022

Update on sprint R. Schumann

Done

  • investigated persistence in dask-scheduler and how the system reacts to restarts of director-v2, dask-scheduler, or dask-sidecar

Bug fixes

Changed

Open issue / ongoing

sanderegg (Member) commented Apr 29, 2022

Update on sprint Macarons

Done

Cluster used for running project is persisted and retrieved:

Fixed on cluster preferences:

Retrieve cluster used vs available resources:

Open issue / ongoing

7 participants