🎨Computational backend: DV-2 computational scheduler becomes replicable (🗃️🚨) #6736

sanderegg
Member

@sanderegg sanderegg commented Nov 15, 2024

What do these changes do?

This PR heavily refactors the internal computational scheduler of the director-v2, using distributed locks and a shared RabbitMQ queue, so that multiple director-v2 replicas can share the load of scheduling computational pipelines.

As a reminder, the computational scheduler in the director-v2 is responsible for:

  1. analyzing a project's computational pipeline to find which nodes need to be (re-)run when the user starts a computation,
  2. scheduling computational jobs according to the computational pipeline (deciding what can run and what must wait, retrieving results/errors, and transmitting progress) by communicating with the shared Dask cluster or with private Dask clusters.
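The two responsibilities above can be illustrated with a minimal sketch. It is not the director-v2 implementation: the pipeline, the node names, and both helper functions are hypothetical, and a pipeline is assumed to be a plain mapping from each node to the set of nodes it depends on.

```python
# Hypothetical pipeline: each node maps to the set of nodes it depends on.
PIPELINE = {
    "load": set(),
    "solve": {"load"},
    "postprocess": {"solve"},
    "report": {"solve"},
}


def nodes_to_rerun(pipeline: dict[str, set[str]], outdated: set[str]) -> set[str]:
    """Return the outdated nodes plus every node downstream of them."""
    to_run = set(outdated)
    changed = True
    while changed:
        changed = False
        for node, deps in pipeline.items():
            # A node must re-run as soon as one of its dependencies re-runs.
            if node not in to_run and deps & to_run:
                to_run.add(node)
                changed = True
    return to_run


def runnable_nodes(pipeline: dict[str, set[str]], done: set[str]) -> list[str]:
    """Return the nodes that can run now: not done, and all dependencies done."""
    return sorted(
        node for node, deps in pipeline.items() if node not in done and deps <= done
    )
```

In this sketch, a change to `solve` marks `solve`, `postprocess` and `report` for re-run, while `runnable_nodes` decides at each scheduling round which of them can start and which must still wait.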

Until this PR, replicating the director-v2 would also duplicate network calls and waste resources.

This PR aims to make the replication of the director-v2 more efficient by:

  • having a Manager that schedules the pipelines periodically (every 5s), relying directly on the information in the database, and running exclusively (i.e. only 1 director-v2 replica can do it at a time, using distributed locks),
  • publishing these scheduling tasks to a shared RabbitMQ queue named simcore.services.director-v2.scheduling,
  • having each director-v2 replica run COMPUTATIONAL_BACKEND_SCHEDULING_CONCURRENCY workers that apply the scheduling and communicate with the deeper backend (e.g. Dask). This shares the load between the different workers, so in effect every new director-v2 replica relieves the others,
  • COMPUTATIONAL_BACKEND_SCHEDULING_CONCURRENCY is currently hard-coded to 50. Some testing will be necessary to see whether that is too low or too high, which is why it is not an ENV variable at the moment. It will be converted to one if necessary.
  • 🗃️: comp_runs table is upgraded with new nullable scheduled and processed columns, which keep track of when a pipeline was scheduled by the manager and when a worker processed it.
  • 🗃️: comp_runs table is upgraded to contain only timezone-enabled timestamps
  • 🗃️: comp_tasks table is upgraded to contain only timezone-enabled timestamps
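The manager/worker split described in the list above can be sketched in a single process. This is only an illustration under loud assumptions: `asyncio.Queue` stands in for the shared RabbitMQ queue, `asyncio.Lock` stands in for the distributed lock, a dictionary stands in for the `comp_runs` rows, and `manager`, `worker`, `demo` and `WORKERS_PER_REPLICA` are hypothetical names, not the director-v2 API.

```python
import asyncio
from datetime import datetime, timezone

# Stand-in for COMPUTATIONAL_BACKEND_SCHEDULING_CONCURRENCY (hard-coded to 50 in the PR).
WORKERS_PER_REPLICA = 3


async def manager(runs, queue, lock, published):
    """One scheduling round: publish every run that is not yet scheduled."""
    async with lock:  # stand-in for the distributed lock: one manager at a time
        for run_id, run in runs.items():
            if run["scheduled"] is None:
                # Timezone-aware timestamp, mirroring the upgraded columns.
                run["scheduled"] = datetime.now(timezone.utc)
                await queue.put(run_id)
                published.append(run_id)


async def worker(runs, queue):
    """Consume run ids from the shared queue and mark them as processed."""
    while True:
        run_id = await queue.get()
        await asyncio.sleep(0)  # placeholder for the real work against the backend
        runs[run_id]["processed"] = datetime.now(timezone.utc)
        queue.task_done()


async def demo(n_runs=5, n_replicas=2):
    runs = {i: {"scheduled": None, "processed": None} for i in range(n_runs)}
    queue = asyncio.Queue()
    lock = asyncio.Lock()
    published = []
    workers = [
        asyncio.create_task(worker(runs, queue))
        for _ in range(n_replicas * WORKERS_PER_REPLICA)
    ]
    # Every replica runs a manager round; only unscheduled runs get published.
    await asyncio.gather(
        *(manager(runs, queue, lock, published) for _ in range(n_replicas))
    )
    await queue.join()  # wait until the workers have processed every published run
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return runs, published
```

Running `asyncio.run(demo())` publishes each run once even though two managers execute a round, and any worker of any simulated replica may pick a run up, which is the load-sharing effect the PR is after.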

Schematic

---
config:
  theme: mc
  layout: dagre
  look: handDrawn
---
flowchart LR
 subgraph s1["Director-v2.1"]
        n1["Manager:<br>5s: Schedule all pipelines"]
        n3["Worker1"]
        n4["Worker2"]
        n5["WorkerN"]
        n6["schedule pipeline1"]
        n7["schedule pipeline2"]
        n8["schedule pipeline3"]
  end
 subgraph s2["Cluster-UserX"]
        n9["Dask-Scheduler"]
        n11["Dask-Worker(s)"]
  end
 subgraph s3["Cluster-UserY"]
        n14["Dask-Scheduler"]
        n15["Dask-Worker(s)"]
  end
 subgraph s4["Cluster-UserZ"]
        n19["Dask-Scheduler"]
        n20["Dask-Worker(s)"]
  end
 subgraph s5["Director-v2.2"]
        n21["Manager:<br>5s: Schedule all pipelines"]
        n22["Worker1"]
        n23["Worker2"]
        n24["WorkerN"]
        n25["schedule pipeline4"]
        n26["schedule pipeline5"]
        n27["schedule pipeline6"]
  end
 subgraph s6["Cluster-UserA"]
        n90["Dask-Scheduler"]
        n91["Dask-Worker(s)"]
  end
 subgraph s7["Cluster-UserB"]
        n92["Dask-Scheduler"]
        n93["Dask-Worker(s)"]
  end
 subgraph s8["Cluster-UserC"]
        n94["Dask-Scheduler"]
        n95["Dask-Worker(s)"]
  end
    n1 ==> n2["RabbitMQ"]
    n2 --> n3 & n4 & n5 & n22 & n23 & n24
    n3 --> n6
    n4 --> n7
    n5 --> n8
    n6 --> n9
    n7 --> n14
    n8 --> n19
    n22 --> n25
    n23 --> n26
    n24 --> n27
    n21 ==> n2
    n25 --> n90
    n26 --> n92
    n27 --> n94
    n21@{ shape: rect}
    n22@{ shape: rect}
    n23@{ shape: rect}
    n24@{ shape: rect}
    n25@{ shape: rect}
    n26@{ shape: rect}
    n27@{ shape: rect}
    n2@{ shape: cyl}
    style n1 stroke:#D50000
    style n6 stroke:#D50000
    style n7 stroke:#D50000
    style n8 stroke:#D50000
    style n21 stroke:#D50000,stroke-width:1px,stroke-dasharray: 1
    style n25 stroke:#D50000
    style n26 stroke:#D50000
    style n27 stroke:#D50000


Legend:

  • blocks in $${\color{red}red}$$ run exclusively, guarded by Redis locks
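As a rough illustration of what running exclusively means here, the sketch below mimics a "set if absent, with expiry" lock in memory. It is an assumption-laden stand-in: the PR does not show how its Redis locks are implemented, the expiry (TTL) is an assumption of this sketch, and `InMemoryLockStore` with its methods is a hypothetical name.

```python
import time


class InMemoryLockStore:
    """In-memory stand-in for a store offering 'set if absent, with expiry'."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._locks = {}  # key -> (owner, expires_at)

    def acquire(self, key: str, owner: str, ttl_s: float) -> bool:
        """Take the lock if it is free or expired; never block."""
        now = self._clock()
        held = self._locks.get(key)
        if held is not None and held[1] > now:
            return False  # someone else still holds a valid lock
        self._locks[key] = (owner, now + ttl_s)
        return True

    def release(self, key: str, owner: str) -> bool:
        """Release the lock only if this owner is the one holding it."""
        held = self._locks.get(key)
        if held is not None and held[0] == owner:
            del self._locks[key]
            return True
        return False
```

With such a primitive, a replica that fails to acquire the lock simply skips the round, and an expiry keeps a crashed replica from blocking the others indefinitely.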

Related issue/s

How to test

Dev-ops checklist

@sanderegg sanderegg added the a:director-v2 issue related with the director-v2 service label Nov 15, 2024
@sanderegg sanderegg added this to the Event Horizon milestone Nov 15, 2024
@sanderegg sanderegg self-assigned this Nov 15, 2024
codecov bot commented Nov 15, 2024

Codecov Report

Attention: Patch coverage is 98.63946% with 4 lines in your changes missing coverage. Please review.

Project coverage is 88.51%. Comparing base (994c575) to head (a0e3990).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6736      +/-   ##
==========================================
+ Coverage   86.93%   88.51%   +1.58%     
==========================================
  Files        1553     1550       -3     
  Lines       61866    61702     -164     
  Branches     2110     2108       -2     
==========================================
+ Hits        53781    54618     +837     
+ Misses       7754     6753    -1001     
  Partials      331      331              
| Flag | Coverage Δ |
|---|---|
| integrationtests | 64.90% <93.23%> (+0.14%) ⬆️ |
| unittests | 86.42% <97.61%> (+1.76%) ⬆️ |

| Components | Coverage Δ |
|---|---|
| api | ∅ <ø> (∅) |
| pkg_aws_library | 93.49% <ø> (+0.21%) ⬆️ |
| pkg_dask_task_models_library | 97.09% <ø> (+0.52%) ⬆️ |
| pkg_models_library | 91.25% <100.00%> (+0.02%) ⬆️ |
| pkg_notifications_library | 84.57% <ø> (ø) |
| pkg_postgres_database | 87.76% <100.00%> (+0.23%) ⬆️ |
| pkg_service_integration | 70.00% <ø> (ø) |
| pkg_service_library | 76.07% <100.00%> (+0.16%) ⬆️ |
| pkg_settings_library | 90.62% <ø> (ø) |
| pkg_simcore_sdk | 85.38% <ø> (+0.02%) ⬆️ |
| agent | 97.00% <ø> (ø) |
| api_server | 89.72% <ø> (ø) |
| autoscaling | 95.21% <ø> (ø) |
| catalog | 90.57% <ø> (ø) |
| clusters_keeper | 98.73% <ø> (ø) |
| dask_sidecar | 91.26% <ø> (ø) |
| datcore_adapter | 93.17% <ø> (ø) |
| director | 76.40% <ø> (ø) |
| director_v2 | 91.58% <98.49%> (+0.46%) ⬆️ |
| dynamic_scheduler | 96.59% <ø> (ø) |
| dynamic_sidecar | 89.75% <ø> (ø) |
| efs_guardian | 90.12% <ø> (ø) |
| invitations | 93.44% <ø> (ø) |
| osparc_gateway_server | 85.49% <ø> (ø) |
| payments | 92.77% <ø> (ø) |
| resource_usage_tracker | 90.71% <ø> (-0.08%) ⬇️ |
| storage | 89.66% <ø> (ø) |
| webclient | ∅ <ø> (∅) |
| webserver | 88.72% <ø> (+4.90%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 994c575...a0e3990. Read the comment docs.

@sanderegg sanderegg force-pushed the computational-backend/release-scheduler-gil3 branch 8 times, most recently from 89783ed to b53c9b8 Compare November 24, 2024 14:16
@sanderegg sanderegg force-pushed the computational-backend/release-scheduler-gil3 branch 3 times, most recently from 4679bd1 to bf472ba Compare November 26, 2024 08:56
@sanderegg sanderegg changed the title from "🎨Computational backend: Refactoring dv-2 computational scheduling (Part 3)" to "🎨Computational backend: DV-2 computational scheduler becomes replicable" Nov 26, 2024
@sanderegg sanderegg force-pushed the computational-backend/release-scheduler-gil3 branch from bf472ba to aab6aab Compare November 26, 2024 15:20
@sanderegg sanderegg changed the title from "🎨Computational backend: DV-2 computational scheduler becomes replicable" to "🎨Computational backend: DV-2 computational scheduler becomes replicable (🗃️)" Nov 26, 2024
@sanderegg sanderegg force-pushed the computational-backend/release-scheduler-gil3 branch 5 times, most recently from d27ff76 to 496aa9f Compare November 28, 2024 15:56
@sanderegg sanderegg marked this pull request as ready for review November 28, 2024 15:57
@sanderegg sanderegg force-pushed the computational-backend/release-scheduler-gil3 branch from 0fe3496 to a0e3990 Compare December 2, 2024 10:07
sonarqubecloud bot commented Dec 2, 2024

@sanderegg sanderegg merged commit a2f9058 into ITISFoundation:master Dec 2, 2024
89 of 90 checks passed
@sanderegg sanderegg deleted the computational-backend/release-scheduler-gil3 branch December 2, 2024 10:36
Contributor

@bisgaard-itis bisgaard-itis left a comment

Very interesting! Looking forward to seeing multiple director-v2 replicas. Unfortunately I am not deep enough in the details to give a more thorough review, but I did discover one little thing. 🙂

Labels
a:director-v2 issue related with the director-v2 service
5 participants