✨Autoscaling: scale down while in use 🚨 #6898

Member

@sanderegg sanderegg commented Dec 3, 2024

What do these changes do?

Before this PR, autoscaling would either scale UP or scale DOWN, never both. As a result, if tasks requiring one type of machine were in the pipeline and a second set of tasks requiring a different type of machine came in, cluster scaling would deadlock: it could not scale down the unused machines to make room for the machines required by the second set of tasks.

This PR aims to fix this:

  • on each autoscaling round, any empty machine to which no task could be assigned goes through the draining process and then the termination process, making room for machines of any other type.
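
The round described above can be sketched roughly as follows. This is an illustrative model only, not the code from this PR: `Node`, `Task`, `NodeState` and `autoscaling_round` are hypothetical names, each node is assumed to run at most one task, and a node drained in one round is assumed to be terminated in the next.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    ACTIVE = "active"
    DRAINING = "draining"
    TERMINATED = "terminated"


@dataclass
class Node:  # hypothetical stand-in for a cluster machine
    name: str
    instance_type: str
    state: NodeState = NodeState.ACTIVE
    assigned_tasks: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Task:  # hypothetical stand-in for a pending task
    name: str
    required_instance_type: str


def autoscaling_round(nodes: list[Node], pending: list[Task]) -> list[Task]:
    """Run one simplified round and return the tasks that found no machine."""
    # Nodes drained in a previous round are now terminated (assumption).
    for node in nodes:
        if node.state is NodeState.DRAINING:
            node.state = NodeState.TERMINATED

    # Try to place each pending task on an empty active node of the right type.
    unassigned: list[Task] = []
    for task in pending:
        target = next(
            (
                n
                for n in nodes
                if n.state is NodeState.ACTIVE
                and n.instance_type == task.required_instance_type
                and not n.assigned_tasks
            ),
            None,
        )
        if target is None:
            unassigned.append(task)
        else:
            target.assigned_tasks.append(task.name)

    # Any active node that is still empty starts draining, even though
    # other nodes in the cluster are busy.
    for node in nodes:
        if node.state is NodeState.ACTIVE and not node.assigned_tasks:
            node.state = NodeState.DRAINING

    return unassigned


if __name__ == "__main__":
    cluster = [
        Node("a1", "type-a", assigned_tasks=["job-1"]),  # busy machine
        Node("a2", "type-a"),  # empty machine of the wrong type
    ]
    waiting = [Task("job-2", "type-b")]
    for _ in range(2):
        autoscaling_round(cluster, waiting)
    print([(n.name, n.state.value) for n in cluster])
```

In this sketch the busy machine stays active while the empty machine of the unneeded type is drained and then terminated, which is what frees capacity for the second machine type.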

🚨: This PR requires some manual testing to validate the approach works as expected

Bonus:

  • fixed the autoscaling monitoring script for osparc.io
  • refactored the tests to make it easier to mock Docker Swarm and AWS together

Related issue/s

How to test

Dev-ops checklist

@sanderegg sanderegg added the a:autoscaling label (autoscaling service in simcore's stack) Dec 3, 2024
@sanderegg sanderegg added this to the Event Horizon milestone Dec 3, 2024
@sanderegg sanderegg self-assigned this Dec 3, 2024
@sanderegg sanderegg force-pushed the autoscaling/bugfix/scale-down-while-in-use branch from f4930bb to 7004721 Compare December 3, 2024 18:45

codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 3 lines in your changes missing coverage. Please review.

Project coverage is 88.15%. Comparing base (9012c4d) to head (3a40819).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6898      +/-   ##
==========================================
+ Coverage   88.11%   88.15%   +0.03%     
==========================================
  Files        1587     1580       -7     
  Lines       62156    61978     -178     
  Branches     2008     2008              
==========================================
- Hits        54771    54636     -135     
+ Misses       7050     7006      -44     
- Partials      335      336       +1     
Flag Coverage Δ
integrationtests 64.95% <ø> (-0.02%) ⬇️
unittests 86.35% <93.33%> (+0.01%) ⬆️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library 93.49% <ø> (ø)
pkg_dask_task_models_library 97.09% <ø> (ø)
pkg_models_library 91.24% <ø> (ø)
pkg_notifications_library 84.57% <ø> (ø)
pkg_postgres_database 88.07% <ø> (ø)
pkg_service_integration 70.02% <ø> (ø)
pkg_service_library 75.04% <ø> (-0.06%) ⬇️
pkg_settings_library 90.60% <ø> (ø)
pkg_simcore_sdk 85.38% <ø> (ø)
agent 97.00% <ø> (ø)
api_server 90.04% <ø> (ø)
autoscaling 95.42% <93.33%> (+0.21%) ⬆️
catalog 90.57% <ø> (ø)
clusters_keeper 99.48% <ø> (ø)
dask_sidecar 91.26% <ø> (ø)
datcore_adapter 93.18% <ø> (ø)
director 76.40% <ø> (-0.09%) ⬇️
director_v2 91.37% <ø> (-0.02%) ⬇️
dynamic_scheduler 96.99% <ø> (ø)
dynamic_sidecar 89.75% <ø> (ø)
efs_guardian 90.12% <ø> (ø)
invitations 93.44% <ø> (ø)
osparc_gateway_server ∅ <ø> (∅)
payments 92.65% <ø> (ø)
resource_usage_tracker 89.58% <ø> (-0.07%) ⬇️
storage 89.60% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 87.70% <ø> (+<0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sanderegg sanderegg force-pushed the autoscaling/bugfix/scale-down-while-in-use branch 3 times, most recently from b5f913c to d6ad889 Compare December 10, 2024 16:01
@sanderegg sanderegg marked this pull request as ready for review December 10, 2024 17:51
@sanderegg sanderegg requested a review from pcrespov as a code owner December 10, 2024 17:51
@sanderegg sanderegg changed the title from "✨Autoscaling: scale down while in use" to "✨Autoscaling: scale down while in use 🚨" Dec 10, 2024
Contributor

@bisgaard-itis bisgaard-itis left a comment

Nice! Thanks a lot

Member

@mrnicegyu11 mrnicegyu11 left a comment

Didn't read the tests; the rest looks very nice. One very minor comment.

Contributor

@matusdrobuliak66 matusdrobuliak66 left a comment

👍

@sanderegg sanderegg force-pushed the autoscaling/bugfix/scale-down-while-in-use branch from 1a87a0b to 3dcde20 Compare December 11, 2024 14:36
Member

@pcrespov pcrespov left a comment

thx!

@sanderegg sanderegg force-pushed the autoscaling/bugfix/scale-down-while-in-use branch from 3dcde20 to 3a40819 Compare December 11, 2024 17:07
@sanderegg sanderegg merged commit ff6f85a into ITISFoundation:master Dec 11, 2024
93 checks passed
@sanderegg sanderegg deleted the autoscaling/bugfix/scale-down-while-in-use branch December 11, 2024 17:52
Labels
a:autoscaling (autoscaling service in simcore's stack)
Development

Successfully merging this pull request may close these issues.

Autoscaling: scaling down unused machines while asking for other types of machines does not work
6 participants