
Guidelines for dynamic scaling? #23

Open
Mats-Elfving opened this issue Jan 18, 2024 · 0 comments

Comments

@Mats-Elfving
Contributor

I am running the container on an AKS cluster, on a node pool backed by a VM scale set, with 1–4 pods managed by a Horizontal Pod Autoscaler and an automated cluster autoscaler. The SHIR is running as a backend for Synapse.

With this setup I scale up a node and a pod when CPU goes above 60%, and scale down when it remains below 30% for at least 10 minutes.
As far as I've been able to test, after a scale-up event it takes around 10 minutes before the new node is available to pick up load.
The scale-down event, by contrast, happens much faster.
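For reference, the policy described above can be approximated with an HPA manifest along these lines. This is a minimal sketch: the names `shir` and `shir-hpa` are hypothetical, and a plain CPU-utilization HPA expresses the scale-down side via a stabilization window rather than an explicit 30% threshold.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shir-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet       # or Deployment, depending on how the SHIR is deployed
    name: shir              # hypothetical workload name
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60          # scale up when average CPU exceeds 60%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # hold for 10 minutes before scaling down
```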

However, I've encountered multiple issues with this setup, so I'm looking for any best-practice guidelines on how to do this. The issues I perceive:

  • At the scale-up event (the actual event, before the new node is available), some (all?) running copy activities fail with a lost connection.
  • At scale up, once the new node is available, there can be a situation where multiple activities sit in the queue yet none is redirected to the new node. Instead, once an activity on the previous node completes, a new activity may be scheduled on the new node, while the queued ones are only ever scheduled on the previous node.
  • Sometimes the node appears healthy as a pod, i.e. the health-check script returns a good status, yet the Synapse monitor reports a connectivity issue, and sometimes there is also an error in the pod log. Does the health check only verify that things are running inside the pod? Wouldn't it make more sense to verify not only that it is running, but also that it has good connectivity to Synapse?
  • The scale-down event can potentially kill and fail any in-flight activity; it does not necessarily terminate the node gracefully. Is there any way to have the SHIR cluster shift the load off a node on a terminate request?
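On the last point, the closest Kubernetes-native mitigation would be a long `terminationGracePeriodSeconds` combined with a `preStop` hook, so the pod gets time to drain before it is killed. A sketch, with hypothetical names; the `sleep` is only a placeholder, since I'm not aware of a documented SHIR drain command, and it gives in-flight activities time to finish rather than actively shifting the load:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: shir                  # hypothetical workload name
spec:
  # ... selector, replicas, etc. omitted ...
  template:
    spec:
      terminationGracePeriodSeconds: 600   # allow up to 10 minutes to drain
      containers:
      - name: shir
        lifecycle:
          preStop:
            exec:
              # Placeholder: ideally this would wait for running activities
              # to complete before the container receives SIGTERM.
              command: ["/bin/sh", "-c", "sleep 300"]
```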