I am running the container on an AKS cluster, on a node in a VM scale set, with 1-4 pods managed by a Horizontal Pod Autoscaler and an automated Cluster Autoscaler. The SHIR is running as a backend for Synapse.
With this setup, a node and a pod are added when CPU goes above 60%, and removed when CPU remains below 30% for at least 10 minutes.
As far as I've been able to test, after the scale-up event it takes around 10 minutes before the new node is available to pick up load.
By contrast, the scale-down event is much faster.
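To make the scaling rule above concrete, here is a minimal sketch of it as I understand it. It only models the thresholds described in this issue (60% up, 30% down for 10 minutes, 1-4 pods). It is not the actual Horizontal Pod Autoscaler algorithm, which as far as I know works from a target utilization ratio. The names `ScalingRule` and `decide` are hypothetical, and one CPU sample per minute is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ScalingRule:
    # Values taken from the setup described in this issue.
    scale_up_cpu: float = 60.0        # scale up above this CPU %
    scale_down_cpu: float = 30.0      # scale down below this CPU %
    scale_down_window_min: int = 10   # ...sustained for this many minutes
    min_pods: int = 1
    max_pods: int = 4


def decide(rule: ScalingRule, pods: int, cpu_samples: Sequence[float]) -> int:
    """Return the desired pod count.

    cpu_samples holds one average-CPU reading per minute (assumption),
    newest last.
    """
    if not cpu_samples:
        return pods
    # Scale up as soon as the latest reading exceeds the upper threshold.
    if cpu_samples[-1] > rule.scale_up_cpu:
        return min(pods + 1, rule.max_pods)
    # Scale down only if the whole window stayed below the lower threshold.
    window = cpu_samples[-rule.scale_down_window_min:]
    if len(window) == rule.scale_down_window_min and all(
        s < rule.scale_down_cpu for s in window
    ):
        return max(pods - 1, rule.min_pods)
    return pods
```

The asymmetry is deliberate in this sketch: a single high reading triggers a scale-up, while a scale-down needs a full quiet window. Given the roughly 10-minute node start-up, a rule like this can still remove a node shortly before load returns.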
However, I've encountered multiple issues with this setup, so I'm looking for best-practice guidelines on how to do this. Issues I perceive:
At the scale-up event (the actual event, before the new node is available), some (all?) running copy activities fail with a lost connection.
At scale-up, once the new node is available, there seems to be a situation where multiple activities are in the queue, yet none are redirected to the new node. Instead, once an activity on the previous nodes completes, any new activity may be scheduled on the new node, while the ones already in the queue are only scheduled on the previous nodes.
Sometimes the node appears healthy as a pod, i.e. the health-check script returns a good status, yet the Synapse monitor reports a connectivity issue; sometimes there is also an error in the pod log. Does the health check only verify that things are running inside the pod? Would it not make more sense to verify not only that it is running, but also that it has a good connection to Synapse?
The scale-down event could kill and fail any low-effort activity; it does not necessarily terminate the node gracefully. Is there any way to have the SHIR cluster shift the load on a terminate request?
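For the last point, what I have in mind is something like the drain loop below, run when the pod receives a terminate request (Kubernetes sends SIGTERM and waits for the pod's termination grace period before killing it). This is only a sketch under assumptions: `get_active_count` is a hypothetical callable that returns the number of activities still running on the node, and I do not know whether the SHIR exposes such a number.

```python
import time
from typing import Callable


def drain(
    get_active_count: Callable[[], int],
    timeout_s: float,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Wait until no activities are running, or until timeout_s elapses.

    Returns True if the node drained in time, False on timeout.
    get_active_count is hypothetical: it stands in for whatever way
    exists (if any) to ask the node how many activities it is running.
    """
    deadline = clock() + timeout_s
    while True:
        if get_active_count() == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

For this to help, `timeout_s` would have to stay below the pod's termination grace period, and the node would also need to stop accepting new activities while draining, which this sketch does not cover.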