No HA support for some components in ManageIQ #583
Currently the ManageIQ operator hardcodes some of the deployment replica counts to 1 (httpd, memcached, etc.), which makes an HA deployment of ManageIQ impossible.

Can we let the customer set the replica count, or add a new field to the CR, like enableHA, so the customer can decide whether they want HA? A hypothetical sketch of such a field follows.

FYI @carbonin @Fryguy @chessbyte
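As an illustration of what that could look like, here is a minimal sketch of such CR spec fields, assuming a Go, controller-runtime-style operator; the field names (EnableHA, HttpdReplicas) are hypothetical, not the actual ManageIQ CRD schema.

```go
// Hypothetical CR spec fields; names are illustrative only, not the
// real ManageIQ CRD schema.
package v1alpha1

// ManageIQSpec sketches desired-state fields an operator could expose
// instead of hardcoding replica counts to 1.
type ManageIQSpec struct {
	// EnableHA toggles multi-replica deployments for components that
	// can safely run more than one copy (e.g. httpd).
	EnableHA bool `json:"enableHA,omitempty"`

	// HttpdReplicas overrides the replica count for the httpd
	// deployment; nil means "use the operator default of 1".
	HttpdReplicas *int32 `json:"httpdReplicas,omitempty"`
}
```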
Components:

Comments

I think we should be able to safely scale httpd; that just needs to be tested. Maybe we just remove the replica bit from reconcile to let the user set their own value (or maybe we just ensure it's >= 1).

I'm not sure about memcached. We store the user session in there, so if it's getting load balanced, it's possible the user will have to log in every time we hit a new memcached replica.

The orchestrator is also difficult because of the way the ManageIQ app works. Currently we have the one orchestrator pod watching for all of the "server" records. I think doing some kind of dynamic work distribution will be hard, and having each replica map to a specific server record is harder, so we would have to either move to a deployment per server (similar to the queue workers) or solve this in some other way.

Postgres is not mentioned here, but it's probably the most difficult if we want to roll our own container-based HA solution. I would sooner look for something that has already solved this problem, possibly https://github.com/CrunchyData/postgres-operator?

All that said, I think we would need to nail down exactly what kind of HA we're looking to achieve with this. Something like active-active postgres is much harder than active-standby. Additionally, the HA I'm talking about for PG is very different from just having two httpd pods. Suffice it to say this is not a bug ... it's a rather large enhancement.
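As a rough sketch of the "let the user set their own value, but ensure it's >= 1" idea, assuming a Go operator working against the Kubernetes apps/v1 API (the function name effectiveReplicas is made up for illustration):

```go
package reconcile

import (
	appsv1 "k8s.io/api/apps/v1"
)

// effectiveReplicas keeps whatever replica count the user (or an
// autoscaler) set on the existing deployment instead of resetting it
// to a hardcoded 1, only clamping so the component never scales to 0.
func effectiveReplicas(existing *appsv1.Deployment) int32 {
	if existing.Spec.Replicas == nil || *existing.Spec.Replicas < 1 {
		return 1 // fall back to the current default of one replica
	}
	return *existing.Spec.Replicas // respect the user-chosen value
}
```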
First question: do you consider "supports multiple replicas" to be HA? Or are we talking about full multi-cluster active-active or active-passive HA with data replication?
Thanks @carbonin for the detailed explanation here, really very helpful!
Here we're talking about a single-cluster model with "supports multiple replicas" HA.
Okay, so the goal here is to have very little interruption (less than the time it takes to reschedule a pod) for something like a node failure. Generally this would mean every service supporting multiple replicas, but in a case like a database with persistent storage, multiple replicas are not going to be the solution. This is probably a good enough description to go on for now. I'll add a checklist to the initial issue comment to cover the components that need work.
The way memcached works, it doesn't need a load balancer in front: load balancing is part of the client protocol, so to make it work, memcached should be deployed behind a headless service. This, however, spells trouble for ManageIQ, because it relies on the environment variables MEMCACHED_SERVICE_HOST and MEMCACHED_SERVICE_PORT to discover memcached. The catch is that when a service is headless, Kubernetes doesn't define those environment variables, causing the whole deployment to collapse, as none of the pods the orchestrator deploys can start.
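For reference, here is a minimal sketch of a headless memcached Service as a Go operator might create it (the name, labels, and package are assumptions, not the operator's actual resources); setting ClusterIP to "None" is precisely what makes the service headless and suppresses the MEMCACHED_SERVICE_HOST/MEMCACHED_SERVICE_PORT environment variables described above.

```go
package resources

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// headlessMemcachedService sketches a headless Service: clients resolve
// the service DNS name to the individual pod IPs and distribute keys
// among them client-side. Because ClusterIP is "None", Kubernetes
// injects no *_SERVICE_HOST or *_SERVICE_PORT env vars for it into
// other pods, which is the discovery problem described above.
func headlessMemcachedService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "memcached",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone, // headless: no virtual IP
			Selector:  map[string]string{"name": "memcached"},
			Ports: []corev1.ServicePort{
				{Name: "memcached", Port: 11211},
			},
		},
	}
}
```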