Multi-datacenter deployment #809
-
I am in the process of evaluating and designing an OpenBao installation here at the Wikimedia Foundation. We have two main datacenters; at any given point in time, one is the primary and the other is the secondary. I would like to design an OpenBao installation that allows failing over from the OpenBao instance in the primary datacenter to the one in the secondary. Since Raft streaming replication is not present in the OSS fork, I have explored two options.
Both options have trade-offs, and I am suffering a bit from analysis paralysis in deciding which one to choose. Does anyone in the community have experience with similar setups, or advice on evaluation criteria I should consider?
Replies: 1 comment 4 replies
-
@lollipopman Some thoughts... Happy to collaborate more; if you want to chat about architecture, I'm willing to have a call.

Horizontal scalability is the biggest shortcoming I think we have at the moment that's not already in progress. We have the existing HA mode that I'd look to extend for this, but in this mode, presently, a single leader is active and the other nodes cannot service any requests.

For Raft in particular, @JanMa recently implemented non-voter node support. This theoretically allows you to do distributed Raft replication, whereby a local cluster (say, 3 voting nodes) could be mirrored by external non-voters in other DR zones. While they still contribute to bandwidth usage (to ship the updates), they don't participate in write-confirmation votes and so won't impact latency as much. A voter in a different replication zone, by contrast, counts toward quorum, so unless you keep a sufficient ratio of local to remote voters, you will need votes from remote nodes to make progress, and you risk having a remote node elected leader. With the non-voter approach, you would need a DR failover process in the event all voting nodes go down, promoting some non-voters to voters to restore service (both pieces are sketched below).

For PostgreSQL... there is HA support. What happens is nodes race to acquire a lock, and it is assumed the database itself handles replication (or all nodes are connected to the same instance). I'm not quite as familiar with the properties of PostgreSQL streaming replication and how they'd impact HA mode or performance. I think something similar to non-voters (but built on the PostgreSQL HA lock acquisition) might also be useful long-term, especially for horizontal scalability: if your node is talking to a secondary replica database, you probably don't want it to attempt to become leader; instead, only nodes talking to the primary would be considered for active-node status.

The thing I've liked about Raft is that it is relatively well known and supported upstream. PostgreSQL wasn't, and I suspect there are edge cases people just haven't run into. One that I've been told should theoretically affect us is that all data is stored in a single large, pseudo-KV table, which could result in poor performance as the number of entries grows. However, Raft also seems to run into performance problems north of 15 GB, depending on your hardware and who you talk to.

Long-term, I'm toying with the idea of splitting storage into segments based on namespaces, potentially even using different storage engines for different namespaces and distributing leadership across the cluster, segmented by namespace. Each namespace would still have a single writer (strong consistency via local locking), but you could get higher write throughput by adding nodes, assuming your writes are spread across different mounts in different namespaces.

I am starting to put together a working group for horizontal scaling support; if you're interested in participating, let me know.
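To make the non-voter layout concrete, here's a minimal sketch of a secondary-datacenter node, assuming OpenBao's non-voter support follows upstream's `retry_join` semantics; hostnames, paths, and node IDs are placeholders:

```hcl
# Secondary-DC node: receives the Raft log but should not vote.
storage "raft" {
  path    = "/opt/openbao/data"
  node_id = "dc2-bao-1"  # illustrative node name

  # Reconnect to the voting cluster in the primary DC on startup.
  retry_join {
    leader_api_addr = "https://dc1-bao-1.example.org:8200"
  }
}
```

The node would then be added as a non-voter at join time, e.g. `bao operator raft join -non-voter https://dc1-bao-1.example.org:8200`; I'm assuming the flag name matches upstream's, so double-check it against the OpenBao release you're running.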
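For the failover step (all voters lost, promote the non-voters), the standard Raft recovery mechanism is a `peers.json` file dropped into the Raft directory (e.g. `/opt/openbao/data/raft/peers.json`) on each surviving node before restart. Assuming OpenBao keeps upstream's recovery format, it would look roughly like this; IDs and addresses are placeholders, and note the address uses the cluster port (8201), not the API port:

```json
[
  {
    "id": "dc2-bao-1",
    "address": "dc2-bao-1.example.org:8201",
    "non_voter": false
  },
  {
    "id": "dc2-bao-2",
    "address": "dc2-bao-2.example.org:8201",
    "non_voter": false
  }
]
```

On restart, each listed node adopts this as the new voter set, which effectively promotes the former non-voters, and the file is removed once it has been ingested. It's a destructive, last-resort operation, which is why you'd want it wrapped in a rehearsed DR runbook.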
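For comparison, the PostgreSQL route is mostly a storage-stanza change plus external replication. A sketch, assuming the backend keeps the upstream parameters (`connection_url`, `table`, `ha_enabled`); the connection details and table name are placeholders:

```hcl
# Every node points at the writable primary; PostgreSQL streaming
# replication (managed outside OpenBao) moves data between DCs.
storage "postgresql" {
  connection_url = "postgres://openbao:CHANGEME@db-primary.dc1.example.org:5432/openbao?sslmode=verify-full"
  table          = "openbao_kv_store"  # the single pseudo-KV table mentioned above
  ha_enabled     = "true"              # nodes race to acquire a lock to become the active node
}
```

This is where the replica-awareness gap shows up: nothing in that stanza knows whether it's pointed at the primary or a replica, so a node talking to a replica could still try to grab the HA lock.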
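The namespace-splitting idea is just that, an idea; nothing like it exists in OpenBao today. Purely to illustrate the shape of it, a hypothetical config (every stanza and parameter here is invented):

```hcl
# HYPOTHETICAL sketch only: per-namespace storage segments, each with
# its own single writer. No such configuration exists in OpenBao.
storage_segment "payments" {
  backend = "raft"  # one engine per namespace...
  path    = "/opt/openbao/data/payments"
}

storage_segment "analytics" {
  backend        = "postgresql"  # ...or a different one entirely
  connection_url = "postgres://..."
}
```

Each segment would keep a single writer for strong consistency, while leadership for different segments could land on different nodes, which is where the extra write throughput would come from.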