How to do a high-availability rolling update? #4539

wolfchimneyrock · 2024-01-29T15:08:04Z

wolfchimneyrock
Jan 29, 2024

Description

Registry
Version: 2.5.8
Persistence type: sql

For our in-progress high-availability Apicurio deployment, I am trying to understand how we can do rolling upgrades without causing an outage window.

Environment

We are running Apicurio Registry on a clustered set of VMs, each connecting to a shared distributed PostgreSQL db.

When upgrades occur, the new software is installed and started on one VM at a time. If the process fails, a rollback is usually attempted automatically before moving on to the next VM.

This typically lets us upgrade software without impacting availability during the upgrade.

With Apicurio Registry, it looks like database schema changes can occur between releases, and they are applied automatically when the software starts. There is also a strict requirement that only one database version can work with a given software version.

Because of this, when the first VM gets its upgraded software it will upgrade the database, while the other VMs continue to serve client requests, running the old software against the new database. I don't see any indication that this will always work, as updates sometimes remove tables or columns, etc.

If there is an issue with the new software, we can't roll back to the prior version, since the database has already been upgraded and the old software doesn't work with the new database.

Do you have any insight on how we can achieve rolling upgrades?

One idea is to relax the strict software-to-db version requirement so that the software has a minimum db version requirement instead (even a window of two allowed db versions would help). You could then break db updates into two steps:

1. add new features, deprecate old ones
2. remove old features

and then have a guarantee that any software version can work with two different database versions.
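The two-version window proposed above can be sketched as a simple compatibility check. This is only an illustration of the idea, not how Apicurio Registry validates its schema: the `SoftwareRelease` class, the integer db versions (10 to 13) and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareRelease:
    """A release plus the (hypothetical) window of db versions it accepts."""
    name: str
    min_db_version: int  # oldest schema this release can work with
    max_db_version: int  # newest schema this release can work with

    def supports(self, db_version: int) -> bool:
        return self.min_db_version <= db_version <= self.max_db_version


def rolling_upgrade_is_safe(old: SoftwareRelease, new: SoftwareRelease,
                            db_before: int, db_after: int) -> bool:
    """True if every node keeps working at each point of the rollout.

    Old nodes must work before and after the first new node migrates the
    schema (this also covers rolling back to the old release), and new
    nodes must work with the migrated schema.
    """
    return (old.supports(db_before)
            and old.supports(db_after)
            and new.supports(db_after))


# Hypothetical numbering: each release accepts its own schema and the next one.
release_x = SoftwareRelease("2.5.x", min_db_version=10, max_db_version=11)
release_y = SoftwareRelease("2.5.y", min_db_version=11, max_db_version=12)
release_z = SoftwareRelease("2.5.z", min_db_version=12, max_db_version=13)
```

Under these assumptions, upgrading from `release_x` to `release_y` with a schema change from 10 to 11 passes the check, while skipping straight to `release_z` (schema 12) does not, because the old nodes could not read the migrated database.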

Answered by wolfchimneyrock

Mar 15, 2024


@jsenko · 2024-01-29T15:08:07Z

apicurio-bot[bot]
bot Jan 29, 2024

Thank you for reporting an issue!

Pinging @jsenko to respond or triage.


carlesarnal · 2024-03-13T09:09:16Z

carlesarnal
Mar 13, 2024
Maintainer

This is a very interesting one.

Removing the constraint would not work in most scenarios, since a database change usually comes with matching code changes. One idea would be to provide downgrade scripts alongside the upgrade scripts we already provide for this kind of situation. Those scripts would be in charge of returning the database to a version compatible with the server.

Another option would be to enable a read-only mode for the maintenance window: point the running VMs at a read replica, upgrade the main database, and then, if everything is ok, upgrade the other VMs and the replica (another classic in this kind of situation). We've been working recently on a read-only mode that would help with this.
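To make the downgrade-script idea concrete, here is a minimal sketch of how paired upgrade and downgrade scripts could be selected. It assumes integer schema versions and one script per single-version step; the script file names and the `plan_migration` helper are hypothetical and do not correspond to anything shipped with Apicurio Registry.

```python
from typing import Dict, List, Tuple

# Hypothetical script names, keyed by (from_version, to_version).
MIGRATIONS: Dict[Tuple[int, int], str] = {
    (10, 11): "upgrade_10_to_11.sql",
    (11, 12): "upgrade_11_to_12.sql",
    (11, 10): "downgrade_11_to_10.sql",
    (12, 11): "downgrade_12_to_11.sql",
}


def plan_migration(current: int, target: int) -> List[str]:
    """Return the scripts to run, in order, to move the schema to `target`."""
    step = 1 if target > current else -1
    plan: List[str] = []
    version = current
    while version != target:
        key = (version, version + step)
        if key not in MIGRATIONS:
            raise LookupError(f"no script for schema {version} -> {version + step}")
        plan.append(MIGRATIONS[key])
        version += step
    return plan
```

With such a table, rolling the server back from a release that expects schema 12 to one that expects schema 10 would mean running the two downgrade scripts in descending order before starting the old software.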


wolfchimneyrock · 2024-03-15T14:48:01Z

wolfchimneyrock
Mar 15, 2024
Author

We have already implemented an auth proxy in front of two Apicurio instances (one read-write and one read-only, since our db read throughput is much higher with a read-only connection), which we can use to dynamically enable or disable the write APIs.
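The write toggle in such a proxy could look roughly like the sketch below. This is not the proxy described above, whose implementation is not shown in the thread: the `WriteGate` class is hypothetical, and classifying requests purely by HTTP method is a simplifying assumption (a real proxy may need per-path rules, since some APIs may use POST for read-only operations).

```python
import threading

# HTTP methods assumed never to modify registry state.
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


class WriteGate:
    """Thread-safe switch a proxy can consult before forwarding a request."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._read_only = False

    def set_read_only(self, enabled: bool) -> None:
        with self._lock:
            self._read_only = enabled

    def allows(self, method: str) -> bool:
        """Return True if a request with this HTTP method may be forwarded."""
        with self._lock:
            read_only = self._read_only
        return (not read_only) or method.upper() in SAFE_METHODS
```

A proxy handler would call `gate.allows(request_method)` and answer rejected requests itself, for example with a 503 status, instead of forwarding them.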

This is our current idea for migration: blue and green instances, each with its own db. Suppose blue is active on 2.5.x and you want to upgrade to 2.5.y:

1. While blue is serving requests, clear the green db and install version 2.5.y. Verify that the installation works.
2. Enable read-only API access.
3. Export the blue db to a .zip file. If this step fails, we can back out.
4. Import the blue db into green. We can still back out.
5. Route some % of incoming traffic to green as a smoke test. We can still back out.
6. Route all traffic to green.
7. Re-enable the write API.

Now green is active, and we can keep blue on standby for some time in case there is an issue with green.
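The migration steps above, with their back-out points, can be sketched as a small runner that undoes completed steps in reverse order when one fails. Everything here is a placeholder: `run_with_backout`, `noop` and the step list are hypothetical, and the real actions (export, import, traffic routing) would be supplied by the operator's own tooling.

```python
from typing import Callable, List, Tuple

Action = Callable[[], None]
Step = Tuple[str, Action, Action]  # (name, do, undo)


def noop() -> None:
    """Placeholder for an operator-supplied action."""


def run_with_backout(steps: List[Step]) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns a log of what happened.
    """
    log: List[str] = []
    done: List[Tuple[str, Action]] = []
    for name, do, undo in steps:
        try:
            do()
        except Exception as exc:
            log.append(f"failed: {name} ({exc})")
            for done_name, done_undo in reversed(done):
                done_undo()
                log.append(f"undone: {done_name}")
            return log
        done.append((name, undo))
        log.append(f"done: {name}")
    return log


# Step names follow the list above; every action is a stand-in.
BLUE_GREEN_STEPS: List[Step] = [
    ("install and verify 2.5.y on green", noop, noop),
    ("enable read-only API access", noop, noop),  # undo: re-enable writes
    ("export blue db to .zip", noop, noop),
    ("import blue db into green", noop, noop),    # undo: clear green db
    ("route some traffic to green", noop, noop),  # undo: route back to blue
    ("route all traffic to green", noop, noop),
    ("re-enable write API", noop, noop),
]
```

If, say, the import step raised an exception, the runner would undo the export and the read-only switch in that order, leaving blue serving read-write traffic as before.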


carlesarnal · 2024-04-09T11:48:12Z

carlesarnal
Apr 9, 2024
Maintainer

Yes, this kind of approach makes a lot of sense and is similar to what I would have expected. The steps I described were aimed at a managed database instance, not a self-managed one. I'll convert this to a discussion and select your comment as the answer. Thanks!
