Back up PostgreSQL databases #812
Started test full backup to S3.
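For reference, the pgBackRest-to-S3 setup boils down to a repo definition plus a backup command. Here's a minimal sketch, not our actual config: the stanza name, bucket, region, and paths are made up, and credentials are omitted.

```sh
# Sketch only: hypothetical stanza/bucket/paths; S3 credentials omitted.
cat > /etc/pgbackrest.conf <<'EOF'
[global]
repo1-type=s3
repo1-path=/pgbackrest
repo1-s3-bucket=our-backup-bucket
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-east-1
compress-type=lz4          # needs a pgBackRest version with LZ4 support
process-max=4

[mediacloud]
pg1-path=/var/lib/postgresql/data
EOF

# Initialize the repository for the stanza, then run a full backup.
pgbackrest --stanza=mediacloud stanza-create
pgbackrest --stanza=mediacloud --type=full backup
```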
Initial full PgBackRest backup completed in 4238 minutes (~71 hours) and used up ~14 TB of space (compressed with LZ4), but unfortunately it turned out that we won't be able to effectively do daily incremental backups with PgBackRest because it only supports file-level and not page-level incremental backups (pgbackrest/pgbackrest#959), and our database makes ~150 GB worth of changes every day (a rough estimate). Two tools claim to support page-level incremental backups, so I'll try them out.
WAL-G took 10837 minutes (181 hours, or almost 8 days) to complete, which is disappointing :( The backup consists of 28.2k files (15.9 TB), and in those 8 days PostgreSQL wrote an extra 185k WALs (1.7 TB). One thing left to try is the somewhat undocumented incremental (delta) backup mode; I'll do a delta backup next and see if it's faster.
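For the record, delta backups in WAL-G are switched on through an environment variable rather than a separate command. A minimal sketch, with a made-up bucket/prefix and the usual PGDATA path:

```sh
# WAL-G is configured via the environment; the values below are examples only.
export WALG_S3_PREFIX='s3://our-backup-bucket/woodward'  # hypothetical bucket/prefix
export WALG_COMPRESSION_METHOD=lz4
export WALG_DELTA_MAX_STEPS=7  # allow up to 7 deltas on top of the last full backup

# With WALG_DELTA_MAX_STEPS > 0 and an existing base backup, backup-push
# uploads only the pages changed since the previous backup.
wal-g backup-push /var/lib/postgresql/data
```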
A few incremental backups managed to finish too. Some stats:
Assuming that we'll want to:
I've created a little "model" (for the lack of a better term) to visualize how much disk space we'll need to store these backups + WALs: https://docs.google.com/spreadsheets/d/1Pe0Z12Y3eFzdz2bzwdeovRi2LVmmLueGhEXJYQFvXus/edit#gid=0

Notes:
Some things might be confusing, so feel free to ask away. Grand conclusion: even if my calculations are off (and they very well might be here and there), B2 is the cheapest option, at an average expense of ~$250/month plus ~$500 for restoring the backup (if we ever need to do it). @rahulbot, OK for us to start backing up to Backblaze B2?
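The spreadsheet isn't reproduced here, but the gist of the model is roughly the sketch below. Only the sizes come from the numbers above; the retention window and per-GB prices are placeholders, so plug in the provider's real pricing before trusting the output.

```sh
# Back-of-the-envelope version of the cost model; retention and prices are assumptions.
FULL_GB=16000         # ~15.9 TB full backup
DELTA_GB_PER_DAY=150  # rough daily change rate
WAL_GB_PER_DAY=210    # ~1.7 TB of WALs per 8 days
RETENTION_DAYS=30     # assumption
STORE_PRICE=0.005     # $/GB-month, placeholder
RESTORE_PRICE=0.01    # $/GB downloaded, placeholder

awk -v full="$FULL_GB" -v delta="$DELTA_GB_PER_DAY" -v wal="$WAL_GB_PER_DAY" \
    -v days="$RETENTION_DAYS" -v sp="$STORE_PRICE" -v rp="$RESTORE_PRICE" 'BEGIN {
    stored = full + (delta + wal) * days
    printf "stored: %.0f GB, storage: $%.0f/month, one full restore: $%.0f\n",
           stored, stored * sp, stored * rp
}'
```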
Started initial backup to B2.
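One way to point WAL-G at B2 is through B2's S3-compatible endpoint; I'm not certain this is exactly how our setup is wired, so treat the sketch below (with a made-up bucket and region) as illustrative only.

```sh
# Assumes B2's S3-compatible API; bucket, region, and credentials are placeholders.
export AWS_ACCESS_KEY_ID='<b2-key-id>'
export AWS_SECRET_ACCESS_KEY='<b2-application-key>'
export AWS_ENDPOINT='https://s3.us-west-002.backblazeb2.com'  # hypothetical region
export AWS_S3_FORCE_PATH_STYLE=true
export WALG_S3_PREFIX='s3://our-b2-backup-bucket/woodward'    # hypothetical bucket

wal-g backup-push /var/lib/postgresql/data
```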
Initial backup to B2 exited because the PostgreSQL container went away (got killed by someone or something):
at around:
Closest relevant messages to that time were Docker complaining about not being able to talk to other nodes:
So I'm guessing network problems at Holyoke? Especially considering that Icinga spent the whole day complaining about lost ping packets. Cleaned up an old backup, trying to do a new one again.
Started a test restore on EC2 to find out how much time it would take us to restore the backup if we ever needed to do it. Should run for about a day.
Even though there are no signs (that I could find) that it has OOMed, I've still switched the compression method from LZ4 to Brotli (wal-g/wal-g#224). Trying to make a new backup again.
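(The switch itself is just one WAL-G environment variable; a one-line sketch:)

```sh
# Compress subsequent backups with Brotli instead of the default LZ4.
export WALG_COMPRESSION_METHOD=brotli
```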
Initial full backup restore (without WALs) from S3 took 4636 minutes (3.2 days) on the EC2 test instance. Now trying to restore individual WALs on top of the restored full backup.
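For the WAL part, the usual WAL-G approach is to fetch the base backup and then let PostgreSQL pull individual WALs through restore_command during recovery. A sketch with example paths (recovery.conf assumes PostgreSQL older than 12; on 12+ the setting goes into postgresql.conf plus a recovery.signal file):

```sh
# 1. Fetch the latest base backup into an empty data directory (example path).
wal-g backup-fetch /var/lib/postgresql/data LATEST

# 2. Tell PostgreSQL how to fetch archived WALs during recovery.
cat > /var/lib/postgresql/data/recovery.conf <<'EOF'
restore_command = 'wal-g wal-fetch "%f" "%p"'
EOF

# 3. Start PostgreSQL; it replays WALs until the archive runs out (or a recovery target is hit).
```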
Given that the disks we run our services on are close to failing (or are at least very old), and the disk health review + replacement is still ongoing, we need to improve PostgreSQL backups by implementing the 3-2-1 strategy (at least three copies of the data, on two different kinds of media, with one copy off-site).

The current state of affairs is as follows:
In the -systems repo we have a zfs-send.py script which used to run once a day and would copy a ZFS snapshot of PostgreSQL data to another server, sinclair. However, we suspended this script due to Holyoke's annual shutdown and haven't resumed its operation since. Plus, sinclair, the server being backed up to, has even older disks, at least one of which is shot already, so it's hardly a real backup on its own.

Therefore, a to-do:
- Get zfs-send.py to periodically copy the PostgreSQL dataset from woodward to sinclair again. Sinclair's disks are unreliable, but at least it's something (see the zfs send / receive sketch below).
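For context, what zfs-send.py automates boils down to incremental ZFS replication; roughly the following, with assumed dataset names (woodward is the source, sinclair the target):

```sh
# Dataset/snapshot names are assumptions; run on woodward.
TODAY=$(date +%F)
PREV=$(date -d yesterday +%F)

# Take today's snapshot of the PostgreSQL dataset.
zfs snapshot data/postgresql@"$TODAY"

# Send only the changes between yesterday's and today's snapshots to sinclair.
zfs send -i data/postgresql@"$PREV" data/postgresql@"$TODAY" \
    | ssh sinclair zfs receive backup/postgresql
```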