
pghoard: ignore delta backup failures counter in some cases #621

Merged: 3 commits merged into main from egor-voynov-reset-backup-attempt-counter on May 27, 2024

Conversation

egor-voynov-aiven (Contributor) commented Apr 9, 2024

Pghoard will try to make a backup in these cases, even if the retries are exhausted:

  1. A backup was requested by the operator (the related HTTP endpoint was called)
  2. More than `backup_interval` has passed since the last unsuccessful attempt

[BF-2390]

Pghoard can get stuck in a state where it doesn't make a backup after several failures. It just writes "Giving up backup after exceeding max retries" in the log, and only a restart can help.

Cases:
1. A backup was requested by the operator: `AVN-PROD service request-backup`
2. More than `backup_interval` has passed since the last unsuccessful attempt

[BF-2390]
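For context, a minimal sketch of the decision logic described above (simplified, not the actual pghoard function; the helper name `should_retry_despite_failures` is an assumption, while `BackupReason`, `backup_interval` and `metadata["backup-reason"]` come from the diff shown further down):

```python
import enum


class BackupReason(str, enum.Enum):
    scheduled = "scheduled"
    requested = "requested"


def should_retry_despite_failures(metadata, last_failed_time, backup_interval, utc_now):
    """Decide whether a delta basebackup should be retried even though the
    failure counter already exceeded the configured maximum (sketch only)."""
    if metadata.get("backup-reason") == BackupReason.requested:
        # An operator explicitly asked for a backup: always retry.
        return True
    since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
    # Otherwise retry only when more than backup_interval has passed since the last failure.
    return bool(backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval)
```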
egor-voynov-aiven changed the title from "pghoard: convert 'metadata["backup-reason"]' value to enum" to "pghoard: ignore delta backup failures counter in some cases" on Apr 10, 2024
egor-voynov-aiven marked this pull request as ready for review on April 10, 2024 10:26
In the CI environment, threads don't have enough time to initialize.
```python
since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
if metadata["backup-reason"] == BackupReason.requested:
    self.log.info("Re-trying delta basebackup. Backup was requested")
elif backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval:
```
Contributor:
We need to discuss whether we should apply a backoff retry interval to these two new cases. Just retrying without waiting a bit seems less meaningful to me.
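For illustration only, a backoff check in that spirit might look roughly like this (a hypothetical sketch; `retry_count` and `base_delay_seconds` are assumed names and are not part of this PR):

```python
import datetime


def backoff_elapsed(last_failed_time, retry_count, utc_now, base_delay_seconds=60):
    # Illustrative only: allow another attempt once an exponentially growing delay has passed.
    if last_failed_time is None:
        return True
    delay = datetime.timedelta(seconds=base_delay_seconds * (2 ** retry_count))
    return utc_now() - last_failed_time >= delay
```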

Contributor:

I would not do any backoff in case the backup was requested manually; it's up to the user to decide whether the backup needs to be retried.
As for the second case, I am not quite sure it makes sense, because if the backup kept failing until it maxed out the retries and is still failing now, the problem probably did not go away. Thus I would not waste resources and money on object storage calls.

Contributor:

Agree. Only let "requested" backup keep retrying.

```diff
@@ -900,6 +900,7 @@ def test_surviving_pg_receivewal_hickup(self, db, pghoard):
     os.makedirs(wal_directory, exist_ok=True)

     pghoard.receivexlog_listener(pghoard.test_site, db.user, wal_directory)
+    time.sleep(0.5)  # waiting for thread setup
```
Contributor:
Are those sleeps really necessary?
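For reference, a common alternative to a fixed sleep is to poll for readiness with a timeout; a hypothetical sketch (`wait_until` and the readiness condition are assumptions, not part of this PR):

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    # Sketch only: poll `condition` until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```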

alexole merged commit a1da446 into main on May 27, 2024
7 checks passed
alexole deleted the egor-voynov-reset-backup-attempt-counter branch on May 27, 2024 09:47