pghoard: ignore delta backup failures counter in some cases #621
Conversation
Cases:
1. Backup was requested by the operator: `AVN-PROD service request-backup`
2. More than `backup_interval` has passed since the last unsuccessful attempt [BF-2390]
In the CI environment, the threads don't have enough time to initialize.
since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
if metadata["backup-reason"] == BackupReason.requested:
    self.log.info("Re-trying delta basebackup. Backup was requested")
elif backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval:
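The quoted diff only shows the new conditions, not the surrounding retry gate. For orientation, here is a minimal self-contained sketch of the kind of decision the two cases feed into; the function name, the `failure_count`/`max_retries` parameters and the plain-string `backup_reason` are illustrative assumptions, not pghoard's actual API.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


def should_retry_delta_basebackup(
    backup_reason: str,
    failure_count: int,
    max_retries: int,
    last_failed_time: Optional[datetime],
    backup_interval: Optional[timedelta],
) -> bool:
    # Below the retry limit: keep retrying as before.
    if failure_count < max_retries:
        return True
    # Case 1: the operator explicitly requested a backup -> ignore the failure counter.
    if backup_reason == "requested":
        return True
    # Case 2: a full backup_interval has passed since the last failed attempt ->
    # give the scheduled backup another chance.
    since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
    if backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval:
        return True
    return False
```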
Need to discuss whether we should add a backoff retry interval for these two new cases. Just retrying without waiting a bit seems less meaningful to me.
I would not do any backoff if the backup was requested manually; it's up to the user to decide whether the backup should be retried.
As for the second case, I am not quite sure it makes sense: if the backup kept failing until it maxed out the retries and is still failing now, the problem probably did not go away. I would rather not waste resources and money on object storage calls.
Agreed. Only let "requested" backups keep retrying.
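A tiny sketch of the behaviour agreed on above, with illustrative names only (this `BackupReason` stand-in and the `may_retry` helper are assumptions, not pghoard's actual code): an operator-requested backup ignores the failure counter, while a scheduled one still gives up after the retry limit.

```python
from enum import Enum


class BackupReason(str, Enum):  # illustrative stand-in for pghoard's enum
    scheduled = "scheduled"
    requested = "requested"


def may_retry(backup_reason: BackupReason, failure_count: int, max_retries: int) -> bool:
    # Operator-requested backups bypass the failure counter entirely;
    # scheduled backups still give up once max_retries is reached.
    if backup_reason == BackupReason.requested:
        return True
    return failure_count < max_retries
```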
@@ -900,6 +900,7 @@ def test_surviving_pg_receivewal_hickup(self, db, pghoard):
    os.makedirs(wal_directory, exist_ok=True)

    pghoard.receivexlog_listener(pghoard.test_site, db.user, wal_directory)
    time.sleep(0.5)  # waiting for thread setup
Are those sleeps really necessary?
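One common alternative to a fixed sleep is to poll for readiness with a timeout. A minimal sketch follows; the `wait_for` helper and whatever readiness predicate the test would pass to it are assumptions, not part of pghoard's test utilities.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> None:
    """Poll `condition` until it returns True, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        time.sleep(interval)
    raise TimeoutError("condition was not met within the timeout")
```

Usage would be something like `wait_for(lambda: receiver_thread.is_alive())`, where `receiver_thread` is a hypothetical handle to whatever the listener starts; this returns as soon as the thread is up instead of always paying the full 0.5 s.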
Pghoard will try making a backup in these cases, even if the retries are exhausted:
1. The backup was requested by the operator
2. More than `backup_interval` has passed since the last unsuccessful attempt [BF-2390]
Pghoard can get stuck in a state where it no longer makes backups after several failures. It just writes `Giving up backup after exceeding max retries` to the log, and only a restart helps.
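To make the stuck state concrete, here is a hedged sketch of a failure-counter gate of this general shape (not pghoard's actual implementation; `MAX_RETRIES`, `run_backup_cycle` and `run_basebackup` are illustrative names): once the counter reaches the limit, every subsequent cycle logs the same message, nothing ever resets the counter, and only a restart clears it.

```python
import logging

log = logging.getLogger("pghoard")
MAX_RETRIES = 3  # assumed limit, for illustration only


def run_backup_cycle(failure_count: int, run_basebackup) -> int:
    """Run one backup cycle and return the updated failure counter."""
    if failure_count >= MAX_RETRIES:
        # The counter is never reset on this path, so every later cycle hits
        # this branch again -- the "stuck until restart" state described above.
        log.warning("Giving up backup after exceeding max retries")
        return failure_count
    try:
        run_basebackup()
        return 0  # success resets the counter
    except Exception:
        log.exception("basebackup attempt failed")
        return failure_count + 1
```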