
pghoard: ignore delta backup failures counter in some cases #621

Merged: 3 commits merged into main from egor-voynov-reset-backup-attempt-counter on May 27, 2024

Conversation

egor-voynov-aiven (Contributor) commented Apr 9, 2024

Pghoard will try to make a backup in these cases, even if the retries are exhausted:

  1. A backup was requested by the operator (the related HTTP endpoint was called)
  2. More than `backup_interval` has passed since the last unsuccessful attempt

[BF-2390]

Pghoard can get stuck in a state where it doesn't make a backup after several failures. It just writes "Giving up backup after exceeding max retries" in the log, and only a restart can help.

Cases:
1. A backup was requested by the operator: `AVN-PROD service request-backup`
2. More than `backup_interval` has passed since the last unsuccessful attempt

[BF-2390]
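For context, a minimal sketch of the decision logic described above (simplified, not the actual pghoard function; the helper name `should_retry_despite_failures` is an assumption, while `BackupReason`, `backup_interval` and `metadata["backup-reason"]` come from the diff shown further down):

```python
import enum


class BackupReason(str, enum.Enum):
    scheduled = "scheduled"
    requested = "requested"


def should_retry_despite_failures(metadata, last_failed_time, backup_interval, utc_now):
    """Decide whether a delta basebackup should be retried even though the
    failure counter already exceeded the configured maximum (sketch only)."""
    if metadata.get("backup-reason") == BackupReason.requested:
        # An operator explicitly asked for a backup: always retry.
        return True
    since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
    # Otherwise retry only when more than backup_interval has passed since the last failure.
    return bool(backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval)
```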
egor-voynov-aiven changed the title from "pghoard: convert 'metadata["backup-reason"]' value to enum" to "pghoard: ignore delta backup failures counter in some cases" on Apr 10, 2024
egor-voynov-aiven marked this pull request as ready for review on April 10, 2024 10:26
In the CI environment, threads don't have enough time to initialize.
```python
since_last_fail_interval = utc_now() - last_failed_time if last_failed_time else None
if metadata["backup-reason"] == BackupReason.requested:
    self.log.info("Re-trying delta basebackup. Backup was requested")
elif backup_interval and since_last_fail_interval and since_last_fail_interval > backup_interval:
```
Contributor:
We need to discuss whether we should apply a backoff retry interval to these two new cases. Just retrying without waiting a bit seems less meaningful to me.
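For illustration only, a backoff check in that spirit might look roughly like this (a hypothetical sketch; `retry_count` and `base_delay_seconds` are assumed names and are not part of this PR):

```python
import datetime


def backoff_elapsed(last_failed_time, retry_count, utc_now, base_delay_seconds=60):
    # Illustrative only: allow another attempt once an exponentially growing delay has passed.
    if last_failed_time is None:
        return True
    delay = datetime.timedelta(seconds=base_delay_seconds * (2 ** retry_count))
    return utc_now() - last_failed_time >= delay
```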

Contributor:

I would not do any backoff in case the backup was requested manually; it's up to the user to decide whether the backup needs to be retried.
As for the second case, I am not quite sure it makes sense, because if the backup kept failing until it maxed out the retries and is still failing now, the problem probably did not go away. Thus I would not waste resources and money on object storage calls.

Contributor:

Agree. Only let "requested" backup keep retrying.

```diff
@@ -900,6 +900,7 @@ def test_surviving_pg_receivewal_hickup(self, db, pghoard):
     os.makedirs(wal_directory, exist_ok=True)

     pghoard.receivexlog_listener(pghoard.test_site, db.user, wal_directory)
+    time.sleep(0.5)  # waiting for thread setup
```
Contributor:
Are those sleeps really necessary?
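For reference, a common alternative to a fixed sleep is to poll for readiness with a timeout; a hypothetical sketch (`wait_until` and the readiness condition are assumptions, not part of this PR):

```python
import time


def wait_until(condition, timeout=5.0, interval=0.1):
    # Sketch only: poll `condition` until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```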

alexole merged commit a1da446 into main on May 27, 2024
7 checks passed
alexole deleted the egor-voynov-reset-backup-attempt-counter branch on May 27, 2024 09:47