Add progress based basebackup metrics #615

Merged
facetoe merged 1 commit into main from sebinsunny-refactor-pg-backup-metrics
Mar 18, 2024

Conversation

sebinsunny
Contributor

This PR improves the monitoring of PostgreSQL basebackups. During a backup, it regularly checks how much data has been uploaded and compares that with the last amount recorded in a persisted progress file. If the upload is progressing, it updates the record with the new amount and the current time. If the backup has not advanced past the previous value, it reports the time elapsed since the last progress and emits stalled metrics. Once a backup completes, the record is reset.
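
The check described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ProgressRecord`, `load_record`, `save_record`, `check_progress`, and `reset_progress` are hypothetical names, and the JSON layout and the atomic temp-file write are assumptions.

```python
import json
import os
import tempfile
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ProgressRecord:
    """Hypothetical on-disk record: bytes uploaded and when that value last grew."""
    current_progress: float = 0.0
    last_updated_time: float = 0.0


def load_record(path: str) -> ProgressRecord:
    """Read the record, falling back to an empty one if the file is missing."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return ProgressRecord(**json.load(f))
    except FileNotFoundError:
        return ProgressRecord()


def save_record(path: str, record: ProgressRecord) -> None:
    """Write atomically: dump to a temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(asdict(record), f)
    os.replace(tmp_path, path)


def check_progress(path: str, uploaded_bytes: float, now: Optional[float] = None) -> float:
    """Return seconds since the last observed progress (0.0 if the upload advanced)."""
    now = time.time() if now is None else now
    record = load_record(path)
    # Advance the record when bytes grew, or when no timestamp was recorded yet.
    if uploaded_bytes > record.current_progress or record.last_updated_time == 0.0:
        save_record(path, ProgressRecord(uploaded_bytes, now))
        return 0.0
    # No advance: the caller would emit this value as a "stalled" metric.
    return now - record.last_updated_time


def reset_progress(path: str) -> None:
    """Clear the record once the backup has completed."""
    save_record(path, ProgressRecord())
```

A caller would invoke `check_progress` periodically during the upload and report any non-zero return value as the stalled duration.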

[SRE-7631]

@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 3 times, most recently from cf6b3ef to c46ee6a Compare February 7, 2024 07:01
@codecov-commenter

codecov-commenter commented Feb 7, 2024

Codecov Report

Attention: Patch coverage is 70.10309%, with 29 lines in your changes missing coverage. Please review.

Project coverage is 90.95%. Comparing base (2b2ea98) to head (6e00a1c).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #615      +/-   ##
==========================================
- Coverage   91.10%   90.95%   -0.16%     
==========================================
  Files          32       32              
  Lines        4823     4918      +95     
==========================================
+ Hits         4394     4473      +79     
- Misses        429      445      +16     
Files                         Coverage Δ
pghoard/basebackup/base.py    92.25% <100.00%> (+0.03%) ⬆️
pghoard/transfer.py           95.78% <64.70%> (-0.09%) ⬇️
pghoard/common.py             90.96% <80.35%> (-2.25%) ⬇️
pghoard/basebackup/delta.py   91.20% <45.45%> (-4.05%) ⬇️

... and 3 files with indirect coverage changes

rikonen
rikonen previously requested changes Feb 7, 2024
Contributor

@rikonen rikonen left a comment

Not a proper review, but I happened to open this PR and the logic for updating the JSON file on disk is problematic.

pghoard/common.py Outdated Show resolved Hide resolved
@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 4 times, most recently from 2348ff0 to cbc096b Compare February 8, 2024 05:17
@sebinsunny sebinsunny dismissed rikonen’s stale review February 8, 2024 05:20

Addressed the changes

@sebinsunny sebinsunny requested a review from rikonen February 8, 2024 05:20
@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 6 times, most recently from 68ef95e to d868541 Compare February 9, 2024 00:41
Contributor

@RommelLayco RommelLayco left a comment

Please fix the tests. I also have some questions about the code.

test/test_transferagent.py Show resolved Hide resolved
pghoard/transfer.py Outdated Show resolved Hide resolved
pghoard/transfer.py Show resolved Hide resolved
pghoard/transfer.py Outdated Show resolved Hide resolved
pghoard/common.py Outdated Show resolved Hide resolved
test/test_common.py Outdated Show resolved Hide resolved
test/test_common.py Outdated Show resolved Hide resolved
test/test_common.py Outdated Show resolved Hide resolved
test/test_common.py Outdated Show resolved Hide resolved
@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 3 times, most recently from afe89ac to 7e84361 Compare February 9, 2024 04:08
@sebinsunny sebinsunny dismissed RommelLayco’s stale review February 9, 2024 04:21

changes are addressed

@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 2 times, most recently from 11c6a5a to c2f61c5 Compare February 12, 2024 04:21
@sebinsunny sebinsunny requested a review from alexole February 12, 2024 23:16
Contributor

@RommelLayco RommelLayco left a comment

Code looks good to me. Just one more question.

@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 2 times, most recently from e44c744 to bd1d7d6 Compare February 17, 2024 05:06
pghoard/common.py Outdated Show resolved Hide resolved
@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 2 times, most recently from 175290b to d76b15e Compare February 21, 2024 23:29
pghoard/common.py Outdated Show resolved Hide resolved
@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 2 times, most recently from d1b36de to 83b5931 Compare March 11, 2024 05:26
snapshotter.snapshot(reuse_old_snapshotfiles=False)
def progress_callback(msg: ProgressStep, progress_data: ProgressMetrics):
    key = "snapshot_progress"
    persisted_progress = PersistedProgress.read(self.metrics)
Contributor

This is going to cause a lot of reads/writes. We should implement something that mostly updates in memory and only writes to disk every now and again, as this callback is called a lot AFAIK.

Contributor Author

will update this
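
One way the suggestion in this thread could look is sketched below: keep progress in memory and write to disk at most once per interval. This is only an illustration under assumptions; `ThrottledProgressWriter`, its `min_interval` parameter, and the injectable `clock` are hypothetical and are not part of pghoard.

```python
import json
import time
from typing import Callable, Dict, Optional


class ThrottledProgressWriter:
    """Hypothetical helper: update state in memory, flush to disk at most every min_interval seconds."""

    def __init__(self, path: str, min_interval: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.path = path
        self.min_interval = min_interval
        self.clock = clock
        self.state: Dict[str, float] = {}
        self.writes = 0  # number of disk writes, exposed for inspection
        self._last_flush: Optional[float] = None
        self._dirty = False

    def update(self, key: str, value: float) -> None:
        """Record the value in memory; write to disk only if the interval has elapsed."""
        self.state[key] = value
        self._dirty = True
        now = self.clock()
        if self._last_flush is None or now - self._last_flush >= self.min_interval:
            self.flush()

    def flush(self) -> None:
        """Write pending state to disk; a no-op when nothing changed since the last write."""
        if not self._dirty:
            return
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(self.state, f)
        self._last_flush = self.clock()
        self._dirty = False
        self.writes += 1
```

A frequently invoked callback would call `update(...)` on every invocation, and the owner would call `flush()` once at the end so the final value is not lost.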

@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch 4 times, most recently from e386c66 to 5a186f5 Compare March 18, 2024 00:02
@sebinsunny sebinsunny force-pushed the sebinsunny-refactor-pg-backup-metrics branch from 5a186f5 to 6e00a1c Compare March 18, 2024 02:30
@facetoe facetoe merged commit 078c81f into main Mar 18, 2024
7 checks passed
@facetoe facetoe deleted the sebinsunny-refactor-pg-backup-metrics branch March 18, 2024 22:24
sebinsunny added a commit that referenced this pull request May 31, 2024
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation.

[SRE-7476]
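
The retry behaviour described in this commit message might be modelled along these lines. It is a rough sketch of one reading of the message, not the actual change: `AttemptState` and `finish_attempt` are hypothetical names, and treating "no more bytes than the previous attempt" as stalled is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class AttemptState:
    """Hypothetical persisted state carried across basebackup attempts."""
    total_bytes_uploaded: float = 0.0


def finish_attempt(state: AttemptState, uploaded_bytes: float,
                   succeeded: bool) -> Tuple[AttemptState, bool]:
    """Return (new_state, stalled) after an attempt ends.

    The state is reset only on success. On a failed or interrupted attempt,
    the attempt counts as stalled if it did not exceed the bytes uploaded
    by the previous attempt.
    """
    if succeeded:
        return AttemptState(), False
    stalled = uploaded_bytes <= state.total_bytes_uploaded
    best = max(state.total_bytes_uploaded, uploaded_bytes)
    return AttemptState(total_bytes_uploaded=best), stalled
```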
7 participants