-
Notifications
You must be signed in to change notification settings - Fork 98
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add progress based basebackup metrics #615
Conversation
cf6b3ef
to
c46ee6a
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #615 +/- ##
==========================================
- Coverage 91.10% 90.95% -0.16%
==========================================
Files 32 32
Lines 4823 4918 +95
==========================================
+ Hits 4394 4473 +79
- Misses 429 445 +16
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a proper review but happened to open this pr and the logic for updating the json file on disk is problematic.
2348ff0
to
cbc096b
Compare
68ef95e
to
d868541
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please fix tests. and i have some questions about the code
afe89ac
to
7e84361
Compare
11c6a5a
to
c2f61c5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code looks good to me. just one more question
e44c744
to
bd1d7d6
Compare
175290b
to
d76b15e
Compare
d1b36de
to
83b5931
Compare
pghoard/basebackup/delta.py
Outdated
snapshotter.snapshot(reuse_old_snapshotfiles=False) | ||
def progress_callback(msg: ProgressStep, progress_data: ProgressMetrics): | ||
key = "snapshot_progress" | ||
persisted_progress = PersistedProgress.read(self.metrics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to cause a lot of read/writes. We should implement something that mostly updates in memory and writes every now and again as this callback is called a lot AFAIK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update this
e386c66
to
5a186f5
Compare
…gularly checks how much data has been uploaded and compares this to the last recorded amount in a persisted progress file. If the upload is progressing, it updates the record with the new data and current time. If the backup has not advanced compared to the previous value, it reports the time elapsed since the last progress and emits stalled metrics. Once a backup is complete, the record is reset. [SRE-7631]
5a186f5
to
6e00a1c
Compare
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
…reviously, we reset the basebackup progress file whenever a new basebackup request was made, which resulted in not catching a few cases where pghoard restarts. Now, the progress file is only reset when a backup is successful, and we also record the total bytes uploaded in the file for the previous basebackup. If there is a retry due to a pghoard restart or a failed backup request, we check if progress has been made; if it has not exceeded the bytes uploaded in the previous state, we emit a stalled metric. Also, added logging for upload progress for each file and snapshot stages in a basebackup operation. [SRE-7476]
This PR improves the monitoring of pg basebackups. During a backup, it regularly checks how much data has been uploaded and compares this to the last recorded amount in a persisted progress file. If the upload is progressing, it updates the record with the new data and current time. If the backup has not advanced compared to the previous value, it reports the time elapsed since the last progress and emits stalled metrics. Once a backup is complete, the record is reset.
[SRE-7631]