Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ADBDEV-6641: Rework Fix gprestore/gpbackup hanging in case the helper goes down #113

Open
wants to merge 27 commits into
base: master
Choose a base branch
from

Conversation

RekGRpth
Copy link
Member

@RekGRpth RekGRpth commented Nov 12, 2024

Rework Fix gprestore/gpbackup hanging in case the helper goes down

Commit 8060b4e starts a goroutine in gpbackup/gprestore which polls every 5
seconds if the helper has failed and cancels pending COPY commands via the
execution context. I think it's a big overhead to execute commands via ssh so
often, especially on large clusters.

In case of a fatal error on the helper, gprestore/gpbackup could hang forever.
gprestore hung because the COPY command was expecting data from a pipe file
(via 'cat <pipe_file>') which was deleted in the helper's DoCleanup function
before any data was put into the pipe by the restore helper, when the restore
helper exited due to some error. gpbackup hung for similar reasons - the backup
helper exited before it opened the pipe for reading.

The new solution is quite simple: on exit, if was not sigpiped, the helper
opens and closes the pipes that the COPY command is waiting for, which also
causes it to exit. Now we don't need pipe lists in helpers, because an error in
a helper can occur before these lists are filled.

The test has been modified to show and fix more relevant results.

@RekGRpth RekGRpth changed the title ADBDEV-6641: Adbdev 6641 ADBDEV-6641: Rework Fix gprestore/gpbackup hanging in case the helper goes down Nov 12, 2024
@RekGRpth RekGRpth marked this pull request as ready for review November 12, 2024 15:14
helper/helper.go Outdated Show resolved Hide resolved
helper/restore_helper.go Outdated Show resolved Hide resolved
@whitehawk

This comment was marked as resolved.

@RekGRpth

This comment was marked as resolved.

@whitehawk

This comment was marked as resolved.

@RekGRpth RekGRpth marked this pull request as draft November 14, 2024 03:36
@RekGRpth RekGRpth marked this pull request as ready for review November 14, 2024 04:23
@RekGRpth

This comment was marked as resolved.

@whitehawk

This comment was marked as resolved.

helper/helper.go Outdated Show resolved Hide resolved
helper/helper.go Outdated Show resolved Hide resolved
helper/helper.go Outdated Show resolved Hide resolved
@RekGRpth

This comment was marked as resolved.

@RekGRpth
Copy link
Member Author

I've backed up data from a 6-segment cluster and injected an error in the data (backup data attached - 6-segment-backup-with-error.tar.gz).
When I try to restore to a 3-segment cluster with jobs: gprestore --verbose --plugin-config <PATH_TO_PLUGIN_CONFIG> --resize-cluster --jobs 3 -timestamp 20241115131608 the restore hangs.

Reproduced, but the behavior is very strange: helpers did not complete, but expect something from zombie-plugins.

I returned contexts to successfully handle the situation when the helper dies with signal 9 or 11.

@whitehawk
Copy link

It looks that --on-error-continue is not working properly now.

If I use previously attached 6-segment backup with injected error on a 1-segment cluster with --on-error-continue:

gprestore --verbose --plugin-config <PATH_TO_PLUGIN_CONFIG> --resize-cluster --jobs 1 -timestamp 20241115131608 --on-error-continue

it hangs for 300 sec timeout on 2nd and 3rd tables, and finally doesn't restore them.

@RekGRpth
Copy link
Member Author

RekGRpth commented Nov 21, 2024

It looks that --on-error-continue is not working properly now.

If I use previously attached 6-segment backup with injected error on a 1-segment cluster with --on-error-continue:

gprestore --verbose --plugin-config <PATH_TO_PLUGIN_CONFIG> --resize-cluster --jobs 1 -timestamp 20241115131608 --on-error-continue

it hangs for 300 sec timeout on 2nd and 3rd tables, and finally doesn't restore them.

This is not a problem with this patch, because it is reproduced without it!
Problem is solved by #120.
Also, I checked that SIGPIPE is received with plugin too.

whitehawk
whitehawk previously approved these changes Nov 22, 2024
helper/helper.go Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants