ADBDEV-6641: Rework Fix gprestore/gpbackup hanging in case the helper goes down #113

RekGRpth · 2024-11-12T10:32:10Z

Rework Fix gprestore/gpbackup hanging in case the helper goes down

Commit 8060b4e starts a goroutine in gpbackup/gprestore that polls every 5
seconds to check whether the helper has failed and, if so, cancels pending COPY
commands via the execution context. I think running commands over ssh that
often is a big overhead, especially on large clusters.
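
For reference, the polling approach described above can be sketched roughly as follows. This is only an illustration and not the code from commit 8060b4e: watchHelper, helperFailed and the millisecond intervals are hypothetical, and the real check runs a command over ssh instead of an in-process function.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// watchHelper polls helperFailed once per interval and cancels the context
// as soon as a failure is reported. (Hypothetical name; the real check would
// run a command over ssh.)
func watchHelper(ctx context.Context, cancel context.CancelFunc,
	interval time.Duration, helperFailed func() bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // the work finished or was cancelled elsewhere
		case <-ticker.C:
			if helperFailed() {
				cancel()
				return
			}
		}
	}
}

// demoPolling stands in for a pending COPY command bound to ctx: it waits
// until the watcher cancels the context and returns the context's error.
func demoPolling() error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	start := time.Now()
	// Pretend the helper dies 30ms after the start.
	helperFailed := func() bool { return time.Since(start) > 30*time.Millisecond }

	go watchHelper(ctx, cancel, 10*time.Millisecond, helperFailed)

	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(2 * time.Second):
		return nil // the watcher never fired
	}
}

func main() {
	fmt.Println(demoPolling())
}
```

The cost that matters here is the body of helperFailed: with a 5-second interval it runs for the whole duration of the backup or restore, whether or not anything ever fails.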

In case of a fatal error in the helper, gprestore/gpbackup could hang forever.
gprestore hung because the COPY command was waiting for data from a pipe file
(via 'cat <pipe_file>'), but the restore helper exited due to an error and its
DoCleanup function deleted the pipe before any data had been written to it.
gpbackup hung for a similar reason: the backup helper exited before it opened
the pipe for reading.

The new solution is quite simple: on exit, if it did not receive SIGPIPE, the
helper opens and closes the pipes that the COPY command is waiting for, which
causes the COPY command to exit as well. The helpers no longer keep pipe lists,
since an error in a helper can occur before these lists are filled.
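
The open/close idea can be demonstrated in isolation. The sketch below is not the helper's code: openClosePipe and demoUnblockReader are hypothetical names, the goroutine plays the role of 'cat <pipe_file>', and the pipe is opened in blocking mode to keep the demo deterministic. A real helper would presumably need a non-blocking open so that it cannot hang itself when nobody is on the other end. It assumes a Unix-like system with FIFO support.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

// openClosePipe (hypothetical) opens the named pipe with the given flag and
// closes it right away, releasing a peer blocked on the other end.
func openClosePipe(path string, flag int) error {
	f, err := os.OpenFile(path, flag, 0)
	if err != nil {
		return err
	}
	return f.Close()
}

// demoUnblockReader starts a reader that blocks on an empty FIFO, then
// opens and closes the write end. It returns the number of bytes the
// reader received before it saw EOF.
func demoUnblockReader() (int, error) {
	dir, err := os.MkdirTemp("", "pipe-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	pipePath := filepath.Join(dir, "pipe_0")
	if err := syscall.Mkfifo(pipePath, 0600); err != nil {
		return 0, err
	}

	type result struct {
		n   int
		err error
	}
	done := make(chan result, 1)
	go func() {
		// Like 'cat <pipe_file>': open blocks until a writer shows up.
		r, err := os.Open(pipePath)
		if err != nil {
			done <- result{0, err}
			return
		}
		defer r.Close()
		data, err := io.ReadAll(r) // returns at EOF once all writers close
		done <- result{len(data), err}
	}()

	// The "failing helper": open for writing, write nothing, close.
	if err := openClosePipe(pipePath, os.O_WRONLY); err != nil {
		return 0, err
	}

	select {
	case res := <-done:
		return res.n, res.err
	case <-time.After(5 * time.Second):
		return 0, fmt.Errorf("reader is still blocked")
	}
}

func main() {
	n, err := demoUnblockReader()
	fmt.Println(n, err)
}
```

For the backup direction the roles are reversed: the side that would normally read from the pipe opens it with os.O_RDONLY and closes it, so the blocked writer can proceed and fail.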

The test has been modified to produce and check more relevant results.

helper/helper.go

helper/restore_helper.go

end_to_end/end_to_end_suite_test.go

RekGRpth · 2024-11-15T08:47:54Z

I've backed up data from a 6-segment cluster and injected an error into the data (backup data attached - 6-segment-backup-with-error.tar.gz).
When I try to restore it to a 3-segment cluster with --jobs:

gprestore --verbose --plugin-config <PATH_TO_PLUGIN_CONFIG> --resize-cluster --jobs 3 --timestamp 20241115131608

the restore hangs.

Reproduced, but the behavior is very strange: the helpers did not finish and were instead waiting for something from zombie plugin processes.

I restored the contexts to handle the case where the helper dies from signal 9 or 11.
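
A process killed by signal 9 (SIGKILL) cannot run any exit cleanup, so presumably the open/close step never happens in that case and an external check is still required. As a side illustration only, and not the mechanism used in this PR, a parent process in Go can tell that a child died from a signal like this; killedBy and demoKilled are hypothetical names, and the sketch assumes a Unix-like system with sh available.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"syscall"
)

// killedBy reports the signal that terminated a child process, if the
// error returned by Run/Wait says the child was killed by a signal.
func killedBy(err error) (syscall.Signal, bool) {
	var exitErr *exec.ExitError
	if !errors.As(err, &exitErr) {
		return 0, false
	}
	ws, ok := exitErr.Sys().(syscall.WaitStatus)
	if !ok || !ws.Signaled() {
		return 0, false
	}
	return ws.Signal(), true
}

// demoKilled runs a shell that kills itself with SIGKILL and inspects
// how it died.
func demoKilled() (syscall.Signal, bool) {
	err := exec.Command("sh", "-c", "kill -9 $$").Run()
	return killedBy(err)
}

func main() {
	sig, ok := demoKilled()
	fmt.Println(int(sig), ok)
}
```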

whitehawk · 2024-11-21T04:54:44Z

It looks like --on-error-continue is not working properly now.

If I use the previously attached 6-segment backup with the injected error on a 1-segment cluster with --on-error-continue:

gprestore --verbose --plugin-config <PATH_TO_PLUGIN_CONFIG> --resize-cluster --jobs 1 --timestamp 20241115131608 --on-error-continue

it hangs until the 300-second timeout on the 2nd and 3rd tables, and in the end does not restore them.

RekGRpth · 2024-11-21T05:13:46Z

It looks like --on-error-continue is not working properly now.

If I use the previously attached 6-segment backup with the injected error on a 1-segment cluster with --on-error-continue:
gprestore --verbose --plugin-config <PATH_TO_PLUGIN_CONFIG> --resize-cluster --jobs 1 --timestamp 20241115131608 --on-error-continue
it hangs until the 300-second timeout on the 2nd and 3rd tables, and in the end does not restore them.

This is not a problem with this patch, because it reproduces without it!
The problem is solved by #120.
Also, I checked that SIGPIPE is received with a plugin too.
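
The "was it sigpiped" condition from the description can be sketched with an atomic flag set from a signal handler. This is an assumption about the general shape, not the code in helper/helper.go: wasSigpiped, watchSigpipe and shouldOpenClosePipes are hypothetical names, and atomic.Bool requires Go 1.19 or newer. The demo sends SIGPIPE to its own process to simulate a broken pipe.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

// wasSigpiped (hypothetical) records whether the process received SIGPIPE.
var wasSigpiped atomic.Bool

// watchSigpipe subscribes to SIGPIPE and sets the flag when it arrives.
// The returned function stops the subscription.
func watchSigpipe() (stop func()) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGPIPE)
	done := make(chan struct{})
	go func() {
		for {
			select {
			case <-ch:
				wasSigpiped.Store(true)
			case <-done:
				return
			}
		}
	}()
	return func() {
		signal.Stop(ch)
		close(done)
	}
}

// shouldOpenClosePipes mirrors the rule from the description: the pipes
// are opened and closed on exit only if SIGPIPE was not received.
func shouldOpenClosePipes() bool {
	return !wasSigpiped.Load()
}

// demoSigpipe returns the decision before and after SIGPIPE is delivered.
func demoSigpipe() (before, after bool, err error) {
	stop := watchSigpipe()
	defer stop()

	before = shouldOpenClosePipes()
	if err = syscall.Kill(os.Getpid(), syscall.SIGPIPE); err != nil {
		return
	}
	// Wait briefly for the handler goroutine to observe the signal.
	deadline := time.Now().Add(2 * time.Second)
	for !wasSigpiped.Load() && time.Now().Before(deadline) {
		time.Sleep(10 * time.Millisecond)
	}
	after = shouldOpenClosePipes()
	return
}

func main() {
	before, after, err := demoSigpipe()
	fmt.Println(before, after, err)
}
```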

RekGRpth added 2 commits November 12, 2024 13:30

rework

fbba35b

fix

934657e

RekGRpth changed the title ~~ADBDEV-6641: Adbdev 6641~~ ADBDEV-6641: Rework Fix gprestore/gpbackup hanging in case the helper goes down Nov 12, 2024

RekGRpth added 2 commits November 12, 2024 17:07

fix

8448381

fix

6d6d64d

RekGRpth marked this pull request as ready for review November 12, 2024 15:14

dkovalev1 reviewed Nov 13, 2024

helper/helper.go (outdated, resolved)

dkovalev1 reviewed Nov 13, 2024

helper/restore_helper.go (outdated, resolved)

RekGRpth added 2 commits November 13, 2024 09:49

optimize

d7d9dcd

move

c3cbd73

RekGRpth added 2 commits November 13, 2024 13:38

partial restore

99885ff

rework solutiuon

62a594c

RekGRpth marked this pull request as draft November 14, 2024 03:36

optimize and simplify

9a9db2e

RekGRpth marked this pull request as ready for review November 14, 2024 04:23

whitehawk reviewed Nov 14, 2024

end_to_end/end_to_end_suite_test.go (outdated, resolved)

RekGRpth added 4 commits November 14, 2024 17:02

remove context

e19a15f

fix tests

a30092d

rm

c5dd6db

stabellize test

5181a10

whitehawk reviewed Nov 15, 2024

end_to_end/end_to_end_suite_test.go (outdated, resolved)

signal pipe instead open/close

1907c3c

dkovalev1 reviewed Nov 15, 2024

helper/helper.go (outdated, resolved)

dkovalev1 reviewed Nov 15, 2024

helper/helper.go (outdated, resolved)

update log messages

2a83c5b

dkovalev1 reviewed Nov 15, 2024

helper/helper.go (outdated, resolved)

RekGRpth added 2 commits November 15, 2024 13:42

restore contexts

a1628ea

simplify

8a94b59

RekGRpth added 5 commits November 15, 2024 15:25

stabellize test

9fb4727

open/close

484b242

timeout

5fe1f3c

Merge branch 'master' into ADBDEV-6641

f5330bf

open/close pipe inly if not terminated

0c9f756

RekGRpth force-pushed the ADBDEV-6641 branch from 869d865 to 0c9f756 Compare November 20, 2024 09:49

sigpiped

a5e1a93

whitehawk reviewed Nov 22, 2024

helper/helper.go (resolved)

RekGRpth added 2 commits November 22, 2024 12:49

comment

5ddd474

Merge branch 'master' into ADBDEV-6641

f1e7036

whitehawk previously approved these changes Nov 22, 2024

dkovalev1 reviewed Nov 29, 2024

helper/helper.go (outdated, resolved)

dkovalev1 reviewed Nov 29, 2024

helper/helper.go (resolved)

use atomic

8fa7ff0

RekGRpth dismissed whitehawk’s stale review via 8fa7ff0 November 29, 2024 15:42

simplify

e1d0fad

dkovalev1 approved these changes Dec 2, 2024
