Fix stuck query after cancel or termination when segment is not responding #1132

whitehawk · 2024-11-25T11:38:54Z

Fix stuck query after cancel or termination when segment is not responding (#948)

Problem:
The following scenario caused a stuck state of a coordinator backend process:

At some moment one of the segments stopped processing incoming requests (the
reason itself is not important, as sometimes bad things may happen with any
segment, and the system should be able to recover).
A query, which dispatches requests to segments, was executed (for example,
select from 'gp_toolkit.gp_resgroup_status_per_segment'). As one of the segments
was not responding, the coordinator hanged in 'checkDispatchResult' (it is
expected, and can be handled by the FTS).
The stuck query was canceled or terminated (by 'pg_cancel_backend' or
'pg_terminate_backend'). It didn't help to return from the stuck state, but
after this step the query became completely unrecoverable. Even after FTS had
detected the malfunction segment and had promoted the mirror, the query was
still hanging forever.

Expected behavior is that after FTS mirror promotion, all stuck queries are
unblocked and canceled successfully.

Note: if FTS mirror promotion happened before step 3, the FTS canceled the
query successfully.

Root cause:
During cancel or termination, the coordinator tried to do abort of the current
transaction, and it hanged in the function 'internal_cancel' (called from
PQcancel) on 'poll' system call. It has a timeout of 660 seconds, and, moreover,
after the timeout expired, it looped forever trying to do 'poll' again. So, if
the socket was opened on the segment side, but nobody replied (as the segment
process became not available for some reason), 'internal_cancel' had no way to
return.

Fix:
Once FTS promotes a mirror, it sends the signal to the coordinator
postmaster (with PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving of the
signal, the coordinator postmaster sends the SIGUSR1 signal (with
PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its usual backends. Once the
backend receives the signal, if it is in the state of cancelling or terminating
of the query, it sets a flag in libpq. 'internal_cancel' checks this flag before
calling the 'poll' system call. If it is set, it will return with an error.
Thus:
a. if the FTS promotion happens before cancel/terminate, the query will be
canceled by the old logic;
b. if the FTS promotion happens after cancel/terminate, but before the
'internal_cancel' calls the 'poll', 'internal_cancel' will return an error
without calling the 'poll';
c. if the FTS promotion happens when the 'poll' is already called, the 'poll'
will return EINTR (as the SIGUSR1 was received), and a new 'poll' will not be
called, as the flag is set.

(cherry picked from commit 96ed1ad)

Differences comparing to the original commit:

exec_cmd_on_segments() is placed inside setup.sql file instead of
server_helper.sql, as the latter doesn't exist in GPDB 7 (and all functions from
this file are located in setup.sql)
The check 'GpIdentity.segindex == MASTER_CONTENT_ID' in 'sigusr1_handler()'
is replaced with 'IS_QUERY_DISPATCHER()', which is equivalent. It is done,
because MASTER_CONTENT_ID was renamed to COORDINATOR_CONTENT_ID in GPDB 7.
'TermSignalReceived' global var was removed in GPDB 7 (when it was merged
with PostgreSQL 9.5), so it is brought back.
exec_cmd_on_segments() is reworked due to limitation in GPDB 7. Previous
version failed with "ERROR: EXECUTE ON ALL SEGMENTS is only supported for
set-returning functions", so now it returns a table with TEXT field instead of
a single string. Also, exec_cmd_on_segments() is changed to use 'plpython3u'
language instead of 'plpythonu'.
Test is updated to work on GPDB 7.
File 'isolation2_resgroup_schedule' doesn't exist anymore, so the test is
added to 'isolation2_resgroup_v1_schedule' and 'isolation2_resgroup_v2_schedule'
files.

Should be rebased.

…nding (#948) Problem: The following scenario caused a stuck state of a coordinator backend process: 1. At some moment one of the segments stopped processing incoming requests (the reason itself is not important, as sometimes bad things may happen with any segment, and the system should be able to recover). 2. A query, which dispatches requests to segments, was executed (for example, select from 'gp_toolkit.gp_resgroup_status_per_segment'). As one of the segments was not responding, the coordinator hanged in 'checkDispatchResult' (it is expected, and can be handled by the FTS). 3. The stuck query was canceled or terminated (by 'pg_cancel_backend' or 'pg_terminate_backend'). It didn't help to return from the stuck state, but after this step the query became completely unrecoverable. Even after FTS had detected the malfunction segment and had promoted the mirror, the query was still hanging forever. Expected behavior is that after FTS mirror promotion, all stuck queries are unblocked and canceled successfully. Note: if FTS mirror promotion happened before step 3, the FTS canceled the query successfully. Root cause: During cancel or termination, the coordinator tried to do abort of the current transaction, and it hanged in the function 'internal_cancel' (called from PQcancel) on 'poll' system call. It has a timeout of 660 seconds, and, moreover, after the timeout expired, it looped forever trying to do 'poll' again. So, if the socket was opened on the segment side, but nobody replied (as the segment process became not available for some reason), 'internal_cancel' had no way to return. Fix: Once FTS promotes a mirror, it sends the signal to the coordinator postmaster (with PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving of the signal, the coordinator postmaster sends the SIGUSR1 signal (with PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its usual backends. Once the backend receives the signal, if it is in the state of cancelling or terminating of the query, it sets a flag in libpq. 'internal_cancel' checks this flag before calling the 'poll' system call. If it is set, it will return with an error. Thus: a. if the FTS promotion happens before cancel/terminate, the query will be canceled by the old logic; b. if the FTS promotion happens after cancel/terminate, but before the 'internal_cancel' calls the 'poll', 'internal_cancel' will return an error without calling the 'poll'; c. if the FTS promotion happens when the 'poll' is already called, the 'poll' will return EINTR (as the SIGUSR1 was received), and a new 'poll' will not be called, as the flag is set. (cherry picked from commit 96ed1ad) Differences comparing to the original commit: 1. exec_cmd_on_segments() is placed inside setup.sql file instead of server_helper.sql, as the latter doesn't exist in GPDB 7 (and all functions from this file are located in setup.sql) 2. The check 'GpIdentity.segindex == MASTER_CONTENT_ID' in 'sigusr1_handler()' is replaced with 'IS_QUERY_DISPATCHER()', which is equivalent. It is done, because MASTER_CONTENT_ID was renamed to COORDINATOR_CONTENT_ID in GPDB 7. 3. 'TermSignalReceived' global var was removed in GPDB 7 (when it was merged with PostgreSQL 9.5), so it is brought back. 4. exec_cmd_on_segments() is reworked due to limitation in GPDB 7. Previous version failed with "ERROR: EXECUTE ON ALL SEGMENTS is only supported for set-returning functions", so now it returns a table with TEXT field instead of a single string. Also, exec_cmd_on_segments() is changed to use 'plpython3u' language instead of 'plpythonu'. 5. Test is updated to work on GPDB 7. 6. File 'isolation2_resgroup_schedule' doesn't exist anymore, so the test is added to 'isolation2_resgroup_v1_schedule' and 'isolation2_resgroup_v2_schedule' files.

whitehawk force-pushed the ADBDEV-5704 branch 2 times, most recently from 62faa56 to 4b25afa Compare November 26, 2024 00:49

whitehawk force-pushed the ADBDEV-5704 branch from 4b25afa to 8001db6 Compare November 26, 2024 01:58

whitehawk marked this pull request as ready for review November 26, 2024 10:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix stuck query after cancel or termination when segment is not responding #1132

Fix stuck query after cancel or termination when segment is not responding #1132

whitehawk commented Nov 25, 2024 •

edited

Loading

Fix stuck query after cancel or termination when segment is not responding #1132

Are you sure you want to change the base?

Fix stuck query after cancel or termination when segment is not responding #1132

Conversation

whitehawk commented Nov 25, 2024 • edited Loading

whitehawk commented Nov 25, 2024 •

edited

Loading