Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stuck query after cancel or termination when segment is not responding #1132

Open
wants to merge 1 commit into
base: adb-7.2.0
Choose a base branch
from

Commits on Nov 26, 2024

  1. Fix stuck query after cancel or termination when segment is not respo…

    …nding (#948)
    
    Problem:
    The following scenario caused a stuck state of a coordinator backend process:
    1. At some moment one of the segments stopped processing incoming requests (the
    reason itself is not important, as sometimes bad things may happen with any
    segment, and the system should be able to recover).
    2. A query, which dispatches requests to segments, was executed (for example,
    select from 'gp_toolkit.gp_resgroup_status_per_segment'). As one of the segments
    was not responding, the coordinator hanged in 'checkDispatchResult' (it is
    expected, and can be handled by the FTS).
    3. The stuck query was canceled or terminated (by 'pg_cancel_backend' or
    'pg_terminate_backend'). It didn't help to return from the stuck state, but
    after this step the query became completely unrecoverable. Even after FTS had
    detected the malfunction segment and had promoted the mirror, the query was
    still hanging forever.
    
    Expected behavior is that after FTS mirror promotion, all stuck queries are
    unblocked and canceled successfully.
    
    Note: if FTS mirror promotion happened before step 3, the FTS canceled the
    query successfully.
    
    Root cause:
    During cancel or termination, the coordinator tried to do abort of the current
    transaction, and it hanged in the function 'internal_cancel' (called from
    PQcancel) on 'poll' system call. It has a timeout of 660 seconds, and, moreover,
    after the timeout expired, it looped forever trying to do 'poll' again. So, if
    the socket was opened on the segment side, but nobody replied (as the segment
    process became not available for some reason), 'internal_cancel' had no way to
    return.
    
    Fix:
    Once FTS promotes a mirror, it sends the  signal to the coordinator
    postmaster (with PMSIGNAL_FTS_PROMOTED_MIRROR reason). On receiving of the
    signal, the coordinator postmaster sends the SIGUSR1 signal (with
    PROCSIG_FTS_PROMOTED_MIRROR reason) to all of its usual backends. Once the
    backend receives the signal, if it is in the state of cancelling or terminating
    of the query, it sets a flag in libpq. 'internal_cancel' checks this flag before
    calling the 'poll' system call. If it is set, it will return with an error.
    Thus:
    a. if the FTS promotion happens before cancel/terminate, the query will be
    canceled by the old logic;
    b. if the FTS promotion happens after cancel/terminate, but before the
    'internal_cancel' calls the 'poll', 'internal_cancel' will return an error
    without calling the 'poll';
    c. if the FTS promotion happens when the 'poll' is already called, the 'poll'
    will return EINTR (as the SIGUSR1 was received), and a new 'poll' will not be
    called, as the flag is set.
    
    (cherry picked from commit 96ed1ad)
    
    Differences comparing to the original commit:
    1. exec_cmd_on_segments() is placed inside setup.sql file instead of
    server_helper.sql, as the latter doesn't exist in GPDB 7 (and all functions from
    this file are located in setup.sql)
    2. The check 'GpIdentity.segindex == MASTER_CONTENT_ID' in 'sigusr1_handler()'
    is replaced with 'IS_QUERY_DISPATCHER()', which is equivalent. It is done,
    because MASTER_CONTENT_ID was renamed to COORDINATOR_CONTENT_ID in GPDB 7.
    3. 'TermSignalReceived' global var was removed in GPDB 7 (when it was merged
    with PostgreSQL 9.5), so it is brought back.
    4. exec_cmd_on_segments() is reworked due to limitation in GPDB 7. Previous
    version failed with "ERROR:  EXECUTE ON ALL SEGMENTS is only supported for
    set-returning functions", so now it returns a table with TEXT field instead of
    a single string. Also, exec_cmd_on_segments() is changed to use 'plpython3u'
    language instead of 'plpythonu'.
    5. Test is updated to work on GPDB 7.
    6. File 'isolation2_resgroup_schedule' doesn't exist anymore, so the test is
    added to 'isolation2_resgroup_v1_schedule' and 'isolation2_resgroup_v2_schedule'
    files.
    whitehawk committed Nov 26, 2024
    Configuration menu
    Copy the full SHA
    8001db6 View commit details
    Browse the repository at this point in the history