
crash on PG 9.6.5 make installcheck #15

Closed
anse1 opened this issue Sep 9, 2017 · 3 comments

anse1 commented Sep 9, 2017

Hi,

running the PostgreSQL 9.6.5 regression tests on a database with powa + pg_qualstats installed has a 30% chance of crashing for me while running the object_address test. Backtrace below.

regards,
andreas

    Program terminated with signal SIGSEGV, Segmentation fault.
    #0  pgqs_collectNodeStats (planstate=0x557e4d260cd8, context=context@entry=0x557e4d26f368, ancestors=0x0) at pg_qualstats.c:938
    938		total_filtered = planstate->instrument->nfiltered1 + planstate->instrument->nfiltered2;
    (gdb) bt
    #0  pgqs_collectNodeStats (planstate=0x557e4d260cd8, context=context@entry=0x557e4d26f368, ancestors=0x0) at pg_qualstats.c:938
    #1  0x00007f3d6b55fa0d in pgqs_ExecutorEnd (queryDesc=0x557e4d25ebe8) at pg_qualstats.c:695
    #2  0x0000557e4b3c96c3 in PortalCleanup (portal=0x557e4d154d58) at portalcmds.c:280
    #3  0x0000557e4b64bbf3 in PortalDrop (portal=0x557e4d154d58, isTopCommit=<optimized out>) at portalmem.c:510
    #4  0x0000557e4b43020c in SPI_cursor_close (portal=<optimized out>) at spi.c:1487
    #5  0x00007f3d61f9904e in exec_stmt_fors (stmt=0x557e4d22e748, estate=0x7ffc6d6fdf80) at pl_exec.c:2121
    #6  exec_stmt (stmt=0x557e4d22e748, estate=0x7ffc6d6fdf80) at pl_exec.c:1467
    #7  exec_stmts (estate=0x7ffc6d6fdf80, stmts=<optimized out>) at pl_exec.c:1398
    #8  0x00007f3d61f9c6f1 in exec_for_query (estate=estate@entry=0x7ffc6d6fdf80, stmt=stmt@entry=0x557e4d22e570, portal=0x557e4d154c40, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
    #9  0x00007f3d61f9903e in exec_stmt_fors (stmt=0x557e4d22e570, estate=0x7ffc6d6fdf80) at pl_exec.c:2116
    #10 exec_stmt (stmt=0x557e4d22e570, estate=0x7ffc6d6fdf80) at pl_exec.c:1467
    #11 exec_stmts (estate=0x7ffc6d6fdf80, stmts=<optimized out>) at pl_exec.c:1398
    #12 0x00007f3d61f9c6f1 in exec_for_query (estate=estate@entry=0x7ffc6d6fdf80, stmt=stmt@entry=0x557e4d22e1c8, portal=0x557e4d154b28, prefetch_ok=prefetch_ok@entry=1 '\001') at pl_exec.c:5209
    #13 0x00007f3d61f9903e in exec_stmt_fors (stmt=0x557e4d22e1c8, estate=0x7ffc6d6fdf80) at pl_exec.c:2116
    #14 exec_stmt (stmt=0x557e4d22e1c8, estate=0x7ffc6d6fdf80) at pl_exec.c:1467
    #15 exec_stmts (estate=0x7ffc6d6fdf80, stmts=<optimized out>) at pl_exec.c:1398
    #16 0x00007f3d61f9b627 in exec_stmt_block (estate=estate@entry=0x7ffc6d6fdf80, block=0x557e4d22f678) at pl_exec.c:1336
    #17 0x00007f3d61f9b845 in plpgsql_exec_function (func=func@entry=0x557e4d20e930, fcinfo=fcinfo@entry=0x7ffc6d6fe1c0, simple_eval_estate=simple_eval_estate@entry=0x557e4d223da0) at pl_exec.c:434
    #18 0x00007f3d61f901f1 in plpgsql_inline_handler (fcinfo=<optimized out>) at pl_handler.c:329
    #19 0x0000557e4b62e262 in OidFunctionCall1Coll (functionId=<optimized out>, collation=collation@entry=0, arg1=94000947505744) at fmgr.c:1592
    #20 0x0000557e4b3bef72 in ExecuteDoStmt (stmt=stmt@entry=0x557e4d202d00) at functioncmds.c:2177
    #21 0x0000557e4b52d178 in standard_ProcessUtility (parsetree=0x557e4d202d00, queryString=0x557e4d1bfde0 [...], context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x557e4d1c14f0, completionTag=0x7ffc6d6ff060 "") at utility.c:515
    #22 0x00007f3d6b768d1d in pgss_ProcessUtility (parsetree=0x557e4d202d00, queryString=0x557e4d1bfde0 [...], context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x557e4d1c14f0, completionTag=0x7ffc6d6ff060 "") at pg_stat_statements.c:986
    #23 0x0000557e4b529f34 in PortalRunUtility (portal=0x557e4d154a10, utilityStmt=0x557e4d202d00, isTopLevel=<optimized out>, setHoldSnapshot=<optimized out>, dest=0x557e4d1c14f0, completionTag=0x7ffc6d6ff060 "") at pquery.c:1193
    #24 0x0000557e4b52aad2 in PortalRunMulti (portal=portal@entry=0x557e4d154a10, isTopLevel=isTopLevel@entry=1 '\001', setHoldSnapshot=setHoldSnapshot@entry=0 '\000', dest=dest@entry=0x557e4d1c14f0, altdest=altdest@entry=0x557e4d1c14f0, completionTag=completionTag@entry=0x7ffc6d6ff060 "") at pquery.c:1349
    #25 0x0000557e4b52b81e in PortalRun (portal=portal@entry=0x557e4d154a10, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x557e4d1c14f0, altdest=altdest@entry=0x557e4d1c14f0, completionTag=completionTag@entry=0x7ffc6d6ff060 "") at pquery.c:815
    #26 0x0000557e4b528f62 in exec_simple_query (query_string=0x557e4d1bfde0 [...]) at postgres.c:1086
    #27 PostgresMain (argc=<optimized out>, argv=argv@entry=0x557e4d1636b0, dbname=<optimized out>, username=<optimized out>) at postgres.c:4074
    #28 0x0000557e4b27e13d in BackendRun (port=0x557e4d15fb50) at postmaster.c:4294
    #29 BackendStartup (port=0x557e4d15fb50) at postmaster.c:3968
    #30 ServerLoop () at postmaster.c:1719
    #31 0x0000557e4b4be2e3 in PostmasterMain (argc=3, argv=0x557e4d135520) at postmaster.c:1327
    #32 0x0000557e4b27f6bd in main (argc=3, argv=0x557e4d135520) at main.c:228
    (gdb) list
    933		if (list_length(parent) > 1)
    934		{
    935			context->uniquequalid = hashExpr((Expr *) parent, context, true);
    936			context->qualid = hashExpr((Expr *) parent, context, false);
    937		}
    938		total_filtered = planstate->instrument->nfiltered1 + planstate->instrument->nfiltered2;
    939		context->nbfiltered = planstate->instrument->nfiltered1 + planstate->instrument->nfiltered2;
    940		context->count = planstate->instrument->tuplecount + planstate->instrument->ntuples + total_filtered;
    941		/* Add the indexquals */
    942		context->evaltype = 'i';
    (gdb) p planstate->instrument
    $1 = (Instrumentation *) 0x0
    (gdb) p debug_query_string
    $2 = 0x557e4d1bfde0 "DO $$\nDECLARE\n\tobjtype text;\n\tnames\ttext[];\n\targs\ttext[];\nBEGIN\n\tFOR objtype IN VALUES\n\t\t('table'), ('index'), ('sequence'), ('view'),\n\t\t('materialized view'), ('foreign table'),\n\t\t('table column'), ('foreign table column'),\n\t\t('aggregate'), ('function'), ('type'), ('cast'),\n\t\t('table constraint'), ('domain constraint'), ('conversion'), ('default value'),\n\t\t('operator'), ('operator class'), ('operator family'), ('rule'), ('trigger'),\n\t\t('text search parser'), ('text search dictionary'),\n\t\t('text search template'), ('text search configuration'),\n\t\t('policy'), ('user mapping'), ('default acl'), ('transform'),\n\t\t('operator of access method'), ('function of access method')\n\tLOOP\n\t\tFOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}')\n\t\tLOOP\n\t\t\tFOR args IN VALUES ('{}'), ('{integer}')\n\t\t\tLOOP\n\t\t\t\tBEGIN\n\t\t\t\t\tPERFORM pg_get_object_address(objtype, names, args);\n\t\t\t\tEXCEPTION WHEN OTHERS THEN\n\t\t\t\t\t\tRAISE WARNING 'error for %,%,%: %', objtype, names, args, sqlerrm;\n\t\t\t\tEND;\n\t\t\tEND LOOP;\n\t\tEND LOOP;\n\tEND LOOP;\nEND;\n$$;"
    (gdb) 

@rjuju rjuju added the bug label Sep 9, 2017
@rjuju rjuju self-assigned this Sep 9, 2017

rjuju commented Sep 9, 2017

Hello,

Thanks a lot for the detailed report and the analysis! I can also reproduce this issue. I'll dig into this, since I don't see an obvious reason why no instrumentation is available here.


rjuju commented Sep 11, 2017

Ok I understand the issue. FTR, the related statement is

DO $$
DECLARE
    objtype text;
    names    text[];
    args    text[];
BEGIN
    FOR objtype IN VALUES
        ('table'), ('index'), ('sequence'), ('view'),
        ('materialized view'), ('foreign table'),
        ('table column'), ('foreign table column'),
        ('aggregate'), ('function'), ('type'), ('cast'),
        ('table constraint'), ('domain constraint'), ('conversion'), ('default value'),
        ('operator'), ('operator class'), ('operator family'), ('rule'), ('trigger'),
        ('text search parser'), ('text search dictionary'),
        ('text search template'), ('text search configuration'),
        ('policy'), ('user mapping'), ('default acl'), ('transform'),
        ('operator of access method'), ('function of access method')
    LOOP
        FOR names IN VALUES ('{eins}'), ('{addr_nsp, zwei}'), ('{eins, zwei, drei}')
        LOOP
            FOR args IN VALUES ('{}'), ('{integer}')
            LOOP
                BEGIN
                    PERFORM pg_get_object_address(objtype, names, args);
                EXCEPTION WHEN OTHERS THEN
                        RAISE WARNING 'error for %,%,%: %', objtype, names, args, sqlerrm;
                END;
            END LOOP;
        END LOOP;
    END LOOP;
END;
$$;

I'm not familiar with plpgsql internals, but IIUC, with such nested FOR loops, multiple executors are started in a nested way. Each executor start is therefore seen as having a nested_level of 0, so query_is_sampled is set independently for each of them. The problem is that query_is_sampled is a global variable.

So if the first executor isn't sampled but the second is, query_is_sampled will be set to true, and the first call to pgqs_ExecutorEnd will think that the query was sampled while it wasn't, try to read the instrument field and happily fail.

I don't really see an easy way to fix it. We could of course add an if (planstate->instrument) in pgqs_ExecutorEnd to ignore "false positive sampled queries", but we'd still miss the opposite case. Also, we'd need to make sure it wasn't something else, like auto_explain adding some instrumentations.

The easiest and safest solution I can imagine would be to save the query_is_sampled information in the QueryDesc, but after a quick look I don't see any custom field available nearby.

I guess we could track start/stop sequences and keep the same sampled state for any inner executor, but if any error happens, executor end won't be called, which means the backend will be forever stuck in either sampled state :(

Do you have any idea on a proper fix?


rjuju commented Oct 22, 2017

Hi @anse1, sorry for the late answer.

I couldn't find an easy solution to fix this issue, so I just added a test on the effective instrumentation to avoid the "false positive" case. The point of sampling is to avoid overhead on highly loaded servers, so missing some queries in such cases shouldn't make any difference.

Fix pushed in ae73c10, thanks again for the report!

@rjuju rjuju closed this as completed Oct 22, 2017