fix redshift auto user deadlocking #43335
Conversation
This fix seems incredibly fragile - shouldn't we be retrying deadlocked transactions? And if so, what's the problem?
The part I don't understand is why the deactivate/delete scripts don't deadlock with the activation script, which does revoke role -> alter user -> grant role.
Too bad Redshift does not support pg_advisory_lock, which would be a cool easy fix.
Not sure how LOCK works in Redshift. It seems you can explicitly call LOCK inside the transaction on the list of tables, in a specific order, to avoid deadlock? Say LOCK tb1, tb2, tb3. (I did not test.)
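A rough sketch of that idea (untested against Redshift, as the comment says): every transaction that touches these tables takes explicit LOCKs on them in the same sorted order up front. The lockTablesInOrder helper and table names here are hypothetical, not code from this PR.

```go
package dblock

import (
	"context"
	"database/sql"
	"sort"
)

// lockTablesInOrder issues LOCK statements in a deterministic (sorted)
// order so that two concurrent transactions can never acquire the same
// locks in conflicting orders. The locks are held until the surrounding
// transaction commits or rolls back.
//
// Table names must come from a trusted, hard-coded list: LOCK takes no
// bind parameters, so they are interpolated directly into the statement.
func lockTablesInOrder(ctx context.Context, tx *sql.Tx, tables []string) error {
	sorted := append([]string(nil), tables...)
	sort.Strings(sorted)
	for _, t := range sorted {
		if _, err := tx.ExecContext(ctx, "LOCK "+t); err != nil {
			return err
		}
	}
	return nil
}
```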
The reactivation case calls the deactivate function first, so it acquires and holds all the locks in the same order as deactivate before doing anything else. The locks are not released after that deactivate call.
I tried doing this, but even as the db master user I could not lock the relevant tables - those oids correspond to pg_shadow, pg_identity, and a couple of others.
As a fix to the general problem of deadlocking, yeah, it's not a good fix. It does prevent Teleport from deadlocking itself though, which is a nice "optimization", and it makes our failing e2e tests stop randomly deadlocking. tbh I've avoided doing anything with retries because I wasn't sure how to go about it - do we need some backoff logic? Should it be done in the script or in Go? If you have some advice here, even from a normal postgres perspective, please share and I'll add retries.
I think we should have at least a small handful of retries (and the transaction doing the user manipulation should probably run with serializable isolation?). I don't know if plpgsql is powerful enough to do that on its own or if we should do it from the client.
The only recommendation about transaction retries mentions deadlock as a secondary error one should retry on, in https://www.postgresql.org/docs/current/mvcc-serialization-failure-handling.html (with serialization failure being the primary error that should trigger a transaction retry).
(force-pushed from 1e3305a to 1af3bb4)
It's not clear to me what isolation level these procedures are using, based on the Redshift docs, which seem to contradict themselves.
The BEGIN reference says: https://docs.aws.amazon.com/redshift/latest/dg/r_BEGIN.html
then there's this:
but that's a lie. It does lock, and that lock blocks other create/drop user operations. Committed changes are also not visible to transactions that were already open: I tested this by opening two transactions, dropping a user in one and committing, then checking users again; the user still appeared to exist in the other transaction. It's a mess - I think it's using serializable isolation, but idk.
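For the record, the visibility experiment looks roughly like this in Go. The DSN, driver choice, and user name are placeholders, not code from this PR; it assumes Redshift is reachable over the postgres wire protocol.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumption: Redshift speaks the postgres wire protocol
)

func main() {
	ctx := context.Background()
	// Placeholder DSN pointing at a Redshift cluster's master user.
	db, err := sql.Open("postgres", "postgres://master@example-cluster:5439/dev")
	if err != nil {
		log.Fatal(err)
	}

	tx1, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	tx2, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Drop the user in the first transaction and commit.
	if _, err := tx1.ExecContext(ctx, `DROP USER auto_drop`); err != nil {
		log.Fatal(err)
	}
	if err := tx1.Commit(); err != nil {
		log.Fatal(err)
	}

	// The second transaction was opened before the commit; under
	// snapshot/serializable-style isolation it still sees the user.
	var n int
	err = tx2.QueryRowContext(ctx,
		`SELECT count(*) FROM pg_user WHERE usename = 'auto_drop'`).Scan(&n)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("rows visible in tx2:", n) // observed behavior described above: still 1
	_ = tx2.Rollback()
}
```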
In Redshift, the only exception we can catch is "OTHERS", which matches every kind of error. So we have to retry from the client.
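A minimal client-side retry loop could look like the sketch below, assuming the lib/pq driver (the driver and the maxRetries value are assumptions, not necessarily what the PR ships). Per the postgres docs linked earlier, the retryable SQLSTATEs are 40001 (serialization_failure) and 40P01 (deadlock_detected).

```go
package dbretry

import (
	"context"
	"errors"
	"time"

	"github.com/lib/pq"
)

const maxRetries = 3

// withRetry re-runs fn, which should execute one whole transaction, when it
// fails with a retryable SQLSTATE: 40001 serialization_failure or
// 40P01 deadlock_detected. Any other error is returned immediately.
func withRetry(ctx context.Context, fn func(context.Context) error) error {
	var err error
	for attempt := 0; attempt < maxRetries; attempt++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		var pqErr *pq.Error
		if !errors.As(err, &pqErr) ||
			(pqErr.Code != "40001" && pqErr.Code != "40P01") {
			return err // not a retryable serialization failure or deadlock
		}
		// Simple linear backoff between attempts; tune as needed.
		select {
		case <-time.After(time.Duration(attempt+1) * 100 * time.Millisecond):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}
```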
(force-pushed from dc90de3 to 82a09a0)
@greedy52 since I added retry logic for the teardown code, please take another look.
Should we retry on the activation as well?
(force-pushed from 82a09a0 to 4ec91d4)
Done.
@espadolini this is ready for review again - I've added retries.
(force-pushed from 4ec91d4 to 23eb1c2)
@GavinFrazar See the table below for backport results.
Changelog: fixed Redshift auto-user deactivation/deletion failure that occurs when a user is created or deleted and another user is deactivated concurrently.
Fixes: TestDatabases/redshift_cluster flakiness #41521

This should fix our e2e test failures, I think. I can't reproduce failures after this. Despite the issue name, it's not a flaky test issue.
Context
Our e2e test for redshift seemed to be flaky, but it was actually catching a legit concurrency bug (well, not really a bug - deadlocks are kinda inevitable under serializable isolation):
I determined that the culprit is likely the deactivation and deletion scripts acquiring the same table locks in a different order. The Teleport cluster semaphore we acquire is per-user, so with two users in the tests (auto_keep/auto_drop), the bug can happen. I ran these tests on my own redshift cluster and reproduced the deadlock even when only one test was running at a time, so it wasn't due to other tests interfering.
From experimentation, the following table locks are acquired by various operations in our sql procedures:
Procedures always run in a transaction, and locks are held until the transaction either commits or rolls back.
What was happening, I think, is the following:
What this PR does
The fix is to alter the user's connection limit first, so that deactivate acquires the locks [1260 -> 4771] in the same order as the delete script.
This also brings it in line with the order of acquisition for creating a user.
This does not fix deadlocking in the general case, caused by external users/apps running queries concurrently with our procedures - that requires us to implement retries. That goes for postgres and mysql too, probably. Postgres supports row-level locking, which Redshift doesn't - maybe that's why it doesn't fail. Or maybe it has a different default isolation level 🤷. I am not going to try to fix the general case in this PR; the goal here is to make our code not deadlock itself.

I also added some logging that I found helpful for debugging, and updated the e2e test to install the procedures in a randomized schema for each test, to prevent other tests from redefining the procedures before we call them.
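The randomized-schema idea, roughly (the helper name and schema prefix are made up for illustration; this isn't the test's actual code):

```go
package e2e

import (
	"fmt"
	"math/rand"
)

// randomSchema returns a schema name unique to this test run, so each test
// installs and calls its own copy of the procedures and can't have them
// redefined underneath it by a concurrent (or older-branch) test run.
func randomSchema() string {
	return fmt.Sprintf("teleport_test_%08x", rand.Uint32())
}
```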
Note that test failures may still occur until this PR is backported, since older-branch tests can still acquire the locks in a different order and deadlock a master branch test run, although this is far less likely given the timing necessary to hit the deadlock.

Edit: retry on deactivate/delete added as well. This affects only postgres/redshift databases.