
fix redshift auto user deadlocking #43335

Merged: 1 commit into master from gavinfrazar/fix-redshift-deadlock, Jul 2, 2024

Conversation

@GavinFrazar (Contributor) commented Jun 21, 2024:

Changelog: fixed a Redshift auto-user deactivation/deletion failure that occurs when one user is created or deleted while another is deactivated concurrently.

Fixes:

This should fix our e2e test failures, I think; I can't reproduce the failures after this change.
Despite the issue name, it's not a flaky-test issue.

Context

Our e2e test for Redshift seemed to be flaky, but it was actually catching a legit concurrency bug (well, not really a bug - deadlocks are somewhat inevitable under serializable isolation):

{"caller":"postgres/engine.go:142","component":"db:service","db":"ci-database-e2e-tests-redshif
t-cluster-us-west-2-307493967395","error":"ERROR: deadlock detected (SQLSTATE 40P01)","id":"c5f
031d4-1c88-4231-b9b5-0ff070b02e8f","level":"error","message":"Failed to teardown auto user.","t
imestamp":"2024-05-27T14:21:37-07:00"}

I determined that the culprit is likely the deactivation and deletion scripts acquiring the same table locks in different orders. The Teleport cluster semaphore we acquire is per-user, so with two users in the tests (auto_keep/auto_drop), the bug can happen. I reproduced this on my own Redshift cluster even with only one test running at a time, so it wasn't other tests interfering.

From experimentation, the following table locks are acquired by various operations in our SQL procedures:

format: [first locked table oid -> second locked table oid] (statement responsible)

[1260 -> 16980] (create user)
[1260 -> 4771]  (drop user)
[1260]          (alter user)
[4775 -> 4771]  (grant role)
[4775 -> 4771]  (revoke role)

* activate script:
  * new user:
    [1260 -> 16980] (create user)
    [4775 -> 4771]  (grant roles)
  * reactivation:
    [calls deactivate] (deactivate the user)
    [1260]             (alter user connection limit)
    [4775 -> 4771]     (grant roles)

* deactivate script:
    [4775 -> 4771] (revoke roles)
    [1260]         (alter user connection limit)

* delete script:
    [1260 -> 4771]     (drop user)
    [calls deactivate] (fallback to deactivating the user if deletion fails)
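
One way to observe this ordering (not necessarily how the data above was gathered) is to poll Redshift's STV_LOCKS system table from a second session while a procedure runs. A rough Go sketch with a placeholder DSN; the column names follow the STV_LOCKS documentation:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	// Placeholder DSN: point this at the cluster under test.
	conn, err := pgx.Connect(ctx, "postgres://awsuser@example-cluster:5439/dev")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// STV_LOCKS lists table locks currently held; polling it while a
	// procedure runs in another session shows the acquisition order.
	rows, err := conn.Query(ctx,
		"SELECT table_id, lock_owner_pid, lock_status FROM stv_locks ORDER BY last_update")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var tableID, pid int64
		var status string
		if err := rows.Scan(&tableID, &pid, &status); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("table %d held by pid %d: %s\n", tableID, pid, status)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```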

Procedures always run in a transaction, and locks are held until the transaction either commits or rolls back.
What I think was happening:

  1. deactivate script revokes roles (acquires the 4775 lock, then the 4771 lock)
  2. delete script drops its user (acquires 1260, then blocks on the locked 4771)
  3. deactivate script tries to alter the user conn limit (blocks on the locked 1260)
  4. now we have a deadlock: deactivate holds 4771 and waits on 1260, while delete holds 1260 and waits on 4771

What this PR does

The fix is to alter the user conn limit first, so that deactivate acquires the locks in the order [1260 -> 4771], the same order as the delete script. This also brings it in line with the acquisition order for creating a user.
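
As a sketch of the idea (an illustrative inline version, not the PR's actual stored procedure; it assumes Redshift's REVOKE ROLE syntax and uses pgx):

```go
// deactivateUser sketches the reordered deactivation: ALTER USER first
// (locks 1260/pg_shadow), then REVOKE ROLE (locks 4775 -> 4771), so the
// overall order matches the delete path's [1260 -> 4771].
func deactivateUser(ctx context.Context, conn *pgx.Conn, user string, roles []string) error {
	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op if the commit below succeeds

	// Take the pg_shadow lock (1260) first, same as CREATE/DROP USER.
	alter := "ALTER USER " + pgx.Identifier{user}.Sanitize() + " CONNECTION LIMIT 0"
	if _, err := tx.Exec(ctx, alter); err != nil {
		return err
	}
	// Only then touch the role tables (4775 -> 4771).
	for _, role := range roles {
		revoke := "REVOKE ROLE " + pgx.Identifier{role}.Sanitize() +
			" FROM " + pgx.Identifier{user}.Sanitize()
		if _, err := tx.Exec(ctx, revoke); err != nil {
			return err
		}
	}
	return tx.Commit(ctx)
}
```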

This does not fix deadlocking in the general case, where external users/apps run queries concurrently with our procedures; that requires us to implement retries. The same probably goes for Postgres and MySQL. Postgres supports row-level locking, which Redshift doesn't - maybe that's why it doesn't fail there, or maybe it has a different default isolation level 🤷. I am not going to try to fix the general case in this PR; the goal here is to make our code not deadlock itself.

I added some logging that I found helpful for debugging, and I updated the e2e test to install the procedures in a randomized schema for each test, to prevent other tests from redefining a procedure before we call it.

  • Note that test failures may still occur until this PR is backported, since older-branch tests can still acquire the locks in a different order and deadlock a master-branch test run, though this is far less likely given the timing needed to hit the deadlock.

Edit: retry on deactivate/delete has been added as well. This affects only Postgres/Redshift databases.

@espadolini (Contributor) left a comment:

This fix seems incredibly fragile, shouldn't we be retrying deadlocked transactions? And if so, what's the problem?

Review thread on lib/srv/db/postgres/sql/redshift-activate-user.sql (outdated, resolved).
@greedy52 (Contributor) left a comment:

The part I don't understand is why the deactivate/delete scripts don't deadlock with the activation script, which does revoke role -> alter user -> grant role.

Too bad Redshift does not support pg_advisory_lock, which would be a cool, easy fix.

Not sure how LOCK works in Redshift. Seems you can explicitly call LOCK in the transaction on the list of tables in a specific order to avoid deadlock? Say LOCK tb1, tb2, tb3. (I did not test.)

Review thread on e2e/aws/redshift_test.go (resolved).
@GavinFrazar (Contributor, Author) commented Jun 21, 2024:

> The part I don't understand is why the deactivate/delete scripts don't deadlock with the activation script, which does revoke role -> alter user -> grant role.

The reactivation case calls the deactivate function first, so it acquires and holds all the locks in the same order as deactivate before doing anything else; those locks are not released after the deactivate call.
Prior to this PR it probably could deadlock between activate and delete, as a consequence of deactivate deadlocking with delete, but that case wasn't happening in tests because the timing is just very unlikely, I guess. User activation and teardown are done in parallel for the two users under test, so we just don't see one user being created while another is torn down.

> Seems you can explicitly call LOCK in the transaction on the list of tables in a specific order to avoid deadlock? Say LOCK tb1, tb2, tb3. (I did not test.)

I tried doing this, but even as the db master user I could not lock the relevant tables - these oids correspond to pg_shadow, pg_identity, and a couple of others.
There's also no SELECT ... FOR UPDATE in Redshift.
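
For reference, the oid-to-name mapping can be checked with a catalog query, assuming Redshift exposes those entries in pg_class the way Postgres does (a hypothetical helper, same pgx imports as the earlier sketch):

```go
// lookupCatalogNames resolves the table oids seen in the lock traces
// (1260, 4771, 4775, 16980) to relation names via pg_class.
func lookupCatalogNames(ctx context.Context, conn *pgx.Conn) (map[uint32]string, error) {
	rows, err := conn.Query(ctx,
		"SELECT oid, relname FROM pg_class WHERE oid IN (1260, 4771, 4775, 16980)")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	names := make(map[uint32]string)
	for rows.Next() {
		var oid uint32
		var relname string
		if err := rows.Scan(&oid, &relname); err != nil {
			return nil, err
		}
		names[oid] = relname
	}
	return names, rows.Err()
}
```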

> This fix seems incredibly fragile, shouldn't we be retrying deadlocked transactions? And if so, what's the problem?

As a fix for the general problem of deadlocking, yeah, it's not a good one. It does prevent Teleport from deadlocking itself though, which is a nice "optimization", and it makes our failing e2e tests stop randomly deadlocking.

tbh I've avoided doing anything with retries because I wasn't sure how to go about it - do we need some backoff logic? Should it be done in the script or in Go? If you have advice here, even from a plain-Postgres perspective, please share and I'll add retries.

@espadolini (Contributor):

I think we should have at least a small handful of retries (and the transaction doing the user manipulation should probably run with serializable isolation?). I don't know if plpgsql is powerful enough to do that on its own or if we should do it from the client.

@espadolini (Contributor):

The only recommendation about transaction retry I found mentions deadlock as a secondary error one should retry on, in https://www.postgresql.org/docs/current/mvcc-serialization-failure-handling.html (with serialization failure being the primary error that should trigger a transaction retry).

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch from 1e3305a to 1af3bb4 Compare June 25, 2024 00:02
@GavinFrazar (Contributor, Author) commented Jun 25, 2024:

> and the transaction doing the user manipulation should probably run with serializable isolation

It's not clear to me what isolation level these procedures are using; the Redshift docs seem to contradict themselves:

  • They say that databases use serializable isolation for all transactions by default.
  • Elsewhere they say that snapshot isolation (equivalent to repeatable read in Postgres, I think) is the default.

Then the BEGIN reference (https://docs.aws.amazon.com/redshift/latest/dg/r_BEGIN.html) says:

> Though you can use any of the four transaction isolation levels, Amazon Redshift processes all isolation levels as serializable.

Then there's this, from the serializable isolation docs (https://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html):

> System catalog tables (PG) and other Amazon Redshift system tables (STL and STV) are not locked in a transaction. Therefore, changes to database objects that arise from DDL and TRUNCATE operations are visible on commit to any concurrent transactions.
> ...
> Transactions for updates to these tables run in a read committed isolation mode. PG-prefix catalog tables don't support snapshot isolation.

But that's a lie: it does lock, the lock blocks other create/drop user operations, and changes are not visible on commit. I tested this by opening two transactions, dropping a user in one and committing, then checking users again; the user still appeared to exist in the other transaction.
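
Roughly the check described above (a sketch with a placeholder DSN and user name, using the same pgx imports as the earlier sketches):

```go
// visibilityCheck opens two sessions: it starts a transaction on each,
// drops a user and commits on one, then re-reads pg_user from the
// transaction still open on the other. If the dropped user is still
// visible there, catalog changes are not "visible on commit to any
// concurrent transactions" as the docs claim.
func visibilityCheck(ctx context.Context, dsn string) error {
	a, err := pgx.Connect(ctx, dsn)
	if err != nil {
		return err
	}
	defer a.Close(ctx)

	b, err := pgx.Connect(ctx, dsn)
	if err != nil {
		return err
	}
	defer b.Close(ctx)

	// Open B's transaction before the drop commits, and take a snapshot.
	txB, err := b.Begin(ctx)
	if err != nil {
		return err
	}
	defer txB.Rollback(ctx)
	if _, err := txB.Exec(ctx, "SELECT count(*) FROM pg_user"); err != nil {
		return err
	}

	// Drop the user in a transaction on A and commit it.
	txA, err := a.Begin(ctx)
	if err != nil {
		return err
	}
	if _, err := txA.Exec(ctx, "DROP USER test_user"); err != nil { // placeholder user
		txA.Rollback(ctx)
		return err
	}
	if err := txA.Commit(ctx); err != nil {
		return err
	}

	// Re-check from B's still-open transaction.
	var n int
	if err := txB.QueryRow(ctx,
		"SELECT count(*) FROM pg_user WHERE usename = 'test_user'").Scan(&n); err != nil {
		return err
	}
	fmt.Printf("dropped user rows still visible in concurrent txn: %d\n", n)
	return nil
}
```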

It's a mess - I think it's using serializable isolation, but idk.

> The only recommendation about transaction retry I found mentions deadlock as a secondary error one should retry on, in https://www.postgresql.org/docs/current/mvcc-serialization-failure-handling.html (with serialization failure being the primary error that should trigger a transaction retry).

In Redshift, the only exception we can catch is OTHERS, which matches every kind of error.
Also, because Redshift doesn't support sub-transactions, you can't actually retry inside the procedure unless you use a NONATOMIC procedure (one where every statement runs in its own transaction).
If I try to use a retry loop with an explicit transaction in a NONATOMIC procedure, I get this on deadlock:

```
ERROR:  CONTINUE HANDLER cannot be invoked from within an explicit transaction block
```

So we have to retry from the client. I pushed a change to retry in Go code, with some logic copied from your pgbk retry logic - only for the deadlock/serialization failure codes, since those are realistically the retryable codes we'll hit.
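
The retry has roughly this shape (a sketch, not the PR's actual code; the backoff constants are made up):

```go
import (
	"context"
	"errors"
	"math/rand"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

// isRetryable reports whether err is a deadlock (SQLSTATE 40P01) or a
// serialization failure (SQLSTATE 40001), the two codes worth retrying here.
func isRetryable(err error) bool {
	var pgErr *pgconn.PgError
	if !errors.As(err, &pgErr) {
		return false
	}
	return pgErr.Code == "40P01" || pgErr.Code == "40001"
}

// withRetry runs fn up to maxAttempts times, backing off with jitter between
// attempts; non-retryable errors are returned immediately.
func withRetry(ctx context.Context, maxAttempts int, fn func(context.Context) error) error {
	backoff := 100 * time.Millisecond // made-up base; tune for real use
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(ctx); err == nil || !isRetryable(err) {
			return err
		}
		// Full jitter so concurrent retries don't collide again in lockstep.
		delay := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	return err
}
```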

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch 2 times, most recently from dc90de3 to 82a09a0 Compare June 25, 2024 00:20
@GavinFrazar (Contributor, Author):

@greedy52 since I added retry logic for the teardown code, please take another look.

Two review threads on lib/srv/db/postgres/users.go (resolved).
@greedy52 (Contributor):

> @greedy52 since I added retry logic for the teardown code, please take another look.

Should we retry on the activation as well?

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch from 82a09a0 to 4ec91d4 Compare June 25, 2024 19:09
@GavinFrazar (Contributor, Author):

> Should we retry on the activation as well?

done

@GavinFrazar (Contributor, Author):

@espadolini this is ready for review again - I've added retries

Review thread on e2e/aws/redshift_test.go (resolved).
Review thread on lib/srv/db/postgres/users.go (resolved).
@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch from 4ec91d4 to 23eb1c2 Compare July 2, 2024 00:47
@GavinFrazar GavinFrazar enabled auto-merge July 2, 2024 00:55
@GavinFrazar GavinFrazar added this pull request to the merge queue Jul 2, 2024
Merged via the queue into master with commit f2c2e42 Jul 2, 2024
40 checks passed
@GavinFrazar GavinFrazar deleted the gavinfrazar/fix-redshift-deadlock branch July 2, 2024 01:24
@public-teleport-github-review-bot

@GavinFrazar See the table below for backport results.

| Branch     | Result    |
|------------|-----------|
| branch/v14 | Failed    |
| branch/v15 | Failed    |
| branch/v16 | Create PR |

GavinFrazar added a commit that referenced this pull request Jul 9, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jul 10, 2024
* add redshift cluster e2e tests (#41195)

* test redshift cluster iam role and existing db user iam auth
* test redshift cluster auto user provisioning
* add error checking in test cleanup

* fix redshift auto user deadlocking (#43335)