
fix redshift auto user deadlocking #43335

Merged: 1 commit into master from gavinfrazar/fix-redshift-deadlock, Jul 2, 2024

Conversation

@GavinFrazar (Contributor) commented Jun 21, 2024:

Changelog: fixed a Redshift auto-user deactivation/deletion failure that occurs when one user is created or deleted while another is deactivated concurrently.

Fixes:

This should fix our e2e test failures, I think; I can't reproduce the failures after this change.
Despite the issue name, it's not a flaky-test issue.

Context

Our e2e test for Redshift seemed to be flaky, but it was actually catching a legit concurrency bug (well, not really a bug - deadlocks are somewhat inevitable under serializable isolation):

{"caller":"postgres/engine.go:142","component":"db:service","db":"ci-database-e2e-tests-redshif
t-cluster-us-west-2-307493967395","error":"ERROR: deadlock detected (SQLSTATE 40P01)","id":"c5f
031d4-1c88-4231-b9b5-0ff070b02e8f","level":"error","message":"Failed to teardown auto user.","t
imestamp":"2024-05-27T14:21:37-07:00"}

I determined that the culprit is likely the deactivation and deletion scripts acquiring the same table locks in different orders. The Teleport cluster semaphore we acquire is per-user, so with two users in the tests (auto_keep/auto_drop), the bug can happen. I reproduced this on my own Redshift cluster even with only one test running at a time, so it wasn't other tests interfering.

From experimentation, the following table locks are acquired by various operations in our SQL procedures:

format: [first locked table oid -> second locked table oid] (statement responsible)

[1260 -> 16980] (create user)
[1260 -> 4771]  (drop user)
[1260]          (alter user)
[4775 -> 4771]  (grant role)
[4775 -> 4771]  (revoke role)

* activate script:
  * new user:
    [1260 -> 16980] (create user)
    [4775 -> 4771]  (grant roles)
  * reactivation:
    [calls deactivate] (deactivate the user)
    [1260]             (alter user connection limit)
    [4775 -> 4771]     (grant roles)

* deactivate script:
    [4775 -> 4771] (revoke roles)
    [1260]         (alter user connection limit)

* delete script:
    [1260 -> 4771]     (drop user)
    [calls deactivate] (fallback to deactivating the user if deletion fails)
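
One way to observe this ordering (not necessarily how the data above was gathered) is to poll Redshift's STV_LOCKS system table from a second session while a procedure runs. A rough Go sketch with a placeholder DSN; the column names follow the STV_LOCKS documentation:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	// Placeholder DSN: point this at the cluster under test.
	conn, err := pgx.Connect(ctx, "postgres://awsuser@example-cluster:5439/dev")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// STV_LOCKS lists table locks currently held; polling it while a
	// procedure runs in another session shows the acquisition order.
	rows, err := conn.Query(ctx,
		"SELECT table_id, lock_owner_pid, lock_status FROM stv_locks ORDER BY last_update")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var tableID, pid int64
		var status string
		if err := rows.Scan(&tableID, &pid, &status); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("table %d held by pid %d: %s\n", tableID, pid, status)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```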

Procedures always run in a transaction, and locks are held until the transaction either commits or rolls back.
What I think was happening:

  1. deactivate script revokes roles (acquires the 4775 lock, then the 4771 lock)
  2. delete script drops its user (acquires 1260, then blocks on the locked 4771)
  3. deactivate script tries to alter the user conn limit (blocks on the locked 1260)
  4. now we have a deadlock: deactivate holds 4771 and waits on 1260, while delete holds 1260 and waits on 4771

What this PR does

The fix is to alter the user conn limit first, so that deactivate acquires the locks in the order [1260 -> 4771], the same order as the delete script. This also brings it in line with the acquisition order for creating a user.
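
As a sketch of the idea (an illustrative inline version, not the PR's actual stored procedure; it assumes Redshift's REVOKE ROLE syntax and uses pgx):

```go
// deactivateUser sketches the reordered deactivation: ALTER USER first
// (locks 1260/pg_shadow), then REVOKE ROLE (locks 4775 -> 4771), so the
// overall order matches the delete path's [1260 -> 4771].
func deactivateUser(ctx context.Context, conn *pgx.Conn, user string, roles []string) error {
	tx, err := conn.Begin(ctx)
	if err != nil {
		return err
	}
	defer tx.Rollback(ctx) // no-op if the commit below succeeds

	// Take the pg_shadow lock (1260) first, same as CREATE/DROP USER.
	alter := "ALTER USER " + pgx.Identifier{user}.Sanitize() + " CONNECTION LIMIT 0"
	if _, err := tx.Exec(ctx, alter); err != nil {
		return err
	}
	// Only then touch the role tables (4775 -> 4771).
	for _, role := range roles {
		revoke := "REVOKE ROLE " + pgx.Identifier{role}.Sanitize() +
			" FROM " + pgx.Identifier{user}.Sanitize()
		if _, err := tx.Exec(ctx, revoke); err != nil {
			return err
		}
	}
	return tx.Commit(ctx)
}
```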

This does not fix deadlocking in the general case, where external users/apps run queries concurrently with our procedures; that requires us to implement retries. The same probably goes for Postgres and MySQL. Postgres supports row-level locking, which Redshift doesn't - maybe that's why it doesn't fail there, or maybe it has a different default isolation level 🤷. I am not going to try to fix the general case in this PR; the goal here is to make our code not deadlock itself.

I added some logging that I found helpful for debugging, and I updated the e2e test to install the procedures in a randomized schema for each test, to prevent other tests from redefining a procedure before we call it.

  • Note that test failures may still occur until this PR is backported, since older-branch tests can still acquire the locks in a different order and deadlock a master-branch test run, though this is far less likely given the timing needed to hit the deadlock.

Edit: retry on deactivate/delete has been added as well. This affects only Postgres/Redshift databases.

@espadolini (Contributor) left a comment:

This fix seems incredibly fragile, shouldn't we be retrying deadlocked transactions? And if so, what's the problem?

Review thread on lib/srv/db/postgres/sql/redshift-activate-user.sql (outdated, resolved).
@greedy52 (Contributor) left a comment:

The part I don't understand is why the deactivate/delete scripts don't deadlock with the activation script, which does revoke role -> alter user -> grant role.

Too bad Redshift does not support pg_advisory_lock, which would be a cool, easy fix.

Not sure how LOCK works in Redshift. Seems you can explicitly call LOCK in the transaction on the list of tables in a specific order to avoid deadlock? Say LOCK tb1, tb2, tb3. (I did not test.)

Review thread on e2e/aws/redshift_test.go (resolved).
@GavinFrazar (Contributor, Author) commented Jun 21, 2024:

> The part I don't understand is why the deactivate/delete scripts don't deadlock with the activation script, which does revoke role -> alter user -> grant role.

The reactivation case calls the deactivate function first, so it acquires and holds all the locks in the same order as deactivate before doing anything else; those locks are not released after the deactivate call.
Prior to this PR it probably could deadlock between activate and delete, as a consequence of deactivate deadlocking with delete, but that case wasn't happening in tests because the timing is just very unlikely, I guess. User activation and teardown are done in parallel for the two users under test, so we just don't see one user being created while another is torn down.

> Seems you can explicitly call LOCK in the transaction on the list of tables in a specific order to avoid deadlock? Say LOCK tb1, tb2, tb3. (I did not test.)

I tried doing this, but even as the db master user I could not lock the relevant tables - these oids correspond to pg_shadow, pg_identity, and a couple of others.
There's also no SELECT ... FOR UPDATE in Redshift.
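
For reference, the oid-to-name mapping can be checked with a catalog query, assuming Redshift exposes those entries in pg_class the way Postgres does (a hypothetical helper, same pgx imports as the earlier sketch):

```go
// lookupCatalogNames resolves the table oids seen in the lock traces
// (1260, 4771, 4775, 16980) to relation names via pg_class.
func lookupCatalogNames(ctx context.Context, conn *pgx.Conn) (map[uint32]string, error) {
	rows, err := conn.Query(ctx,
		"SELECT oid, relname FROM pg_class WHERE oid IN (1260, 4771, 4775, 16980)")
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	names := make(map[uint32]string)
	for rows.Next() {
		var oid uint32
		var relname string
		if err := rows.Scan(&oid, &relname); err != nil {
			return nil, err
		}
		names[oid] = relname
	}
	return names, rows.Err()
}
```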

> This fix seems incredibly fragile, shouldn't we be retrying deadlocked transactions? And if so, what's the problem?

As a fix for the general problem of deadlocking, yeah, it's not a good one. It does prevent Teleport from deadlocking itself though, which is a nice "optimization", and it makes our failing e2e tests stop randomly deadlocking.

tbh I've avoided doing anything with retries because I wasn't sure how to go about it - do we need some backoff logic? Should it be done in the script or in Go? If you have advice here, even from a plain-Postgres perspective, please share and I'll add retries.

@espadolini (Contributor):

I think we should have at least a small handful of retries (and the transaction doing the user manipulation should probably run with serializable isolation?). I don't know if plpgsql is powerful enough to do that on its own or if we should do it from the client.

@espadolini (Contributor):

The only recommendation about transaction retry I found mentions deadlock as a secondary error one should retry on, in https://www.postgresql.org/docs/current/mvcc-serialization-failure-handling.html (with serialization failure being the primary error that should trigger a transaction retry).

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch from 1e3305a to 1af3bb4 Compare June 25, 2024 00:02
@GavinFrazar (Contributor, Author) commented Jun 25, 2024:

> and the transaction doing the user manipulation should probably run with serializable isolation

It's not clear to me what isolation level these procedures are using; the Redshift docs seem to contradict themselves:

  • They say that databases use serializable isolation for all transactions by default.
  • Elsewhere they say that snapshot isolation (equivalent to repeatable read in Postgres, I think) is the default.

Then the BEGIN reference (https://docs.aws.amazon.com/redshift/latest/dg/r_BEGIN.html) says:

> Though you can use any of the four transaction isolation levels, Amazon Redshift processes all isolation levels as serializable.

Then there's this, from the serializable isolation docs (https://docs.aws.amazon.com/redshift/latest/dg/c_serial_isolation.html):

> System catalog tables (PG) and other Amazon Redshift system tables (STL and STV) are not locked in a transaction. Therefore, changes to database objects that arise from DDL and TRUNCATE operations are visible on commit to any concurrent transactions.
> ...
> Transactions for updates to these tables run in a read committed isolation mode. PG-prefix catalog tables don't support snapshot isolation.

But that's a lie: it does lock, the lock blocks other create/drop user operations, and changes are not visible on commit. I tested this by opening two transactions, dropping a user in one and committing, then checking users again; the user still appeared to exist in the other transaction.
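
Roughly the check described above (a sketch with a placeholder DSN and user name, using the same pgx imports as the earlier sketches):

```go
// visibilityCheck opens two sessions: it starts a transaction on each,
// drops a user and commits on one, then re-reads pg_user from the
// transaction still open on the other. If the dropped user is still
// visible there, catalog changes are not "visible on commit to any
// concurrent transactions" as the docs claim.
func visibilityCheck(ctx context.Context, dsn string) error {
	a, err := pgx.Connect(ctx, dsn)
	if err != nil {
		return err
	}
	defer a.Close(ctx)

	b, err := pgx.Connect(ctx, dsn)
	if err != nil {
		return err
	}
	defer b.Close(ctx)

	// Open B's transaction before the drop commits, and take a snapshot.
	txB, err := b.Begin(ctx)
	if err != nil {
		return err
	}
	defer txB.Rollback(ctx)
	if _, err := txB.Exec(ctx, "SELECT count(*) FROM pg_user"); err != nil {
		return err
	}

	// Drop the user in a transaction on A and commit it.
	txA, err := a.Begin(ctx)
	if err != nil {
		return err
	}
	if _, err := txA.Exec(ctx, "DROP USER test_user"); err != nil { // placeholder user
		txA.Rollback(ctx)
		return err
	}
	if err := txA.Commit(ctx); err != nil {
		return err
	}

	// Re-check from B's still-open transaction.
	var n int
	if err := txB.QueryRow(ctx,
		"SELECT count(*) FROM pg_user WHERE usename = 'test_user'").Scan(&n); err != nil {
		return err
	}
	fmt.Printf("dropped user rows still visible in concurrent txn: %d\n", n)
	return nil
}
```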

It's a mess - I think it's using serializable isolation, but idk.

> The only recommendation about transaction retry I found mentions deadlock as a secondary error one should retry on, in https://www.postgresql.org/docs/current/mvcc-serialization-failure-handling.html (with serialization failure being the primary error that should trigger a transaction retry).

In Redshift, the only exception we can catch is OTHERS, which matches every kind of error.
Also, because Redshift doesn't support sub-transactions, you can't actually retry inside the procedure unless you use a NONATOMIC procedure (one where every statement runs in its own transaction).
If I try to use a retry loop with an explicit transaction in a NONATOMIC procedure, I get this on deadlock:

```
ERROR:  CONTINUE HANDLER cannot be invoked from within an explicit transaction block
```

So we have to retry from the client. I pushed a change to retry in Go code, with some logic copied from your pgbk retry logic - only for the deadlock/serialization failure codes, since those are realistically the retryable codes we'll hit.
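
The retry has roughly this shape (a sketch, not the PR's actual code; the backoff constants are made up):

```go
import (
	"context"
	"errors"
	"math/rand"
	"time"

	"github.com/jackc/pgx/v5/pgconn"
)

// isRetryable reports whether err is a deadlock (SQLSTATE 40P01) or a
// serialization failure (SQLSTATE 40001), the two codes worth retrying here.
func isRetryable(err error) bool {
	var pgErr *pgconn.PgError
	if !errors.As(err, &pgErr) {
		return false
	}
	return pgErr.Code == "40P01" || pgErr.Code == "40001"
}

// withRetry runs fn up to maxAttempts times, backing off with jitter between
// attempts; non-retryable errors are returned immediately.
func withRetry(ctx context.Context, maxAttempts int, fn func(context.Context) error) error {
	backoff := 100 * time.Millisecond // made-up base; tune for real use
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(ctx); err == nil || !isRetryable(err) {
			return err
		}
		// Full jitter so concurrent retries don't collide again in lockstep.
		delay := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	return err
}
```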

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch 2 times, most recently from dc90de3 to 82a09a0 Compare June 25, 2024 00:20
@GavinFrazar (Contributor, Author):

@greedy52 since I added retry logic for the teardown code, please take another look.

Two review threads on lib/srv/db/postgres/users.go (resolved).
@greedy52 (Contributor):

> @greedy52 since I added retry logic for the teardown code, please take another look.

Should we retry on the activation as well?

@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch from 82a09a0 to 4ec91d4 Compare June 25, 2024 19:09
@GavinFrazar (Contributor, Author):

> Should we retry on the activation as well?

done

@GavinFrazar (Contributor, Author):

@espadolini this is ready for review again - I've added retries

Review thread on e2e/aws/redshift_test.go (resolved).
Review thread on lib/srv/db/postgres/users.go (resolved).
@GavinFrazar GavinFrazar force-pushed the gavinfrazar/fix-redshift-deadlock branch from 4ec91d4 to 23eb1c2 Compare July 2, 2024 00:47
@GavinFrazar GavinFrazar enabled auto-merge July 2, 2024 00:55
@GavinFrazar GavinFrazar added this pull request to the merge queue Jul 2, 2024
Merged via the queue into master with commit f2c2e42 Jul 2, 2024
40 checks passed
@GavinFrazar GavinFrazar deleted the gavinfrazar/fix-redshift-deadlock branch July 2, 2024 01:24
@public-teleport-github-review-bot

@GavinFrazar See the table below for backport results.

| Branch     | Result    |
|------------|-----------|
| branch/v14 | Failed    |
| branch/v15 | Failed    |
| branch/v16 | Create PR |

GavinFrazar added a commit that referenced this pull request Jul 9, 2024
github-merge-queue bot pushed a commit that referenced this pull request Jul 10, 2024
* add redshift cluster e2e tests (#41195)

* test redshift cluster iam role and existing db user iam auth
* test redshift cluster auto user provisioning
* add error checking in test cleanup

* fix redshift auto user deadlocking (#43335)