Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoiding Postgres transaction deadlocks #268

Open
zeeshanakram3 opened this issue Dec 4, 2023 · 5 comments
Open

Avoiding Postgres transaction deadlocks #268

zeeshanakram3 opened this issue Dec 4, 2023 · 5 comments

Comments

@zeeshanakram3
Copy link
Contributor

zeeshanakram3 commented Dec 4, 2023

Problem

From the Postgres documentation, why the deadlocks occur:

Consider the case in which two concurrent transactions modify a table. The first transaction executes:

UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111;

This acquires a row-level lock on the row with the specified account number. Then, the second transaction executes:

UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 11111;

The first UPDATE statement successfully acquires a row-level lock on the specified row, so it succeeds in updating that row. However, the second UPDATE statement finds that the row it is attempting to update has already been locked, so it waits for the transaction that acquired the lock to complete. Transaction two is now waiting on transaction one to complete before it continues execution. Now, transaction one executes:

UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222;

Transaction one attempts to acquire a row-level lock on the specified row, but it cannot: transaction two already holds such a lock. So it waits for transaction two to complete. Thus, transaction one is blocked on transaction two, and transaction two is
blocked on transaction one: a deadlock condition. PostgreSQL will detect this situation and abort one of the transactions.

How to fix deadlocks

The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order. In the example above, if both transactions had updated the rows in the same order, no deadlock would have occurred.

e.g. based on the above example, the following order of transactions execution won't cause any deadlock

-- Transaction 1
BEGIN;
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 11111; 

-- Transaction 2
BEGIN;
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 11111; 

-- Transaction 1
UPDATE accounts SET balance = balance - 100.00 WHERE acctnum = 22222; 
COMMIT;

-- Transaction 2
UPDATE accounts SET balance = balance + 100.00 WHERE acctnum = 22222;
COMMIT;

Avoiding deadlocks in Orion code

  1. Identify all the processes/applications that are concurrently updating the Postgres DB
  2. Ensure that the queries those concurrent applications execute have all the entity updates in the same order (i.e. ascending)

For example, we know that at least two different connections are responsible for concurrent updates,

  1. The processor that sends a bunch of update requests to DB in a single transaction
  2. The updateVideoRelevanceValue function that periodically updates the video_relevance scores for all videos.

So taking the example of only the video table, when the processor submits a transaction to the DB to update all the videos (updates caused due to processing of mappings over N blocks), we need to ensure that both transactions - the processor transaction and the video relevance transaction - has the video entities updates in the ascending order of their IDs.

@ignazio-bovo
Copy link

ignazio-bovo commented Dec 4, 2023

There are several instances of deadlock that I have occurred for a deadlock:

Conflicting updates on overlay flush

error

This suggest a possible conflict between two update made during the overlay flush operation, in which internally the EntityManager save operation is called. Reading the message it says that it's happening when 2 TX are updating the same row 446881 on the video table and they want to update their Share lock into an Exclusive lock. This upgrade cannot be controlled via Typeorm (it could be mitigated by having a TX committing one row at a time instead of multiple row as above as recommended here)
But the most direct solution would be to catch the deadlock error and retry like it is done here on the typeorm documentation (https://typeorm.io/transactions)

Conflicting updates on the updateVideoRelevanceValue loop

Video relavence is update via transaction here


This is a big transaction that relies on postgres autocommit feature (i.e. transaction query is not wrapped in BEGING + COMMIT statements). I tried to add them to the and re run the synch process and I am getting weird errors like

image

Without them appears to work smoothly but it's not reliable as I have seen the deadlock error happened on production deployment (but I haven't save the screeshot).
Hence this update loop is quite hard to debug and unpredictable, thus:

I solution could be adding a separate table
(videoId, videoRelevance) in order to READ from the video table and WRITE on the relevance table
A same reasoning could be applied to the comment relevance update loop. In this way the competing transactions actually operates on two different rows when updating the same video id

@ignazio-bovo
Copy link

ignazio-bovo commented Dec 4, 2023

Overlay

The overlay feature was provided as a in memory cache to avoid constantly querying the database in order to speed up the processor synching and block processing.
We had so far several issue with concurrency with the overlay, but they are not directly related to the overlay functionality per se, but more on how it handles concurrent transactions via the entity manager instance.
There's no clean way to turn it off / on and at this point we should live with it, as I believe it is useful and it is just a matter of retouching the flush operations via transactional statements with proper isolation level
But at this point it is extremely inconvenient to abandon the overlay

@zeeshanakram3
Copy link
Contributor Author

zeeshanakram3 commented Dec 4, 2023

I think that for this type of information in order to simplify everything it could be worth adding a separate table
(videoId, videoRelevance) in order to READ from the video table and WRITE on the relevance table
A same reasoning could be applied to the comment relevance update loop

I think doing schema change does not really either address the problem at its code or fix the problem, it only moves the possibility of deadlock happening from video to new video_relevance table (If two concurrent processes try to update the video_relevance table) .

The solution I provided (as well as suggested in the Postgres official documentation) The best defense against deadlocks is generally to avoid them by being certain that all applications using a database acquire locks on multiple objects in a consistent order. is the best way to avoid deadlock scenarios from ever happening I think.

image

So coming to the actual solution, I see that the processor's save transaction already tries to submit the UPDATE queries in ascending order of the IDs (see the above screenshot). Now only change you have to do this to to make sure the updateVideoRelevanceValue also tries to submit the update video relevance score in the same order. This is how the updated function should look like:

async updateVideoRelevanceValue(em: EntityManager, forceUpdateAll?: boolean) {
 if (this.videosToUpdate.size || forceUpdateAll) {
   const [
     newnessWeight,
     viewsWeight,
     commentsWeight,
     reactionsWeight,
     [joystreamTimestampWeight, ytTimestampWeight] = [7, 3],
     defaultChannelWeight,
   ] = await config.get(ConfigVariable.RelevanceWeights, em);
   const channelWeight = defaultChannelWeight ?? 1;
   await em.query(`
     WITH weighted_timestamp AS (
       SELECT 
         "video"."id",
         (
           extract(epoch from video.created_at)*${joystreamTimestampWeight} +
           COALESCE(extract(epoch from video.published_before_joystream), extract(epoch from video.created_at))*${ytTimestampWeight}
         ) / ${joystreamTimestampWeight} + ${ytTimestampWeight} as wtEpoch,
         "channel"."channel_weight" as CW
       FROM 
         "video" 
         INNER JOIN
           "channel" ON "video"."channel_id" = "channel"."id"
       ${
         forceUpdateAll
           ? ''
           : `WHERE "video"."id" IN (${[...this.videosToUpdate.values()]
               .map((id) => `'${id}'`)
               .join(', ')})`
       }
+        ORDER BY "video"."id" -- Added ORDER BY clause to sort by ID
     )
     UPDATE 
       "video"
     SET
       "video_relevance" = ROUND(
         (
           (extract(epoch from now()) - wtEpoch) / ${NEWNESS_SECONDS_DIVIDER} * ${newnessWeight * -1} +
           (views_num * ${viewsWeight}) +
           (comments_count * ${commentsWeight}) +
           (reactions_count * ${reactionsWeight})
         ) * COALESCE(CW, ${channelWeight}),
         2)
     FROM
       weighted_timestamp
     WHERE
       "video".id = weighted_timestamp.id
   `);
   this.videosToUpdate.clear();
 }
}

@ignazio-bovo
Copy link

ignazio-bovo commented Dec 5, 2023

The ORDER BY clause is not allowed in the postgresql (at least the way you placed it)

The issue here, as I mentioned, are two:

Potential conflicts in between Overlay save operation on entities

In the screenshot presented there are 2 TX both holding a shared lock on the videoId highlited. This is happening basically:

  1. Tx1 acquires Shared lock (i.e. read only lock)
  2. Tx2 acquires Shared lock
  3. Tx1 wants to upgrade to Execute lock (i.e. write lock), thus it doesn't release the Shared lock but it waits for all the other Shared locks to be released
  4. Tx2 wants to upgrade to Execute lock (i.e. write lock), thus it doesn't release the Shared lock but it waits for all the other Shared locks to be released

The result is that Tx1 and Tx2 are waiting for each other.
There's not really much we can do about this, unless having explicitly in the typeorm code deadlock detection (i.e. a try-catch block) with a retry policy

Concurrency with Upgrade query

The concurrency here is given by :
a. Overlay that wants to update the video table
b. Update query that wants to update the video table just for the relevancy score

What I am suggesting is that we split the two tables so that Overlay operates on any table except the relevancy score one and the Update loop query operates solely on the relevancy score.

This should solve any concurrency issues between overlay and update query, which is what we are trying to solve here

@zeeshanakram3
Copy link
Contributor Author

The ORDER BY clause is not allowed in the postgresql (at least the way you placed it)

Correct, I have updated the query

Potential conflicts in between Overlay save operation on entities
...thus it doesn't release the Shared lock but it waits for all the other Shared locks to be released

I think not entirely accurate since as many transactions can hold Shared lock(read only locks)

You are right that the conflict is happening b/w Overlay and Video relevance score update transactions, call them Tx1 and Tx2, respectively. Now on my attempt to explaining how adding ORDER BY should fix the deadlock, say Tx1 would update videos Id [1, 10] while Tx2 would update [10, 1] (basically same videos but in different order)

  1. Tx1 starts, it wants to acquire row exclusive lock on video ID 1, there is no other write only lock on the 1 so it acquires the lock successfully and makes the update
  2. Tx2 starts, it wants to acquire row exclusive lock on video ID 10, there is no other write only lock on the 10 so it acquires the lock successfully and makes the update
  3. Tx1 wants to acquires row exclusive lock on video ID 10, but Tx2 already holds lock on that row, so Tx1 will wait for Tx2 to complete and release the lock
  4. Tx2 wants to acquires row exclusive lock on video ID 1, but Tx1 already holds lock on that row, so Tx2 will wait for Tx1 to complete and release the lock

Now, you see that after step 4, both these transaction are waiting for each other (i.e. cyclic dependency), so the Postgres will detect that and abort one of the transaction.

There's not really much we can do about this, unless having explicitly in the typeorm code deadlock detection (i.e. a try-catch block) with a retry policy

This is not entirely true, we can prevent any deadlock scenario from happening in the first place. Considering above example, if the Tx2 had updated the rows in the same order as Tx1. Tx2 would actually never have started in the first place till Tx1 was done

The concurrency here is given by :
a. Overlay that wants to update the video table
b. Update query that wants to update the video table just for the relevancy score

Correct.

What I am suggesting is that we split the two tables so that Overlay operates on any table except the relevancy score one and the Update loop query operates solely on the relevancy score.

Yeah, this should work, but I don't think we should employ that, as this seems like an anti-pattern and this is not usually deadlock scenarios are fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants