Exploration to reduce deadlocks #589

pushchris · 2024-12-21T20:58:13Z

Simple change to potentially reduce gap locks by not having to re-select records for update

leobarcellos · 2024-12-25T20:56:30Z

@pushchris I tried this PR, but I had also merged some other PRs (#585, #586, #587, #588) before updating.

On the performance view, the queue count kept getting higher and higher.

I tried raising the Redis concurrency (10 -> 25), but with no success.

Then I reverted our stack to the last PR before the update (#581). Weirdly, the queued task count dropped to 3k while queue throughput remained the same (maybe it was a bug in the performance view).

Really not sure what happened.

Btw, I will increase server capacity (EC2 and RDS) later to see if it mitigates some of these errors; the EC2 and RDS instances are hovering around 80% CPU.

leobarcellos · 2024-12-25T21:55:21Z

Well, looks like it went back to 'normal'.
Gonna try updating back to the recent version.

Btw: I haven't scaled up the servers yet.

leobarcellos · 2024-12-25T22:12:40Z

At least for now the job throughput is normalized, it should process everything soon.

(I still don't know why the queue size went back to 1M+ the moment I updated to the recent version; maybe there is indeed a bug in the performance view.)

leobarcellos · 2024-12-26T19:15:30Z

Well, today is even weirder: 3M tasks, sometimes with good throughput, sometimes with bad.

I will revert it back again.

leobarcellos · 2024-12-26T19:33:15Z

Reverted and boom, magically the 3.7M tasks in the queue dropped to 28k -- throughput went back to normal as well.

pushchris · 2024-12-27T03:42:34Z

@leobarcellos my guess is that the list evaluation queue is literally never getting handled because of the new priorities: your workers never reach an empty queue state, so they never get to it. I've reverted that in this PR. The more I think about it, the less helpful having different priorities seems; it will mostly just cause issues. We could potentially re-evaluate moving everything to a lower-priority queue and then selectively moving jobs to high priority, but even that could cause lots of issues that are hard for folks to immediately identify. See if the latest commit to this PR stops the issue from happening for you.
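
To make the starvation guess above concrete, here is a toy model of strict-priority scheduling. It is only a sketch under stated assumptions: the function name, the two queue names, and the per-tick arrival and capacity numbers are all hypothetical and are not taken from Parcelvoy's queue code. It assumes workers always serve the high-priority queue first and give the low-priority queue only leftover capacity.

```typescript
// Toy model only: queue names, rates, and this function are hypothetical,
// not Parcelvoy's actual queue implementation.
type QueueName = "high" | "low";

interface SimulationResult {
  processed: Record<QueueName, number>;
  remaining: Record<QueueName, number>;
}

function simulateStrictPriority(
  ticks: number,
  highArrivalsPerTick: number,
  lowArrivalsPerTick: number,
  workerCapacityPerTick: number,
): SimulationResult {
  const remaining: Record<QueueName, number> = { high: 0, low: 0 };
  const processed: Record<QueueName, number> = { high: 0, low: 0 };

  for (let tick = 0; tick < ticks; tick++) {
    // New jobs arrive on both queues every tick.
    remaining.high += highArrivalsPerTick;
    remaining.low += lowArrivalsPerTick;

    let capacity = workerCapacityPerTick;

    // Strict priority: the high-priority queue is always served first.
    const fromHigh = Math.min(capacity, remaining.high);
    remaining.high -= fromHigh;
    processed.high += fromHigh;
    capacity -= fromHigh;

    // The low-priority queue only gets whatever capacity is left over.
    const fromLow = Math.min(capacity, remaining.low);
    remaining.low -= fromLow;
    processed.low += fromLow;
  }

  return { processed, remaining };
}
```

In this model, when high-priority arrivals match or exceed worker capacity, the low-priority queue is never served and its backlog grows without bound, even though overall throughput looks healthy. That is one way a queue count can keep climbing while workers stay busy.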

leobarcellos · 2024-12-27T16:24:43Z

@pushchris Thanks!

Updated it ~20min ago, looks good to me.

By the way, the last deadlock was still on Dec 25 at 00h, so it's been 60+ hours without new deadlocks. But to be honest, I don't know if that's 100% related to this PR, since I went back and forth with this PR and still didn't get a new deadlock. I think it was the parameter tweaking and the index update on journey_user_stat.

pushchris · 2024-12-28T21:45:10Z

So bizarre on that index. In theory there is already an index on journey_id that the query optimizer can use, so it shouldn't make a difference at all. I added it to this PR and removed the merging bit. Would appreciate you giving it a whirl to see if any deadlocks come back. How many journeys do you have active in total?

pushchris · 2024-12-31T19:08:31Z

@leobarcellos how have things been looking? Still no deadlocks since that addition?

leobarcellos · 2025-01-02T15:49:24Z

@pushchris Hey there! Thanks for the update.

So, no more deadlocks on journey_user_stat. Since then I've received 3 or 4 deadlock errors on campaign_sends, like the one below, which I think is fine. I saw that you changed failedStalled to update in chunks of 25; is the insertion done in chunks as well?

About journeys: we do have a lot, since there are many projects. Right now we have 165 journeys that are published and don't have the "deleted_at" column set. The journey_user_step auto-increment is at 47M and the data length is 7.2G. Would it be good practice to do some sort of cleanup?

TRANSACTION 1962290683, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1128, 4 row lock(s)
MySQL thread id 572, OS thread handle 70398108542288, query id 343407676 172.30.6.37 overwatch update
insert ignore into `campaign_sends` (`campaign_id`, `send_at`, `state`, `user_id`) values (6081, '2024-12-31 12:00:00.000', 'pending', 432591), (6081, '2024-12-31 12:00:00.000', 'pending', 224919), (6081, '2024-12-31 12:00:00.000', 'pending', 355639), (6081, '2024-12-31 12:00:00.000', 'pending', 139211), (6081, '2024-12-31 12:00:00.000', 'pending', 459972), (6081, '2024-12-31 12:00:00.000', 'pending', 467801), (6081, '2024-12-31 12:00:00.000', 'pending', 103469), (6081, '2024-12-31 12:00:00.000', 'pending', 118135), (6081, '2024-12-31 12:00:00.000', 'pending', 467907), (6081, '2024-12-31 12:00:00.000', 'pending', 467848), (6081, '2024-12-31 12:00:00.000', 'pending', 301150), (6081, '2024-12-31 12:00:00.000', 'pending', 206261), (6081, '2024-12-31 12:00:00.000', 'pending', 108947), (6081, '2024-12-31 12:00:00.000', 'pending', 446684), (6081, '2024-12-31 12:00:00.000', 'pending', 468353), (6081, '2024-12-31 12:00:00.000', 'pending', 432605), (6081, '2024-12-31 12:00:00.000', 'pending', 459317), (6081, '2024-12-31 12:00:00.000', 'pending', 468309), (6081, '2024-12-31 12:00:00.000', 'pending', 128524), (6081, '2024-12-31 12:00:00.000', 'pending', 123728), (6081, '2024-12-31 12:00:00.000', 'pending', 135775), (6081, '2024-12-31 12:00:00.000', 'pending', 124162), (6081, '2024-12-31 12:00:00.000', 'pending', 224904), (6081, '2024-12-31 12:00:00.000', 'pending', 468527), (6081, '2024-12-31 12:00:00.000', 'pending', 453676), (6081, '2024-12-31 12:00:00.000', 'pending', 124042), (6081, '2024-12-31 12:00:00.000', 'pending', 452337), (6081, '2024-12-31 12:00:00.000', 'pending', 464192), (6081, '2024-12-31 12:00:00.000', 'pending', 466637), (6081, '2024-12-31 12:00:00.000', 'pending', 454985), (6081, '2024-12-31 12:00:00.000', 'pending', 469129), (6081, '2024-12-31 12:00:00.000', 'pending', 224927), (6081, '2024-12-31 12:00:00.000', 'pending', 224945), (6081, '2024-12-31 12:00:00.000', 'pending', 469637), (6081, '2024-12-31 12:00:00.000', 'pending', 467372), (6081, '2024-12-31 
12:00:00.000', 'pending', 118347), (6081, '2024-12-31 12:00:00.000', 'pending', 207124), (6081, '2024-12-31 12:00:00.000', 'pending', 355637), (6081, '2024-12-31 12:00:00.000', 'pending', 108929), (6081, '2024-12-31 12:00:00.000', 'pending', 123812), (6081, '2024-12-31 12:00:00.000', 'pending', 136672), (6081, '2024-12-31 12:00:00.000', 'pending', 136584), (6081, '2024-12-31 12:00:00.000', 'pending', 128867), (6081, '2024-12-31 12:00:00.000', 'pending', 136201), (6081, '2024-12-31 12:00:00.000', 'pending', 103509), (6081, '2024-12-31 12:00:00.000', 'pending', 95433), (6081, '2024-12-31 12:00:00.000', 'pending', 260450), (6081, '2024-12-31 12:00:00.000', 'pending', 136332), (6081, '2024-12-31 12:00:00.000', 'pending', 136745), (6081, '2024-12-31 12:00:00.000', 'pending', 135761), (6081, '2024-12-31 12:00:00.000', 'pending', 207346), (6081, '2024-12-31 12:00:00.000', 'pending', 135745), (6081, '2024-12-31 12:00:00.000', 'pending', 136687), (6081, '2024-12-31 12:00:00.000', 'pending', 1087
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290683 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290683 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
TRANSACTION 1962290684, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1128, 4 row lock(s)
MySQL thread id 755, OS thread handle 70370664527184, query id 343407677 172.30.7.141 overwatch update
insert ignore into `campaign_sends` (`campaign_id`, `send_at`, `state`, `user_id`) values (6082, '2025-01-01 12:00:00.000', 'pending', 103538), (6082, '2025-01-01 12:00:00.000', 'pending', 128551), (6082, '2025-01-01 12:00:00.000', 'pending', 124095), (6082, '2025-01-01 12:00:00.000', 'pending', 118020), (6082, '2025-01-01 12:00:00.000', 'pending', 123851), (6082, '2025-01-01 12:00:00.000', 'pending', 128593), (6082, '2025-01-01 12:00:00.000', 'pending', 128711), (6082, '2025-01-01 12:00:00.000', 'pending', 128970), (6082, '2025-01-01 12:00:00.000', 'pending', 103556), (6082, '2025-01-01 12:00:00.000', 'pending', 128844), (6082, '2025-01-01 12:00:00.000', 'pending', 124347), (6082, '2025-01-01 12:00:00.000', 'pending', 124165), (6082, '2025-01-01 12:00:00.000', 'pending', 124281), (6082, '2025-01-01 12:00:00.000', 'pending', 124040), (6082, '2025-01-01 12:00:00.000', 'pending', 136625), (6082, '2025-01-01 12:00:00.000', 'pending', 128909), (6082, '2025-01-01 12:00:00.000', 'pending', 128798), (6082, '2025-01-01 12:00:00.000', 'pending', 128929), (6082, '2025-01-01 12:00:00.000', 'pending', 128644), (6082, '2025-01-01 12:00:00.000', 'pending', 124322), (6082, '2025-01-01 12:00:00.000', 'pending', 68711), (6082, '2025-01-01 12:00:00.000', 'pending', 103647), (6082, '2025-01-01 12:00:00.000', 'pending', 543918), (6082, '2025-01-01 12:00:00.000', 'pending', 124421), (6082, '2025-01-01 12:00:00.000', 'pending', 124301), (6082, '2025-01-01 12:00:00.000', 'pending', 128652), (6082, '2025-01-01 12:00:00.000', 'pending', 124397), (6082, '2025-01-01 12:00:00.000', 'pending', 103431), (6082, '2025-01-01 12:00:00.000', 'pending', 578676), (6082, '2025-01-01 12:00:00.000', 'pending', 124270), (6082, '2025-01-01 12:00:00.000', 'pending', 124329), (6082, '2025-01-01 12:00:00.000', 'pending', 128507), (6082, '2025-01-01 12:00:00.000', 'pending', 124172), (6082, '2025-01-01 12:00:00.000', 'pending', 108766), (6082, '2025-01-01 12:00:00.000', 'pending', 118310), (6082, '2025-01-01 
12:00:00.000', 'pending', 124220), (6082, '2025-01-01 12:00:00.000', 'pending', 128499), (6082, '2025-01-01 12:00:00.000', 'pending', 124382), (6082, '2025-01-01 12:00:00.000', 'pending', 123975), (6082, '2025-01-01 12:00:00.000', 'pending', 124263), (6082, '2025-01-01 12:00:00.000', 'pending', 128841), (6082, '2025-01-01 12:00:00.000', 'pending', 123726), (6082, '2025-01-01 12:00:00.000', 'pending', 124425), (6082, '2025-01-01 12:00:00.000', 'pending', 207422), (6082, '2025-01-01 12:00:00.000', 'pending', 108824), (6082, '2025-01-01 12:00:00.000', 'pending', 591551), (6082, '2025-01-01 12:00:00.000', 'pending', 123952), (6082, '2025-01-01 12:00:00.000', 'pending', 103381), (6082, '2025-01-01 12:00:00.000', 'pending', 103321), (6082, '2025-01-01 12:00:00.000', 'pending', 103411), (6082, '2025-01-01 12:00:00.000', 'pending', 136000), (6082, '2025-01-01 12:00:00.000', 'pending', 118639), (6082, '2025-01-01 12:00:00.000', 'pending', 124003), (6082, '2025-01-01 12:00:00.000', 'pending', 1243
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290684 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290684 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
----------------------- END OF LOG ----------------------
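
On the question above about inserting in chunks, a minimal sketch of what chunked, consistently ordered inserts could look like is shown here. Everything in it is an assumption for illustration: the `CampaignSend` shape simply mirrors the columns in the logged INSERT, and `chunk` and `toOrderedChunks` are hypothetical helpers, not functions from this PR. Keeping batches small and touching rows in the same order each time are general MySQL recommendations for reducing deadlocks; whether they help with this particular supremum lock conflict has not been verified.

```typescript
// Hypothetical row shape, mirroring the columns in the logged INSERT.
interface CampaignSend {
  campaign_id: number;
  user_id: number;
  state: string;
  send_at: string;
}

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("chunk size must be a positive integer");
  }
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Sort by (campaign_id, user_id) so concurrent writers touch rows in the
// same order, then split into small batches so each statement is short.
function toOrderedChunks(
  rows: readonly CampaignSend[],
  size: number,
): CampaignSend[][] {
  const sorted = [...rows].sort(
    (a, b) => a.campaign_id - b.campaign_id || a.user_id - b.user_id,
  );
  return chunk(sorted, size);
}
```

Each returned batch would then be passed to a single insert statement. The batch size of 25 used for failedStalled is one possible starting point, not a tuned value.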

Commits in this PR:

011e93f · Exploration to reduce deadlocks
773d4a1 · Remove priority
2d9ae3a · Adds index tweak

pushchris mentioned this pull request on Dec 21, 2024 in #584 (open): "KnexTimeoutError: Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?"
