Exploration to reduce deadlocks #589

Open
wants to merge 3 commits into
base: main

pushchris
Contributor

A simple change that may reduce gap locks by no longer re-selecting records for update.

@leobarcellos
Contributor

@pushchris I tried this PR, but I also merged some other PRs (#585, #586, #587, #588) before updating.

In the performance view, the queue size kept climbing:

Screenshot 2024-12-25 at 5 02 23 PM Screenshot 2024-12-25 at 5 28 48 PM

I tried raising the Redis concurrency (10 -> 25), but with no success.

Then I reverted our stack to the last PR before the update (#581). Weirdly, the number of queued tasks dropped to 3k while queue throughput stayed the same. I'm not sure what happened; maybe it was a bug in the performance view.

Screenshot 2024-12-25 at 5 49 19 PM

Really not sure what happened.

Btw, I will increase server capacity (EC2 and RDS) later to see if that mitigates some of these errors. Both the EC2 and RDS instances are hovering around 80% CPU.

@leobarcellos
Contributor

leobarcellos commented Dec 25, 2024

Well, looks like it went back to 'normal'.
I'm going to try updating back to the recent version.

Screenshot 2024-12-25 at 6 54 25 PM

Btw: I haven't scaled up the servers yet.

@leobarcellos
Contributor

At least for now the job throughput is back to normal, so it should process everything soon.

(I still don't know why the queue size went back to 1M+ the moment I updated to the recent version. Maybe there really is a bug in the performance view.)

Screenshot 2024-12-25 at 6 59 32 PM Screenshot 2024-12-25 at 7 10 55 PM

@leobarcellos
Contributor

Well, today is even weirder: 3M tasks, sometimes with good throughput, sometimes with bad.

Screenshot 2024-12-26 at 4 14 33 PM

I will revert it again.

@leobarcellos
Contributor

Screenshot 2024-12-26 at 4 32 22 PM

Reverted and boom, the 3.7M tasks in the queue magically dropped to 28k, and throughput went back to normal as well.

@pushchris
Contributor Author

@leobarcellos my guess is that the list evaluation queue is literally never getting handled because of the new priorities: your workers never reach an empty state on the higher-priority queues, so they never get to it. I've reverted that in this PR. The more I think about it, the less helpful different priorities seem, and they will mostly just cause issues. We could potentially re-evaluate moving everything to a lower-priority queue and then selectively moving jobs to high priority, but even that could cause lots of issues that are hard for folks to identify immediately. See if the latest commit to this PR stops the issue from happening for you.
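
The starvation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it models a worker pool that always drains a "high" queue before touching a "low" one, and the function name, parameters, and numbers are hypothetical, not Parcelvoy's actual queue code.

```typescript
// Hypothetical model of strict-priority scheduling; names are illustrative only.
type Priority = "high" | "low";

interface SimulationResult {
  processed: Record<Priority, number>;
  remaining: Record<Priority, number>;
}

// Each tick, `arrivalsPerTick` high-priority jobs arrive and the workers
// process up to `capacityPerTick` jobs, always emptying the high queue
// before taking anything from the low queue.
function simulateStrictPriority(
  ticks: number,
  arrivalsPerTick: number,
  capacityPerTick: number,
  initialLow: number,
): SimulationResult {
  const remaining: Record<Priority, number> = { high: 0, low: initialLow };
  const processed: Record<Priority, number> = { high: 0, low: 0 };

  for (let t = 0; t < ticks; t++) {
    remaining.high += arrivalsPerTick;
    let budget = capacityPerTick;
    for (const p of ["high", "low"] as const) {
      const n = Math.min(budget, remaining[p]);
      remaining[p] -= n;
      processed[p] += n;
      budget -= n;
    }
  }
  return { processed, remaining };
}

// Saturated workers: high-priority arrivals use the whole capacity,
// so the 50 low-priority jobs are never touched.
const saturated = simulateStrictPriority(100, 10, 10, 50);

// With spare capacity (2 jobs per tick), the low queue drains.
const withHeadroom = simulateStrictPriority(100, 8, 10, 50);

console.log(saturated.remaining.low, withHeadroom.remaining.low); // prints "50 0"
```

In this model the low-priority queue only makes progress when there is spare capacity, which would match servers running near their limits never getting to the list evaluation jobs.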

@leobarcellos
Contributor

@pushchris Thanks!

Updated it ~20 min ago, looks good to me:

Screenshot 2024-12-27 at 1 21 58 PM

By the way, the last deadlock was still the one on Dec 25 at 00h, so it's been 60+ hours without new deadlocks. To be honest, I don't know if that is 100% related to this PR, since I went back and forth with this PR and still didn't get a new deadlock. I think it was the parameter tweaking and the index update on journey_user_stat.

@pushchris
Contributor Author

So bizarre on that index. In theory there is already an index on journey_id that the query optimizer can use, so it shouldn't make a difference at all. I added it to this PR and removed the merging bit. Would appreciate you giving it a whirl to see if any deadlocks come back. How many journeys do you have active in total?

@pushchris
Contributor Author

@leobarcellos how have things been looking? Still no deadlocks since that addition?

@leobarcellos
Contributor

@pushchris Hey there! Thanks for the update.

So, no more deadlocks found on journey_user_stat. Since then I have received 3 or 4 deadlock errors on campaign_sends, like the ones below, which I think is fine. I saw that you changed failedStalled to update in chunks of 25. Is the insertion done in chunks as well?

About journeys: we do have a lot, since there are many projects. Right now we have 165 journeys that are published and don't have the "deleted_at" column set. The journey_user_step auto increment is at 47M and the data length is 7.2G. Would it be good practice to have some sort of cleanup?

TRANSACTION 1962290683, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1128, 4 row lock(s)
MySQL thread id 572, OS thread handle 70398108542288, query id 343407676 172.30.6.37 overwatch update
insert ignore into `campaign_sends` (`campaign_id`, `send_at`, `state`, `user_id`) values (6081, '2024-12-31 12:00:00.000', 'pending', 432591), (6081, '2024-12-31 12:00:00.000', 'pending', 224919), (6081, '2024-12-31 12:00:00.000', 'pending', 355639), (6081, '2024-12-31 12:00:00.000', 'pending', 139211), (6081, '2024-12-31 12:00:00.000', 'pending', 459972), (6081, '2024-12-31 12:00:00.000', 'pending', 467801), (6081, '2024-12-31 12:00:00.000', 'pending', 103469), (6081, '2024-12-31 12:00:00.000', 'pending', 118135), (6081, '2024-12-31 12:00:00.000', 'pending', 467907), (6081, '2024-12-31 12:00:00.000', 'pending', 467848), (6081, '2024-12-31 12:00:00.000', 'pending', 301150), (6081, '2024-12-31 12:00:00.000', 'pending', 206261), (6081, '2024-12-31 12:00:00.000', 'pending', 108947), (6081, '2024-12-31 12:00:00.000', 'pending', 446684), (6081, '2024-12-31 12:00:00.000', 'pending', 468353), (6081, '2024-12-31 12:00:00.000', 'pending', 432605), (6081, '2024-12-31 12:00:00.000', 'pending', 459317), (6081, '2024-12-31 12:00:00.000', 'pending', 468309), (6081, '2024-12-31 12:00:00.000', 'pending', 128524), (6081, '2024-12-31 12:00:00.000', 'pending', 123728), (6081, '2024-12-31 12:00:00.000', 'pending', 135775), (6081, '2024-12-31 12:00:00.000', 'pending', 124162), (6081, '2024-12-31 12:00:00.000', 'pending', 224904), (6081, '2024-12-31 12:00:00.000', 'pending', 468527), (6081, '2024-12-31 12:00:00.000', 'pending', 453676), (6081, '2024-12-31 12:00:00.000', 'pending', 124042), (6081, '2024-12-31 12:00:00.000', 'pending', 452337), (6081, '2024-12-31 12:00:00.000', 'pending', 464192), (6081, '2024-12-31 12:00:00.000', 'pending', 466637), (6081, '2024-12-31 12:00:00.000', 'pending', 454985), (6081, '2024-12-31 12:00:00.000', 'pending', 469129), (6081, '2024-12-31 12:00:00.000', 'pending', 224927), (6081, '2024-12-31 12:00:00.000', 'pending', 224945), (6081, '2024-12-31 12:00:00.000', 'pending', 469637), (6081, '2024-12-31 12:00:00.000', 'pending', 467372), (6081, '2024-12-31 
12:00:00.000', 'pending', 118347), (6081, '2024-12-31 12:00:00.000', 'pending', 207124), (6081, '2024-12-31 12:00:00.000', 'pending', 355637), (6081, '2024-12-31 12:00:00.000', 'pending', 108929), (6081, '2024-12-31 12:00:00.000', 'pending', 123812), (6081, '2024-12-31 12:00:00.000', 'pending', 136672), (6081, '2024-12-31 12:00:00.000', 'pending', 136584), (6081, '2024-12-31 12:00:00.000', 'pending', 128867), (6081, '2024-12-31 12:00:00.000', 'pending', 136201), (6081, '2024-12-31 12:00:00.000', 'pending', 103509), (6081, '2024-12-31 12:00:00.000', 'pending', 95433), (6081, '2024-12-31 12:00:00.000', 'pending', 260450), (6081, '2024-12-31 12:00:00.000', 'pending', 136332), (6081, '2024-12-31 12:00:00.000', 'pending', 136745), (6081, '2024-12-31 12:00:00.000', 'pending', 135761), (6081, '2024-12-31 12:00:00.000', 'pending', 207346), (6081, '2024-12-31 12:00:00.000', 'pending', 135745), (6081, '2024-12-31 12:00:00.000', 'pending', 136687), (6081, '2024-12-31 12:00:00.000', 'pending', 1087
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290683 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290683 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
TRANSACTION 1962290684, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 7 lock struct(s), heap size 1128, 4 row lock(s)
MySQL thread id 755, OS thread handle 70370664527184, query id 343407677 172.30.7.141 overwatch update
insert ignore into `campaign_sends` (`campaign_id`, `send_at`, `state`, `user_id`) values (6082, '2025-01-01 12:00:00.000', 'pending', 103538), (6082, '2025-01-01 12:00:00.000', 'pending', 128551), (6082, '2025-01-01 12:00:00.000', 'pending', 124095), (6082, '2025-01-01 12:00:00.000', 'pending', 118020), (6082, '2025-01-01 12:00:00.000', 'pending', 123851), (6082, '2025-01-01 12:00:00.000', 'pending', 128593), (6082, '2025-01-01 12:00:00.000', 'pending', 128711), (6082, '2025-01-01 12:00:00.000', 'pending', 128970), (6082, '2025-01-01 12:00:00.000', 'pending', 103556), (6082, '2025-01-01 12:00:00.000', 'pending', 128844), (6082, '2025-01-01 12:00:00.000', 'pending', 124347), (6082, '2025-01-01 12:00:00.000', 'pending', 124165), (6082, '2025-01-01 12:00:00.000', 'pending', 124281), (6082, '2025-01-01 12:00:00.000', 'pending', 124040), (6082, '2025-01-01 12:00:00.000', 'pending', 136625), (6082, '2025-01-01 12:00:00.000', 'pending', 128909), (6082, '2025-01-01 12:00:00.000', 'pending', 128798), (6082, '2025-01-01 12:00:00.000', 'pending', 128929), (6082, '2025-01-01 12:00:00.000', 'pending', 128644), (6082, '2025-01-01 12:00:00.000', 'pending', 124322), (6082, '2025-01-01 12:00:00.000', 'pending', 68711), (6082, '2025-01-01 12:00:00.000', 'pending', 103647), (6082, '2025-01-01 12:00:00.000', 'pending', 543918), (6082, '2025-01-01 12:00:00.000', 'pending', 124421), (6082, '2025-01-01 12:00:00.000', 'pending', 124301), (6082, '2025-01-01 12:00:00.000', 'pending', 128652), (6082, '2025-01-01 12:00:00.000', 'pending', 124397), (6082, '2025-01-01 12:00:00.000', 'pending', 103431), (6082, '2025-01-01 12:00:00.000', 'pending', 578676), (6082, '2025-01-01 12:00:00.000', 'pending', 124270), (6082, '2025-01-01 12:00:00.000', 'pending', 124329), (6082, '2025-01-01 12:00:00.000', 'pending', 128507), (6082, '2025-01-01 12:00:00.000', 'pending', 124172), (6082, '2025-01-01 12:00:00.000', 'pending', 108766), (6082, '2025-01-01 12:00:00.000', 'pending', 118310), (6082, '2025-01-01 
12:00:00.000', 'pending', 124220), (6082, '2025-01-01 12:00:00.000', 'pending', 128499), (6082, '2025-01-01 12:00:00.000', 'pending', 124382), (6082, '2025-01-01 12:00:00.000', 'pending', 123975), (6082, '2025-01-01 12:00:00.000', 'pending', 124263), (6082, '2025-01-01 12:00:00.000', 'pending', 128841), (6082, '2025-01-01 12:00:00.000', 'pending', 123726), (6082, '2025-01-01 12:00:00.000', 'pending', 124425), (6082, '2025-01-01 12:00:00.000', 'pending', 207422), (6082, '2025-01-01 12:00:00.000', 'pending', 108824), (6082, '2025-01-01 12:00:00.000', 'pending', 591551), (6082, '2025-01-01 12:00:00.000', 'pending', 123952), (6082, '2025-01-01 12:00:00.000', 'pending', 103381), (6082, '2025-01-01 12:00:00.000', 'pending', 103321), (6082, '2025-01-01 12:00:00.000', 'pending', 103411), (6082, '2025-01-01 12:00:00.000', 'pending', 136000), (6082, '2025-01-01 12:00:00.000', 'pending', 118639), (6082, '2025-01-01 12:00:00.000', 'pending', 124003), (6082, '2025-01-01 12:00:00.000', 'pending', 1243
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290684 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
RECORD LOCKS space id 49 page no 154162 n bits 328 index PRIMARY of table `parcelvoy`.`campaign_sends` trx id 1962290684 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
0: len 8; hex 73757072656d756d; asc supremum;;
----------------------- END OF LOG ----------------------
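
The chunking question above, together with the view that occasional deadlocks on campaign_sends are acceptable, could be sketched as follows. This is a hedged illustration, not the project's implementation: `chunk`, `withDeadlockRetry`, `insertInChunks`, and the `insertBatch` callback are hypothetical names, and the error check assumes a driver that exposes MySQL's deadlock error (1213, `ER_LOCK_DEADLOCK`) through `errno` or `code` fields on the thrown error.

```typescript
// Split rows into fixed-size batches so each INSERT touches fewer rows.
function chunk<T>(items: readonly T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// MySQL reports the transaction chosen as deadlock victim with error 1213.
const ER_LOCK_DEADLOCK = 1213;

// Assumption: the driver surfaces `errno` and/or `code` on the error object.
function isDeadlock(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { errno?: unknown; code?: unknown };
  return e.errno === ER_LOCK_DEADLOCK || e.code === "ER_LOCK_DEADLOCK";
}

// Re-run `op` when it fails with a deadlock, backing off a little more
// on each attempt. Any other error, or the final failure, is rethrown.
async function withDeadlockRetry<T>(
  op: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 50,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      if (!isDeadlock(err) || attempt >= maxAttempts) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
}

// Insert rows batch by batch; `insertBatch` is a caller-supplied
// (hypothetical) function that runs one INSERT for the given rows.
async function insertInChunks<T>(
  rows: readonly T[],
  size: number,
  insertBatch: (batch: T[]) => Promise<void>,
): Promise<void> {
  for (const batch of chunk(rows, size)) {
    await withDeadlockRetry(() => insertBatch(batch));
  }
}
```

A caller would pass the pending send rows, a chunk size such as 25, and a function that performs the actual `insert ignore` for one batch. Retrying is safe here only because a deadlocked statement is rolled back and `insert ignore` is idempotent for rows that already exist.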
