Discussion: auto-inc lock deadlocks? #27

Open
tiegz opened this issue Feb 25, 2013 · 22 comments

@tiegz
Contributor

tiegz commented Feb 25, 2013

Curious if anyone else has experienced a deadlock during an LHM migration where, for instance, on a busy table Transaction 1 (a replace into ...) is waiting on Transaction 2 (an insert into ...) for the AUTO-INC lock, and the transactions get rolled back?

We've seen this once in the past, and the most obvious explanation seems to be that it's simply an auto-increment lock deadlock. However, in most cases you're copying the id field over from table to table, and that is typically an auto-increment field.

To fix this, one could remove the auto-increment attribute from that field before the LHM run and add it back to the new table's field right before the table switch. Another option would be to temporarily set innodb_autoinc_lock_mode = 2 in MySQL until the table switch.
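
A minimal sketch of the first workaround (innodb_autoinc_lock_mode can only be set at server startup, so only the ALTER TABLE half is shown; the lhmn_somethings copy table and the int(11) column type are assumptions for illustration):

conn = ActiveRecord::Base.connection

# Before starting the copy: drop AUTO_INCREMENT from the copy table's id column.
conn.execute("ALTER TABLE lhmn_somethings MODIFY `id` int(11) NOT NULL")

# ... run the chunked LHM copy ...

# Right before the table switch: restore AUTO_INCREMENT on the copy.
conn.execute("ALTER TABLE lhmn_somethings MODIFY `id` int(11) NOT NULL AUTO_INCREMENT")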

My question is, would this be a good configurable option for LHM, since it applies to a typical database setup?

@tiegz
Contributor Author

tiegz commented Feb 26, 2013

Ah, these two solutions didn't seem to work for me:

  1. Removing/re-adding the AUTO_INCREMENT didn't work, because re-adding it with something like ALTER TABLE lhmn_somethings MODIFY `id` int(11) NOT NULL AUTO_INCREMENT triggers a table rebuild.
  2. The innodb_autoinc_lock_mode setting is global AFAIK, so it affects all tables.

@grobie
Contributor

grobie commented Feb 26, 2013

We haven't encountered any deadlock scenarios so far. If your assumption is correct that this is related to auto-increment locks, that's likely because we run our MySQL cluster with innodb_autoinc_lock_mode = 2 and our replication in mixed mode, which means insert statements are replicated via row-based replication to ensure consistency.

Are you able to reproduce this scenario? Can you think of any changes to the generated queries that would avoid such deadlocks?
What do you think about a) documenting this problem and/or b) adding a check that aborts the migration if someone tries to run Lhm with a problematic configuration?
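
For b), a rough sketch of what such a guard could look like (hypothetical, not part of Lhm; it assumes the settings mentioned above are the "safe" ones and uses the ActiveRecord connection Lhm already works through):

# Hypothetical pre-flight check: refuse to run with statement-based binlogging
# or with an auto-increment lock mode other than "interleaved" (2).
def assert_safe_autoinc_config!(connection = ActiveRecord::Base.connection)
  lock_mode     = connection.select_value("SELECT @@innodb_autoinc_lock_mode").to_i
  binlog_format = connection.select_value("SELECT @@binlog_format").to_s

  if lock_mode != 2 || binlog_format == "STATEMENT"
    raise "Lhm: potentially deadlock-prone configuration " \
          "(innodb_autoinc_lock_mode=#{lock_mode}, binlog_format=#{binlog_format})"
  end
end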

@tiegz
Contributor Author

tiegz commented Feb 26, 2013

Ah, that's interesting that you set innodb_autoinc_lock_mode = 2! It seems like that should work for most apps... maybe I'll look at that as an option.

I haven't been able to reproduce the scenario at all :\ It's always happened on a specific large table, and only near the end of the migration.

I initially thought it might have something to do with those last few chunks being written to frequently because they're newer, but it turns out it's choking on records that are a couple hours old (and probably not being touched often during the migration).

I like the idea of documenting this, although it seems like we should hold off until someone else has the same problem. LHM has worked great otherwise, and I haven't found anything online from people with a similar issue yet.

@jacobwgillespie

Any updates on this? I'm experiencing the same error: https://app.getsentry.com/playlist/production/group/17127363/. It seems fairly reproducible. I have innodb_autoinc_lock_mode = 2 set.

@jacobwgillespie

Since Sentry seems to be changing that error page, this was the error:

ActiveRecord::StatementInvalid: Mysql2::Error: Deadlock found when trying to get lock; try restarting transaction: UPDATE `users` SET `last_sign_in_at` = '2014-03-25 21:33:10', `current_sign_in_at` = '2014-03-28 04:11:17', `sign_in_count` = 5, `updated_at` = '2014-03-28 04:11:17' WHERE `users`.`id` = 186955

https://app.getsentry.com/playlist/production/group/17127363/events/951399630/

@tiegz
Contributor Author

tiegz commented Mar 28, 2014

@JacobWG we dropped the table on which this was happening, but haven't seen it since then [yet].

@jacobwgillespie

Dropped and re-added, or is the table no longer necessary to your app? Like, did you fix the deadlock somehow or did it just become irrelevant?

@tiegz
Contributor Author

tiegz commented Mar 28, 2014

Nah, the table became irrelevant so we just got rid of it.

@jacobwgillespie

Okay - for now, I'm experimenting with https://github.com/qertoip/transaction_retry, so we'll see how far that gets.

@tiegz
Contributor Author

tiegz commented Mar 28, 2014

GL, hope it works well! That reminds me, we did end up monkey-patching LHM to retry deadlocks around the same time, but I'm not sure how often it's worked since then (this is also rescuing a different error than the gem you mentioned):

class Lhm::Chunker
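  # Wrap Chunker#update so InnoDB deadlock errors are retried a couple of times
  # before being re-raised; update_without_retries is the original #update,
  # provided by the alias_method_chain call below.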
  def update_with_retries(statements)
    retried = 0
    max_retries = 2

    begin
      update_without_retries(statements)
    rescue => e
      if e.message =~ /Deadlock found when trying to get lock/ && retried.to_i < max_retries
        puts "Caught and retrying error: #{e.message}"
        retried += 1
        sleep 1
        retry
      else
        raise e
      end
    end
  end
  alias_method_chain :update, :retries
end

@jacobwgillespie

Thanks for that! I may try it if the gem doesn't work out...

@eljuanchosf

@jacobwgillespie Hi!! I've been having the same issue. Did the transaction_retry gem solve your problem?

@jacobwgillespie

Yep, the gem's been working for me - I'm logging the warnings about transaction retries to the error monitor (Bugsnag) and the gem's taking care of them nicely.

@eljuanchosf

@jacobwgillespie Thank you so much for the quick response. :)

@jacobwgillespie

👍

@tillvollmer

Hi, we have the same problem:
ActiveRecord::StatementInvalid: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction: insert ignore into lhmn_idea_stubs (comment, icon, id, idea_id, idea_image_id, idea_parent_id, idea_rank, link, note, parent_id, rank, style, task_effort, task_from, task_resource, task_resource_id, task_until, title, user_id, x, y) select idea_stubs.comment, idea_stubs.icon, idea_stubs.id, idea_stubs.idea_id, idea_stubs.idea_image_id, idea_stubs.idea_parent_id, idea_stubs.idea_rank, idea_stubs.link, idea_stubs.note, idea_stubs.parent_id, idea_stubs.rank, idea_stubs.style, idea_stubs.task_effort, idea_stubs.task_from, idea_stubs.task_resource, idea_stubs.task_resource_id, idea_stubs.task_until, idea_stubs.title, idea_stubs.user_id, idea_stubs.x, idea_stubs.y from idea_stubs where idea_stubs.id between 87643613 and 87683612

It's not at the end of the migration (the table has 1.9 billion rows) and this happens around row 87 million. But the table is very heavily used. Data is only added and never changed, though entries are deleted.

We did test it on our test system with live data, so it works when there is no load on the table.

Any ideas what we can do? We will soon reach the integer boundary for our ID and need to migrate to bigint ;-)

@jacobwgillespie

If the transaction_retry gem doesn't work for you, you could always set up a parallel table and some MySQL triggers to propagate changes from your primary table to the new one while you're moving data over...

For an even more creative solution, you could try switching from InnoDB to TokuDB and take advantage of the hot schema updates to add a new column for the new ID and then rename it once it's ready to function as the ID (while data is migrating with an INSERT query, you can adjust your rails model with a before_save callback to make sure the new_id column gets populated).
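
A rough sketch of that callback, assuming the idea_stubs table from the previous comment and a new column literally named new_id (both names are placeholders here):

class IdeaStub < ActiveRecord::Base
  # Keep the hypothetical new_id column in sync for rows touched while the
  # backfill INSERT is still running.
  before_save :populate_new_id

  private

  def populate_new_id
    # id is nil for brand-new records at this point; those rows would be
    # picked up by the backfill query instead.
    self.new_id = id if id && new_id.blank?
  end
end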

@tillvollmer

Thank you.
I am trying transaction_retry right now; so far it works. 500 million rows in...
I just wonder why this even happens. To me a deadlock means that two transactions are running and each is waiting inside its transaction for data held by the other. Why is that happening here?

@MaxPower15

We ran into this issue when migrating a table with about 6,000,000 rows. The database globally gets about 25k calls per minute from the rails app. It's weighted toward reads, but the table in question has more frequent writes than any other in our database.

Here is the deadlock as saved from show engine innodb status\G (some info redacted):

------------------------
LATEST DETECTED DEADLOCK
------------------------
141210 19:00:49
*** (1) TRANSACTION:
TRANSACTION 683EA904E, ACTIVE 36 sec, process no 2463, OS thread id 140008101451520 setting auto-inc lock
mysql tables in use 2, locked 2
LOCK WAIT 23376 lock struct(s), heap size 2570224, 52414 row lock(s), undo log entries 31659
MySQL thread id 6776603, query id 45279258810 xx.xxx.xx.xxx wistia update
replace into `lhmn_table_name` (`account_id`, `contact_id`, `created_at`, `created_by_id`, `created_by_type`, `deleted_at`, `description`, `hashed_id`, `id`, `media_group_id`, `name`, `position`, `preview_id`, `project_id`, `protected`, `public_id`, `type`, `updated_at`) /* large hadron migration */
values (NEW.`account_id`, NEW.`contact_id`, NEW.`created_at`, NEW.`created_by_id`, NEW.`created_by_type`, NEW.`deleted_at`, NEW.`description`, NEW.`hashed_id`, NEW.`id`, NEW.`media_group_id`, NEW.`name`, NEW.`position`, NEW.`preview_id`, NEW.`project_id`, NEW.`protected`, NEW.`public_id`, NEW.`type`, NEW.`updated_at`)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
TABLE LOCK table `database_name`.`lhmn_table_name` trx id 683EA904E lock mode AUTO-INC waiting
*** (2) TRANSACTION:
TRANSACTION 683EA964E, ACTIVE 34 sec, process no 2463, OS thread id 140007875163904 inserting
mysql tables in use 2, locked 2
11 lock struct(s), heap size 3024, 498 row lock(s), undo log entries 492
MySQL thread id 6792689, query id 45279210114 xx.xxx.xxx.xxx wistia Sending data
insert ignore into `lhmn_table_name` (`account_id`, `contact_id`, `created_at`, `created_by_id`, `created_by_type`, `deleted_at`, `description`, `hashed_id`, `id`, `media_group_id`, `name`, `position`, `preview_id`, `project_id`, `protected`, `public_id`, `type`, `updated_at`) select table_name.`account_id`, table_name.`contact_id`, table_name.`created_at`, table_name.`created_by_id`, table_name.`created_by_type`, table_name.`deleted_at`, table_name.`description`, table_name.`hashed_id`, table_name.`id`, table_name.`media_group_id`, table_name.`name`, table_name.`position`, table_name.`preview_id`, table_name.`project_id`, table_name.`protected`, table_name.`public_id`, table_name.`type`, table_name.`updated_at` from `table_name` where table_name.`id` between 5140001 and 5145000
*** (2) HOLDS THE LOCK(S):
TABLE LOCK table `database_name`.`lhmn_table_name` trx id 683EA964E lock mode AUTO-INC
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 347 page no 25942 n bits 824 index `index_table_name_on_hashed_id` of table `database_name`.`lhmn_table_name` trx id 683EA964E lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 129 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 10; hex 76626d6b3934666b6a7a; asc vbmk94fkjz;;
 1: len 4; hex 801c374f; asc   7O;;

*** WE ROLL BACK TRANSACTION (2)

With the default stride of 40,000 and throttle of 0.1 seconds, we got this error (approximately; it's always on a different row) very quickly. With a stride of 5,000 and throttle of 0.8 seconds, it takes ~20 minutes on average before it occurs. It seems very random though, maybe based on user behavior. My understanding, possibly incorrect, is that the batch insert acquires a gap lock on the temporary table, and at the same time a trigger tries updating a row in that batch.

I am not an expert on this topic, but it seems like this Percona post explains the problem and some potential solutions: http://www.percona.com/blog/2012/03/27/innodbs-gap-locks/. However, our database runs at the REPEATABLE-READ isolation level, and we were unable to change this quickly because our slave replication setup (i.e. binlog_format=STATEMENT) requires MySQL consistency guarantees that cannot be had below that isolation level.

Worth noting that, when this deadlock occurs for us, it seems like it also holds a lock on the non-lhm table, which in turn crashes our app. I'm not sure why that happens, but it's scary and counter to lhm's core promise of safe non-blocking migrations.

Thank you to those above who recommended the transaction_retry gem and the lhm retry monkey patch. We'll give those a shot. If that fails, we may need to change our replication setup and isolation level for this migration.

I'm not sure lhm is doing anything wrong here, but it might be good to document the issue at least.

EDIT: The transaction_retry gem did not fix this for us unfortunately. We saw two deadlocks while performing the migration. It looks like the first one successfully recovered, and the second one hung the migration entirely and brought down the app. We are giving up on the retry strategy for now and are looking into changing both our replication and isolation level setups so that this can work.
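
For reference, a sketch of the two server-side changes mentioned above (variable names as of MySQL 5.5/5.6, run by a user with SUPER; only sessions opened afterwards pick up the new global values, and the replication topology has to tolerate row-based events):

conn = ActiveRecord::Base.connection

# Switch to row-based binlogging so a relaxed isolation level is still safe
# for replication.
conn.execute("SET GLOBAL binlog_format = 'ROW'")

# READ-COMMITTED takes far fewer gap locks during the chunked insert.
conn.execute("SET GLOBAL tx_isolation = 'READ-COMMITTED'")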

@arthurnn
Contributor

I believe this is fixed on master, as we now have code to retry when deadlocks are found: #93.

@meunierd

@arthurnn I don't think that PR actually addresses the deadlock issue. It only affects the switcher, and the LHMs discussed in this issue fail well before the table switch. Could this be re-opened?

@avit

avit commented Dec 23, 2015

See also: #107?

@arthurnn arthurnn reopened this Jan 4, 2016