Well-factored data migrations #88

joshmfrankel opened this issue Dec 2, 2024 · 0 comments


joshmfrankel commented Dec 2, 2024

Large dataset migrations are always an adventure. A well-factored data migration considers: Performance, Accuracy, Reversibility, and Repeatability. Each of these supports a successful data migration. Previously, I verified these by running the SQL multiple times against a before and after state. While effective, that gets tedious, which is where testing comes into play.

Performance

You might be thinking performance == speed from my heading. The speed at which a data migration runs can be important, but it is not the most important factor. Generally, a data migration is run once, during off-peak usage hours for an application. That is not exactly what I mean by performance.

Performance refers to overall application performance and health. For example, if running a data migration locks database tables or causes a high frequency of reads / writes, it can impact current users as a side effect. This makes it essential to write safe data migrations that minimize locking, reads, and writes.

Several techniques for safer migrations include (a minimal Ruby sketch follows the list):

  • Batch processing of multiple results
  • Bulk operations within batching
  • Working only with pertinent data by effectively using SQL WHERE clauses and database constraints
  • Delayed processing for sensitive performance concerns (e.g., slow batching)
  • Identifying efficient solutions by analyzing query planner cost and timing (e.g., EXPLAIN ANALYZE)
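
For illustration, here is a minimal Ruby sketch combining batching, bulk updates, scoping, and a simple delay. The User model, column names, batch size, and sleep interval are all hypothetical, not taken from the post:

# Hypothetical batched backfill: each batch becomes a single bulk UPDATE,
# the WHERE clause scopes work to only the rows that need it, and a short
# sleep between batches eases pressure on the database.
User.where(legacy_status: nil).in_batches(of: 1_000) do |batch|
  batch.update_all(legacy_status: "migrated")
  sleep(0.1)
end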

However, a data migration is no good if it doesn't accurately solve the underlying need.

Accuracy

What is the before and after state of the database? What records do you need to work with to complete the migration? How will you know this was successful?

These are all questions to ask during the data migration process. A well-factored data migration has the engineer understanding the “why” behind the use case. Did we add a new type of object users can interact with? Did the UI update a form field name, meaning we need to store existing and new data in it? Understanding why data moves between the before and after states helps inform how you build the migration.

Part of maintaining accuracy is ensuring you have a success metric. I always have some way of checking the dataset before and after the data migration runs to ensure all related records that require changing are changed correctly. This keeps the result accurate and catches any edge cases that were missed. To go further, checking for accuracy several days or weeks after the data migration runs can be useful. Metabase is a great tool for creating reports that indicate when data becomes out-of-sync.
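
A lightweight way to capture that success metric is to count the records still needing the change before and after the run. A minimal sketch, assuming a hypothetical users table where the new column is still nil for unmigrated rows:

# Hypothetical success metric: the count of unmigrated records should drop to
# zero (or an explainable number) after the data migration runs.
pending_before = User.where(b: nil).count

# ... run the data migration ...

pending_after = User.where(b: nil).count
puts "Backfilled #{pending_before - pending_after} users, #{pending_after} still pending"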

Reversibility

If the worst happens and running your data migration causes a severe issue, having a backup plan prepared can save you a lot of stress. It helps to ask questions such as:

  • What happens if this changes the wrong records?
  • What if it causes a database lock, or worse, a deadlock?

Knowing the worst-case scenario for a data migration can help inform your implementation strategy. Spending as much brainstorming time on a rollback strategy as on the migration itself is a good rule that I like to follow.

Now, not all data migrations are reversible. Whenever possible, I'd recommend crafting migrations to be reversible. Even migrations that seem irreversible can be made reversible by creating artifacts during the up phase of the migration. For example, storing the IDs of the affected records or taking a database snapshot before running the backfill.
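
As a sketch of that artifact idea, the up phase could record exactly which rows it touched so the down phase reverts only those records. The column names and file location here are hypothetical; a dedicated table or database snapshot may be a more durable artifact:

# Inside a data migration class
def up
  # Record the affected IDs as an artifact before changing anything
  affected_ids = User.where(legacy_status: nil).pluck(:id)
  File.write(Rails.root.join("tmp", "backfill_affected_ids.json"), affected_ids.to_json)

  User.where(id: affected_ids).in_batches.update_all(legacy_status: "migrated")
end

def down
  # Revert only the rows recorded during the up phase
  affected_ids = JSON.parse(File.read(Rails.root.join("tmp", "backfill_affected_ids.json")))
  User.where(id: affected_ids).in_batches.update_all(legacy_status: nil)
end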

Much like rolling back an application deployment, having a backup plan can be the difference between a blip of downtime and a multi-hour outage.

Repeatability

Being able to re-run the data migration multiple times without a multiplicative effect on the data makes the migration much easier to work with. This is also important when a missed edge case forces you to re-run the data migration, but you don't want to impact any of the records that were already migrated correctly.

Keeping your data migration idempotent is another way of framing repeatability. I should be able to run the migration many times and get a deterministic result, meaning predictable output.
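
As a quick illustration of the difference (the columns here are hypothetical):

# Not idempotent: every run multiplies the effect on the data
User.in_batches.update_all("login_count = login_count * 2")

# Idempotent: the scope skips rows that were already migrated, so running
# the migration again produces the same end state
User.where(legacy_status: nil).in_batches.update_all(legacy_status: "migrated")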

So what?

Now that I've gone through my philosophy of well-factored data migrations, how can we support it? Setting up a before state, running the data migration, and then checking for a successful after state works, but it becomes tedious if you have to reset the data each time. This is where testing comes in, as it directly solves this problem. Generally, we don't think of migrations as testable, and most really aren't. Data migrations, on the other hand, are the exception, and I believe they SHOULD be tested.

Using RSpec, we can write a simple data migration spec, which works by requiring the data migration files directly in our rails_helper. The example below comes from my use of the excellent data-migrate gem.

# db/data/123456789_backfill_new_column_on_users.rb
class BackfillNewColumnOnUsers < ActiveRecord::Migration[7.1] # match your Rails version
  def up
    # Backfills the new column "b" from the existing column "a"
    # (illustrative body; your real backfill logic goes here)
    User.in_batches.update_all("b = a")
  end

  def down
    # Reverts the backfill by clearing the new column
    User.in_batches.update_all(b: nil)
  end
end

# spec/data/backfill_new_column_on_users_spec.rb
require "rails_helper"

RSpec.describe BackfillNewColumnOnUsers do
  let(:old_value) { "Old Value" }
  let(:updated_value) { "New Value" }

  it "syncs column 'a' to the new column 'b' (accuracy)" do
    user = create(:user, a: updated_value, b: old_value)

    # Run the migration directly from the loaded migration files
    BackfillNewColumnOnUsers.new.up

    # Ensure record is up-to-date post-migration
    user.reload

    expect(user.b).to eq updated_value
  end

  it "maintains idempotence (repeatability)" do
    user = create(:user, a: updated_value, b: old_value)

    # Run twice to watch for errors, duplicate data, and invalid post-migration state
    BackfillNewColumnOnUsers.new.up
    BackfillNewColumnOnUsers.new.up

    # Ensure record is up-to-date post-migration
    user.reload

    expect(user.b).to eq updated_value
  end

  it "can be rolled back (reversibility)" do
    user = create(:user, a: updated_value, b: old_value)

    BackfillNewColumnOnUsers.new.up

    # Ensure record is up-to-date post-migration
    user.reload

    expect(user.b).to eq updated_value

    # Run the "down" portion of the migration
    BackfillNewColumnOnUsers.new.down

    # Ensure record is up-to-date post-rollback
    user.reload

    # Strict check
    expect(user.b).to be_nil

    # OR less strict
    expect(user.b).not_to eq updated_value
  end
end

# spec/rails_helper.rb
# Load all db/data classes
Dir[Rails.root.join('db/data/*.rb')].each { |f| require f }

Now, assuming you have a setup / teardown phase configured in RSpec that resets your database, you have functional spec coverage for Accuracy, Reversibility, and Repeatability. The idempotence example above is somewhat contrived, but it illustrates the idea of preventing a migration from causing side effects when repeated.

The two parts to remember from above are loading the migration and calling the migration's class.

# Loading migrations
Dir[Rails.root.join('db/data/*.rb')].each { |f| require f }

# Calling migrations
BackfillNewColumnOnUsers.new.up
BackfillNewColumnOnUsers.new.down

Performance is a bit trickier, as you'd need a dataset that is representative of production (ideally at 10x scale). For something like this, I'd recommend spec coverage that can be run on-demand instead of automatically. Fixtures and/or seeds can be beneficial here for generating a large volume of test data efficiently. You could then measure things like query cost and query planner steps to determine success or failure. An easy step you can take here is to generate the raw SQL you plan on running and run it with EXPLAIN ANALYZE on a follower database to see what the resulting cost and query plan look like.
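
One way to keep that coverage on-demand is to exclude it from normal runs and opt in explicitly. Here is a minimal sketch, assuming FactoryBot for data setup; the environment variable, tag name, dataset size, and time budget are illustrative assumptions rather than anything baked into data-migrate:

# spec/rails_helper.rb
RSpec.configure do |config|
  # Skip performance specs unless explicitly requested,
  # e.g. RUN_PERFORMANCE_SPECS=1 bundle exec rspec spec/data
  config.filter_run_excluding performance: true unless ENV["RUN_PERFORMANCE_SPECS"]
end

# spec/data/backfill_new_column_on_users_performance_spec.rb
require "rails_helper"
require "benchmark"

RSpec.describe BackfillNewColumnOnUsers, :performance do
  it "backfills a large dataset within a rough time budget" do
    # Hypothetical production-like volume generated via factories or seeds
    create_list(:user, 10_000, b: nil)

    elapsed = Benchmark.realtime { BackfillNewColumnOnUsers.new.up }

    expect(elapsed).to be < 30 # seconds; an arbitrary budget to tune per migration
  end
end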

If you are using data-migrate, I took an initial pass at adding first-class support for testing data migrations in this pull request. Hopefully, the proposed baked-in setup will help others efficiently craft their own test coverage.

Got any tips for working with large data migrations? Maybe a slick AI tool you know of? Let me know in the comments below to continue the conversation.
