How to Access Production Data in Rails Migrations

Sooner or later every project will need to migrate production data, not only the schema. There are many different approaches in the wild – most of them overly complex. In this article we’ll tackle the problem using vanilla Rails.

Sane Data Migrations

In order to migrate data effectively we need to overcome several challanges:

Accessing rows already present in the table.
Avoiding breaking the migration with model and schema changes.
Accessing rows from related tables.
Updating columns.

All this is seemingly trivial but most projects I’ve seen got at least one of these wrong. The method we’ll describe is a good starting point even if your project is facing additional challanges.

The gist of the idea: create Active Record models inside the migration to access the required tables. Let’s illustrate this with an example.

Imagine we’re working on a multi-user blog system. Posts are currently accessed by ID. We’d like to start accessing them by a user-friendly slug based on the post title and author. An example title might look like ruby-on-rails-rocks-by-greg-navis.

In order to migrate existing posts we need to do the following:

Add slug to posts. We must allow NULL values for now as existing posts lack slugs.
Provide slugs for existing posts.
Disallow NULLs in posts.slug.

Let’s take a look at the migration first and discuss why things are the way they are afterwards:

class AddSlugsToPosts < ActiveRecord::Migration[5.2]
  # First, we define migration models to manipulate the data.
  class Post < ActiveRecord::Base
    belongs_to :user
  end

  class User < ActiveRecord::Base
    has_many :posts
  end

  def up
    # Second, add the column and make it NULL-able.
    add_column :posts, :slug, :string, null: true

    # Third, ensure the new column is picked up by the model.
    Post.reset_column_information!

    Post.includes(:user).find_each do |post|
      # Fourth, we set the slug and save the post.
      post.slug = post_slug(post)
      post.save!
    end

    # Fifth, make the column non-NULL.
    change_colum_null :posts, :slug, false
  end

  def down
    remove_column :posts, :slug
  end

  def post_slug(post)
    # IMPORTANT: Copy the title-to-slug algorithm from the production model
    #            that was used AT THE TIME OF WRITING THE MIGRATION.
  end
end

This migration may raise several questions:

Why do we need to define models inside the migration?
What’s reset_column_information!?
Isn’t copying the slug algorithm to the migration a violation of the DRY principle?

Let’s tackle these questions one-by-one.

First, we absolutely must not use the production Post model as it would put us at risk of:

Unknowingly executing callbacks that can trigger API calls, create other records, etc.
Failing to save records due to validation errors.
Breaking the migration after deploying changes to Post that are incompatible with the migration. For example, renaming the model would break the migration.

It’s also important that using production models in migrations may prevent us from restoring a backup. For instance, imagine we make a backup on Monday, run the migration on Tuesday, and delete the Post model on Wednesday. Then on Thursday we need to restore Monday’s backup. We won’t be able to migrate to the most recent database schema as the Tuesday migration will fail because Post is no longer present in the code base.

I recommend naming models after tables. For example, if the Post model is backed up by the articles table then the migration should define Article instead of Post. We should focus on the schema and the data while writing the migration and any non-obvious mapping from tables to models is a distraction.

After defining migration models, we can manipulate the schema. The rules of safe and efficient migrations are a topic for another article. In the example above, we need to make the column nil-able before providing slugs for existing posts.

We also need to keep in mind that Active Record caches schema information so whenever we modify it we must force a refresh by calling .reset_column_information!. It’s best to always call it after modifying a table even if theoretically our migration would work without it. The call is effectively free and there’s one less thing for us to think about.

After adding the column, we iterate over existing posts and set their slugs. What may seem controversial here is that we copied post_slug to the migration. The reason for that is the same as the reason for having migration models – avoiding breaking migration with production code changes. As mentioned previously, this is especially important when restoring backups.

Last but not least, we are not violating DRY by replicating that code in the migration. It’s easy to see why if we spell out the definition of post_slug in the migration and the production model:

post_slug in Post is the algorithm we’re using now.
post_slug in the migration is the algorithm we were using at the time migration was written.

The fact that the two definitions coincide while we’re writing the migration doesn’t mean they’re identical. They are not and trying to DRY up code like that is a mistake.

What Not to Do

The method above is sufficient in most cases experienced in practice. Let’s discuss what not to do before closing the topic:

One-off rake tasks or scripts as they make it impossible to automate deployments and make it difficult to restore backups as we’d need to go through them again and run them and migrations in the right order. It doesn’t sound like something to do when restoring a backup.
Custom commands run from a Rails console as this is even worse than rake tasks. There’s no auditiability of the changes plus we won’t remember what we did 2 weeks ago when restoring a backup.
Separate data migration tracks (e.g. as implemented by data-migrate). They increase complexity without offering any benefits. In the end we need to run all the migrations in the right order and splitting them into multiple directories doesn’t help.

All these approaches require comparable effort, provide no benefit over our approach, and have serious downsides.

Summary

Migrating the data, as opposed to the schema, is a frequent problem that is often solved in overly complex ways. Using a regular migration with migration-specific model definitions allows us to use regular migrations, keep deployments automated and avoid complexity of third-party solutions.

I regularly post about Ruby, Ruby on Rails, PostgreSQL, and Hotwire.

Follow @gregnavis