How to Access Production Data in Rails Migrations
Sooner or later every project will need to migrate production data, not only the schema. There are many different approaches in the wild – most of them overly complex. In this article we’ll tackle the problem using vanilla Rails.
Sane Data Migrations
In order to migrate data effectively we need to overcome several challanges:
- Accessing rows already present in the table.
- Avoiding breaking the migration with model and schema changes.
- Accessing rows from related tables.
- Updating columns.
All this is seemingly trivial but most projects I’ve seen got at least one of these wrong. The method we’ll describe is a good starting point even if your project is facing additional challanges.
The gist of the idea: create Active Record models inside the migration to access the required tables. Let’s illustrate this with an example.
Imagine we’re working on a multi-user blog system. Posts are currently accessed by ID. We’d like to start accessing them by a user-friendly slug based on the post title and author. An example title might look like ruby-on-rails-rocks-by-greg-navis
.
In order to migrate existing posts we need to do the following:
- Add
slug
toposts
. We must allowNULL
values for now as existing posts lack slugs. - Provide slugs for existing posts.
- Disallow
NULL
s inposts.slug
.
Let’s take a look at the migration first and discuss why things are the way they are afterwards:
class AddSlugsToPosts < ActiveRecord::Migration[5.2]
# First, we define migration models to manipulate the data.
class Post < ActiveRecord::Base
belongs_to :user
end
class User < ActiveRecord::Base
has_many :posts
end
def up
# Second, add the column and make it NULL-able.
add_column :posts, :slug, :string, null: true
# Third, ensure the new column is picked up by the model.
Post.reset_column_information!
Post.includes(:user).find_each do |post|
# Fourth, we set the slug and save the post.
post.slug = post_slug(post)
post.save!
end
# Fifth, make the column non-NULL.
change_colum_null :posts, :slug, false
end
def down
remove_column :posts, :slug
end
def post_slug(post)
# IMPORTANT: Copy the title-to-slug algorithm from the production model
# that was used AT THE TIME OF WRITING THE MIGRATION.
end
end
This migration may raise several questions:
- Why do we need to define models inside the migration?
- What’s
reset_column_information!
? - Isn’t copying the slug algorithm to the migration a violation of the DRY principle?
Let’s tackle these questions one-by-one.
First, we absolutely must not use the production Post
model as it would put us at risk of:
- Unknowingly executing callbacks that can trigger API calls, create other records, etc.
- Failing to save records due to validation errors.
- Breaking the migration after deploying changes to
Post
that are incompatible with the migration. For example, renaming the model would break the migration.
It’s also important that using production models in migrations may prevent us from restoring a backup. For instance, imagine we make a backup on Monday, run the migration on Tuesday, and delete the Post
model on Wednesday. Then on Thursday we need to restore Monday’s backup. We won’t be able to migrate to the most recent database schema as the Tuesday migration will fail because Post
is no longer present in the code base.
I recommend naming models after tables. For example, if the Post
model is backed up by the articles
table then the migration should define Article
instead of Post
. We should focus on the schema and the data while writing the migration and any non-obvious mapping from tables to models is a distraction.
After defining migration models, we can manipulate the schema. The rules of safe and efficient migrations are a topic for another article. In the example above, we need to make the column nil
-able before providing slugs for existing posts.
We also need to keep in mind that Active Record caches schema information so whenever we modify it we must force a refresh by calling .reset_column_information!
. It’s best to always call it after modifying a table even if theoretically our migration would work without it. The call is effectively free and there’s one less thing for us to think about.
After adding the column, we iterate over existing posts and set their slugs. What may seem controversial here is that we copied post_slug
to the migration. The reason for that is the same as the reason for having migration models – avoiding breaking migration with production code changes. As mentioned previously, this is especially important when restoring backups.
Last but not least, we are not violating DRY by replicating that code in the migration. It’s easy to see why if we spell out the definition of post_slug
in the migration and the production model:
post_slug
inPost
is the algorithm we’re using now.post_slug
in the migration is the algorithm we were using at the time migration was written.
The fact that the two definitions coincide while we’re writing the migration doesn’t mean they’re identical. They are not and trying to DRY up code like that is a mistake.
What Not to Do
The method above is sufficient in most cases experienced in practice. Let’s discuss what not to do before closing the topic:
- One-off rake tasks or scripts as they make it impossible to automate deployments and make it difficult to restore backups as we’d need to go through them again and run them and migrations in the right order. It doesn’t sound like something to do when restoring a backup.
- Custom commands run from a Rails console as this is even worse than rake tasks. There’s no auditiability of the changes plus we won’t remember what we did 2 weeks ago when restoring a backup.
- Separate data migration tracks (e.g. as implemented by
data-migrate
). They increase complexity without offering any benefits. In the end we need to run all the migrations in the right order and splitting them into multiple directories doesn’t help.
All these approaches require comparable effort, provide no benefit over our approach, and have serious downsides.
Summary
Migrating the data, as opposed to the schema, is a frequent problem that is often solved in overly complex ways. Using a regular migration with migration-specific model definitions allows us to use regular migrations, keep deployments automated and avoid complexity of third-party solutions.
Enjoyed the article? Follow me on Twitter!
I regularly post about Ruby, Ruby on Rails, PostgreSQL, and Hotwire.