Database Performance: Deletion Timestamps and Indexing

Deletion timestamps are often used to implement soft-deletion by marking rows deleted without actually discarding them. While simple, this approach has consequences for index performance. Let’s understand the problem and how partial indexes help solve it.

Problem Statement

Suppose we’re working on a job board where hiring managers can post and delete job ads. One of the requirements is that deleting an ad should hide it from candidates, but the ad itself must be retained for archival, analytics, and other purposes. Deletion is entirely manual and is triggered by hiring managers. One way of meeting that requirement is adding a deleted_at column to the job_ads table and disallowing users from seeing ads with a non-NULL deleted_at.
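A minimal sketch of that approach, assuming a timestamptz column and an integer primary key (both hypothetical details of the schema):

```sql
-- Add the soft-deletion marker; NULL means the ad is active.
ALTER TABLE job_ads ADD COLUMN deleted_at timestamptz;

-- "Deleting" an ad merely stamps it:
UPDATE job_ads SET deleted_at = now() WHERE id = 42;

-- Candidate-facing queries must exclude deleted ads:
SELECT * FROM job_ads WHERE deleted_at IS NULL;
```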

That approach, though simple, has performance implications for queries on active job ads. Because the number of deleted job ads only ever grows, tables and indexes can become dominated by data irrelevant to users.

As an example, consider a database query to find all active jobs matching a set of keywords. Such a query could be implemented using PostgreSQL’s full-text search capabilities, backed by a specialized index. Normally, proper indexing boosts read performance, but in our case the results can be unsatisfactory. To understand why, imagine the database contains 500k job ads but only 2k are active (498k are deleted). Even if the query uses the index, it must wade through hundreds of thousands of irrelevant ads. The index consists mostly of irrelevant data, and the situation worsens every time an ad is deleted.
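For illustration, such a keyword query might look like this (the title column and the keyword 'ruby' are assumptions made for the example):

```sql
SELECT id, title
FROM job_ads
WHERE to_tsvector('english', description) @@ to_tsquery('ruby')
  AND deleted_at IS NULL;
```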

Fortunately, there’s an easy solution to the problem.

Partial Indexes to the Rescue

The solution is to create a partial index, a special index covering a subset of rows, instead of a full index. A partial index is created when a WHERE clause is provided to CREATE INDEX – that clause is used to determine which rows the index should cover.

The command to create a partial full-text GIN index to support the example query mentioned above can look like this:

CREATE INDEX job_ads_description_ft_idx ON job_ads
    USING GIN(to_tsvector('english', description))
    WHERE deleted_at IS NULL;

Once the index is in place, we can use EXPLAIN to ensure the planner is going to use it:
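An invocation along these lines produces the plan shown below (the keyword 'ruby' matches the tsquery in the plan; ANALYZE is what makes the actual timings appear):

```sql
EXPLAIN ANALYZE
SELECT *
FROM job_ads
WHERE to_tsvector('english', description) @@ to_tsquery('ruby')
  AND deleted_at IS NULL;
```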

                                                               QUERY PLAN
 Bitmap Heap Scan on job_ads  (cost=70.86..11274.25 rows=6046 width=22) (actual time=0.027..0.050 rows=0 loops=1)
   Recheck Cond: ((to_tsvector('english'::regconfig, description) @@ '''ruby'''::tsquery) AND (deleted_at IS NULL))
   ->  Bitmap Index Scan on job_ads_description_ft_idx  (cost=0.00..69.35 rows=6046 width=0) (actual time=0.023..0.045 rows=0 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, description) @@ '''ruby'''::tsquery)

The index is indeed in use and there’s no extra filtering step.

Compare that with a full index, where a separate filtering step is required. The query plan then looks like this:

                                                                    QUERY PLAN
 Bitmap Heap Scan on job_ads  (cost=6009.92..188010.45 rows=606045 width=22) (actual time=66.824..106.347 rows=604718 loops=1)
   Recheck Cond: (to_tsvector('english'::regconfig, description) @@ '''python'''::tsquery)
   Filter: (deleted_at IS NULL)
   Rows Removed by Filter: 44181
   Heap Blocks: exact=8752
   ->  Bitmap Index Scan on job_ads_description_ft_idx  (cost=0.00..5858.41 rows=651522 width=0) (actual time=65.658..65.658 rows=648899 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, description) @@ '''python'''::tsquery)

The line reading Filter: (deleted_at IS NULL) indicates an extra filtering step that removed over 44k rows in our example data set.
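Converting an existing full index into a partial one is a drop-and-recreate operation. A sketch, using CONCURRENTLY to avoid blocking writes on a live table (at the cost of not being able to run inside a transaction):

```sql
DROP INDEX CONCURRENTLY job_ads_description_ft_idx;

CREATE INDEX CONCURRENTLY job_ads_description_ft_idx ON job_ads
    USING GIN(to_tsvector('english', description))
    WHERE deleted_at IS NULL;
```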

Using a rudimentary synthetic benchmark, I was able to cut the query time from 170 ms to 75 ms. Naturally, that doesn’t mean your app will reap comparable benefits. The point is that there’s much room for improvement if you attack the problem with the right tools.

Using active_record_doctor

It’s easy to mistakenly create a full index on a table using deletion timestamps. That’s why I added a check to active_record_doctor that ensures all indexes on such tables are manually reviewed. The decision whether to create a full or a partial index requires judgment and cannot be automated. Fortunately, automation can bring such cases to developers’ attention and let them decide on a case-by-case basis.

If you use Active Record then you can install active_record_doctor and then run:
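Installation amounts to adding the gem to your Gemfile (the :development group is a common choice, though your setup may differ):

```ruby
# Gemfile
gem "active_record_doctor", group: :development
```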

bundle exec rake active_record_doctor:unindexed_deleted_at

The output will look like this:

consider adding `WHERE deleted_at IS NULL` to job_ads_description_ft_idx - a partial index can speed lookups of soft-deletable models

You should then decide whether the indicated indexes should be converted into partial indexes. If not, you should ignore them via the .active_record_doctor configuration file:

ActiveRecordDoctor.configure do
  detector :unindexed_deleted_at,
    ignore_indexes: ["job_ads_description_ft_idx"]
end
Rerunning active_record_doctor should not report that index again.

Closing Thoughts

Deletion timestamps (and deletion flags) are frequently used without taking their performance implications into account. That’s understandable, as it’s unrealistic to expect developers to consider every potential side effect of a schema change in a system comprising hundreds of tables. Fortunately, a little bit of automation can greatly increase awareness and help identify performance pitfalls before they hit production.

If you use Active Record (with or without Ruby on Rails), I recommend you give active_record_doctor a try.

If you liked the article and would like to get early-access to other deep-dive articles covering Ruby, Rails, and PostgreSQL then leave your email below.
