Database Performance: Redundant Indexes

Databases use sophisticated planning algorithms to determine best query execution strategy. An important aspect is deciding which indexes, if any, to use. This article explores why some indexes could be replaced by other existing indexes and how to identify them.

What Is an Index?

Ordinary database indexes are implement using B+ trees. We aren’t going to delve into their definition and implementation as it’d detract us from the goal of understanding why some indexes may be redundant. For the purposes of the article, you can think of an index as:

  1. A copy of the original table …
  2. containing a subset of its columns …
  3. along with a reference to the original row …
  4. and stored in sorted order.

This model is good enough to understand the principle behind redundant. Let’s work through an example demonstrating when one index can make another redundant. We’ll create a multi-column index and show how it can replace a single-column index.

Multi-column Index Enters the Scene

Imagine we’re working on a marketplace app where buyers and sellers can transact. Each transaction is recorded in the transactions table along with references to the buyer and seller. The figure below presents an example table.

id seller_id buyer_id status created_at
1 3 2 draft 2021-12-08 19:55:35
2 1 2 finalized 2021-12-08 20:18:38
3 2 3 aborted 2021-12-08 20:30:07
4 2 1 finalized 2021-12-08 21:38:38
5 1 3 finalized 2021-12-08 21:54:37
The table storing marketplace transactions. Note id is sorted but other columns are not.

One of requirements is that the seller should be able to list all transactions with a specific buyer. The list should be paginated and sorted starting with the most recent transaction. The query to fetch a page of results could look like the one below.

SELECT     *
  FROM     transactions
  WHERE    seller_id = 1 AND buyer_id = 2
  ORDER BY created_at DESC
  LIMIT    20
  OFFSET   40

transactions is going to contain millions of rows and if it’s unindexed then the query is going to require a slow sequential table scan. In order to improve performance, we could create a specialized three-column index on buyer_id, seller_id and created_at (in descending order).

CREATE INDEX transactions_seller_id_buyer_id_created_at_desc_idx
  ON transactions(buyer_id, seller_id, created_at DESC)

This index is likely to improve performance significantly. In my synthetic benchmarks on a table of 10 million rows adding the index reduced execution time from 8,000 ms to less than 1 ms.

Replacing a Single-column Index

Browsing transactions with a specific buyer is a frequent use case but certainly not the only one. There are other less frequent uses cases that need to access seller’s transactions. Those use cases are almost certain to select rows by seller_id and thus would benefit from an index on that column. The good news is we might be able to reuse the index created previously – no new index would be necessary.

To understand why, let’s visualize the index from the previous section as a table.

seller_id buyer_id created_at id
1 3 2021-12-08 21:54:37 5
1 2 2021-12-08 20:18:38 2
2 1 2021-12-08 21:38:38 4
2 3 2021-12-08 20:30:07 3
3 2 2021-12-08 19:55:35 1
A simplified representation of the index. id represents a reference to the original row in transactions.

We can immediately see that seller_id is already sorted. When executing a query, the database can turn that multi-column index into a single-column index on seller_id by pretending it contains no other columns. It is as if we’ve already created an index on seller_id.

It’s also evident buyer_id is NOT sorted so it’s not true any column would work. There’s a slightly more complicated rule at play. In general, a multi-column index on column1, column2, …, columnN can be used as an index on any prefix of the column list, that is:

  • column1
  • column1, column2
  • column1, column2, …, column[N - 1]

There are many other concerns we need to keep in mind though. For example, the reasoning above breaks down for unique indexes – a unique index on users(email, api_key) guarantees that a pair of values is unique but it’s still possible to have the same email with different API keys. If email must be unique then a dedicated unique index is necessary.

We won’t go into further details as we wanted to understand the basic principle in action. Time to find its practical applications.

Redundant Indexes in Practice

Let’s start by devising an algorithm for identifying redundant indexes. In order to do so, we need to be able to tell whether one index can be replaced by another. We’ll approach the problem backwards and assume we’ve already implemented a method called replacable_with?(index1, index2) that determines whether index1 can be replaced with index2.

With replacable_with? available for use, it’s easy to identify redundant indexes: those are indexes that can be replaced with another index. The example code below demonstrates how we could approach identifying them in Ruby.

redundant_indexes = indexes.reject do |index|
  indexes.any? do |other_index|
    index != other_index && replaceable_with?(index, other_index)
  end
end

That snippet, though simple, makes some implicit assumptions. It assumes there are no loops, i.e. there’s no situation where A replaces B and B replaces A. Under this limitation, we’ve just reduced the problem to defining and implementing replacable_with?.

In the previous section, we touched upon the fact a multi-column index can serve as an index on any prefix of its column list. A more sophisticated definition of replacable_with? is a topic for another article but the snippet in the figure below takes this one step further by taking uniqueness into account. Keep in mind there are many other properties that may affect whether one index can replace another – operator classes, partiality, index type (e.g. B+ tree, hash) and others.

def replaceable_with?(index1, index2)
  case [index1.unique, index2.unique]
  when [true, true]
    # If indexes are unique then must cover identical columns to be replaceable.
    index1.columns == index2.columns

  when [true, false]
    # Unique indexes cannot be replaced by non-unique ones.
    false

  else
    # In other cases check whether columns in index1 are a prefix of columns in
    # index2.
    prefix?(index1.columns, index2.columns)
  end
end

The beauty of this approach is that if the algorithm produces incorrect results (either false positives or false negatives) then it’s enough to refine replacable_with?.

Redundant Indexes in Rails Apps

If your app is Rails based (or Active Record-based in general) then you can take this one step further and use active_record_doctor to identify potentially redundant indexes.

All it takes is adding active_record_doctor to Gemfile, installing it and running:

bundle exec rails active_record_doctor:extraneous_indexes

The output will list redundant indexes (each on a separate line) along with any indexes they could be replaced with. For example:

remove index_users_on_last_name - can be replaced by index_users_on_last_name_and_first_name_and_email or unique_index_on_users_last_name_and_first_name

In the example above, index_users_on_last_name can be removed in favor of the other indexes mentioned on the right.

In case you wanted to retain some of the reported indexes then you can configure active_record_doctor to either skip individual indexes or all indexes on individual tables. Please consult the documentation for more information.

Closing Thoughts

Indexing is both a science and an art. A complex database schema usually comes with comprehensive index coverage and simple performance tuning optimizations are hidden there. Realizing how multi-column indexes work could help save disk space and improve write performance without harming read performance.

If you’re working on a Rails app then I recommend you give active_record_doctor a try.

Leave your email below to regularly receive artices on building excellent products, code, and teams.

Leave your email to receive updates about articles.