Database Indexes As First-Class Citizens

I’ve found that working with CouchDB has increased my proficiency with MySQL. These are two completely different types of databases, so what could I possibly learn about one from using the other? One thing that both CouchDB and MySQL (or any other relational database) have in common is that they both use B-tree (technically B+ tree in CouchDB) indexes to make data retrieval more efficient. In CouchDB, it’s not possible to query against anything other than an index, whereas in MySQL you can create arbitrary SQL queries without having an index. In CouchDB, you manually build your indexes—called views—using MapReduce functions that are run incrementally against your documents. This gives you an incredible amount of flexibility, but does require some thought when designing your indexes/views. If you want to create an index in MySQL, you simply tell MySQL what column or combination of columns you want indexed and what type of index you want.

Working with CouchDB has taught me to treat database indexes as first-class citizens. A common approach to working with MySQL involves simply adding indexes where needed when your queries start to slow down. A better approach is to carefully consider the design of your indexes as part of your overall database design—not as an afterthought when your database starts to get slow.

The other day, I ran into a simple problem in MySQL that I solved differently then I would have before I started working with CouchDB. This was not a very complicated problem by any stretch of the imagination, but it does illustrate how CouchDB has influenced my approach to designing databases in MySQL. I wanted to track successful and failed login attempts by IP address. Here is the first table design that came to mind:

CREATE TABLE `login_attempt` (
 `ip_address` varchar(39) NOT NULL,
 `success` tinyint(3) unsigned NOT NULL default '0',
 `fail` tinyint(3) unsigned NOT NULL default '0',
 PRIMARY KEY  (`ip_address`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Notes:

The IP address field is varchar(39) to account for IPv6 addresses.
It would be more efficient to store IP addresses as integers. I could have used MySQL’s INET_ATON and INET_NTOA functions to convert IP addresses to and from their integer representations. However, these functions don’t support IPv6 yet so I didn’t bother.

This design is workable, but has a few potential problems:

Whenever I need to increment the `success` or `fail` fields, I would need to first check if a row exists yet for the IP address. This complicates the logic and creates unnecessary work for the database. I would have to do a SELECT and then either an INSERT or an UPDATE (or just try an INSERT and, if it fails, then go ahead and do an UPDATE—just as much work for the database).
Each update would temporarily lock the row that is being updated (since I’m using the InnoDB storage engine). In this scenario, it’s unlikely that multiple login attempts would be coming from the same IP address concurrently. However, it’s still a potential bottleneck and could be a problem in other use cases where rows might see more concurrent updates.

Here is the alternative table design:

CREATE TABLE `login_attempt` (
 `ip_address` varchar(39) NOT NULL,
 `success` tinyint(1) NOT NULL,
 `timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
 KEY `login_attempt_ip_address_success` (`ip_address`,`success`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

With this design, a new row is inserted for each login attempt (astute readers will notice that this is similar, conceptually, to Event Sourcing). Rows are treated as immutable and are never updated. The `success` column can have one of two values: 0 for false or 1 for true. An example of a query to find the number of successful and failed login attempts:

SELECT `success`, COUNT(*) AS `count`
FROM `login_attempt`
WHERE `ip_address`='127.0.0.1'
GROUP BY `success`;

An example of the results:

+---------+-------+
| success | count |
+---------+-------+
|       0 |    13 |
|       1 |    19 |
+---------+-------+

This tells us that there are 13 failed login attempts and 19 successful login attempts from the given IP address. Without an index on the combination of `ip_address` column (for the SELECT clause) and `success` column (for the GROUP BY clause), the above query would be extremely inefficient. However, with the correct indexing it's almost as efficient as the original approach with the added benefit of avoiding row-level locking on updates. This design has the added benefit of being able to "prune" login attempts older than a certain date (although I should index the `timestamp` field if I were to do that).

What do you think of this approach? The `login_attempt` table will certainly grow larger with this alternative approach (storing IP addresses as integers could help with this). Even with indexing, this alternative approach will probably be slightly less efficient than the original for reads (but more efficient for writes). Regardless, the point is that you should carefully design your indexes as part of your upfront database design as it could allow for alternatives that you might not have otherwise considered.

2 Comments

Dave Cottlehuber

Posted May 6, 2011 at 7:41 am | Permalink

Great post illustrating the different thinking in CouchDB – and how it’s not mandated just for NoSQL! Treating data as immutable events, and using incremental map-reduce makes it easy to add functionality at a later date – such as “failed logins in last 30 days”.

What other design patterns have you run into while using CouchDB?
Bradley Holt

Posted May 6, 2011 at 1:58 pm | Permalink

Dave, thanks for your comments! Generally speaking, CouchDB has instilled in me a greater appreciation for, and understanding of, database indexes. I can intuit much quicker now a good design for an index based on a given query. As for other design patterns, I will make another blog post if I think of any!

Bradley Holt