Addressing the NoSQL Criticism

There were quite a few NoSQL critics at OSCON this year. I imagine this was true of past years as well, but I don’t know that first hand. I think there are several reasons behind the general disdain for NoSQL databases.

First, NoSQL is horrible name. It implies that there’s something wrong with SQL and it needs to be replaced with a newer and better technology. If you have structured data that needs to be queried, you should probably use a database that enforces a schema and implements Structured Query Language. I’ve heard people start redefining NoSQL as “not only SQL”. This is a much better definition and doesn’t antagonize those who use existing SQL databases. An SQL database isn’t always the right tool for the job and NoSQL databases give us some other options.

Second, there are way too many different types of databases that are categorized as NoSQL. There are document-oriented databases, key/value stores, graph databases, column-oriented databases, in-memory databases, and other database types. There are also databases that combine two or more of these properties. It’s easy to criticize something that is vague and loosely defined. As the NoSQL space matures, we’ll start to get some more specific definitions, which will be much more helpful.

Third, at least one very popular vendor in the NoSQL space has a history of making irresponsible claims about their database’s capabilities. Antony Falco of Basho (makers of Riak) has a great blog post on the topic: See It’s Time to Drop the “F” Bomb – or “Lies, Damn Lies, and NoSQL.” If you care about your data, please read Tony’s blog post. It’s unfortunate that the specious claims of a few end up making everyone in the NoSQL space look bad.

I also want to address some of the specific criticisms that I’ve heard of NoSQL, as they apply (or don’t apply) to CouchDB (I’m not familiar enough with other NoSQL databases to talk about those).

SQL Databases Are More Mature

This is absolutely true. If you pick a NoSQL database, you should do your homework and make sure that your database of choice truly respects the fact that writing a reliable database is a very difficult task. Most of the NoSQL databases take the problem very seriously, and try to learn from those that have come before them. But why create a new type of database in the first place? Because an SQL database is not the right solution to every problem. When all you have is a schema, everything looks like a join. The data model in CouchDB (JSON documents) is a great fit for many web applications.

Scaling CouchDBSQL Scales Just Fine

This is also true. If you’re picking a NoSQL database because it “scales”, you’re likely doing it wrong. Scaling is typically more aspiration than reality. There are many other factors to consider and questions to ask when choosing a database technology other than, “does it scale?” If you do actually have to scale, then your database isn’t going to magically do it for you. You can’t abstract scaling problems to your database layer. However, I will say that many NoSQL databases have properties (such as eventually consistency) that will make scaling easier and more intuitive. For example, it’s dead simple to replicate data between CouchDB databases.

Atomicity, Consistency, Isolation, and Durability (ACID)

CouchDB is ACID compliant. Within a CouchDB server, for a single document update, CouchDB has the properties of atomicity, consistency, isolation, and durability (ACID). No, you can’t have transactions across document boundaries. No, you can’t have transactions across multiple servers (although BigCouch does have quorum reads and writes). Not all NoSQL databases are durable (at least with default settings).

If you want the best possible guarantee of durability, you can change CouchDB’s delayed_commits configuration option from true (the default) to false. Basically, this will cause CouchDB to do an explicit fsync after each operation (which is very expensive and slow). Note that operating systems, virtual machines, and hard drives often lie about fsync, so you really need to research more about how your particular system works if you’re concerned about durability. If you think your write speeds are too good to be true, they probably are.

If you leave delayed commits on, CouchDB has the option of setting a batch=ok parameter when creating or updating a document. This will queue up batches of documents in memory and write them to disk when a predetermined threshold has been reached (or when triggered by the user). In this case, CouchDB will respond with an HTTP response code of 202 Accepted, rather than the normal 201 Created, so that the client is informed about the reduced integrity guarantee.

Consistency Checks

At least one NoSQL database requires a consistency check after a crash (guess which one). This can be a very slow process, causing additional downtime. CouchDB’s crash-only design and append-only files means that there is no need for consistency checks. There’s no shut down process in CouchDB—shutting it down is the same as killing the process.

Compaction

CouchDB’s append-only files do come at a cost. That cost is disk space and the need for compaction. If you don’t compact your database, it will eventually fill up your hard drive. There is no automatic compaction in CouchDB. Compaction is triggered manually (it can easily be automated through a cron job) and should be done when the database’s write load is not at full capacity.

Writing and Querying MapReduce Views in CouchDBMapReduce is Limiting and Hard to Understand

It can take some time to get up to speed with MapReduce views in CouchDB. However, it’s not a very difficult concept to understand and most developers are already proficient with JavaScript (the default language for Map and Reduce functions in CouchDB). There’s a lot you can do with MapReduce, but there are some limitations. Views are one dimensional so full text indexing and geospatial data are difficult (if not impossible) to index. However, there are plugins for integrating with Lucene and ElasticSearch. For geospatial data, you can use GeoCouch.

No Ad Hoc Queries

This is a feature, not a bug. CouchDB only lets you query against indexes. This means that queries in CouchDB will be extremely fast, even on huge data sets. Most web applications have predefined usage patterns and don’t need ad hoc queries. If you need ad hoc queries, say for business intelligence reporting, you can replicate your data (using CouchDB’s changes feed) to an SQL database.

Building Indexes is Slow

If you have a large number of documents in CouchDB, the first build of an index will be very slow. However, each query after that will be very fast. CouchDB’s MapReduce is incremental, meaning new or updated documents can be processed without needing to rebuild the entire index. In most scenarios, this means that there will be a small performance hit to process documents that are new or updated since the last time the view was queried. You can optionally include the stale=ok parameter with your query. This will instruct CouchDB to not bother processing new or updated documents and just give you a stale result set (which will be faster than processing new or updated documents). As of CouchDB 1.1, you can include a stale=update_after parameter with your query. This will return a stale result set, but will trigger an update of the index (if necessary) after your query results are returned, bringing the index up-to-date for future queries by you or other clients.

No Schema

Some say that not having a schema is a problem. Sure—if you have structured data, you probably want to enforce a schema. However, not all applications have highly structured data. Many web applications work with unstructured data. If you’ve encountered any of the following, you may want to consider a schema-free database:

  • You’ve found yourself denormalizing your database to optimize read performance.
  • You have rows with lots of NULL values because many columns only apply to a subset of your rows.
  • You find yourself using SQL antipatterns such as entity-attribute-value (EAV), but can’t find any good alternatives that fit with both your domain and SQL.
  • You’re experiencing problems related to the object-relational impedance mismatch. This is typically associated with use of an object-relational mapper (ORM), but can happen when using other data access patterns as well.

I’ll add that you can enforce schemas in CouchDB through the use of document update validation functions.

Anything Else?

Did I miss anything? What other criticisms exist of NoSQL databases? Please comment and I’ll do my best to address each.

15 Comments

  1. Posted July 30, 2011 at 4:54 pm | Permalink

    You said:

    “However, I will say that many NoSQL databases have properties (such as eventually consistency) that will make scaling easier and more intuitive. For example, it’s dead simple to replicate data between CouchDB databases.”

    That’s not really true: replication and eventual consistency increase availability and fault tolerance. Eventual consistency also allow a greater write throughput/lower latency through latency-leveling (at the cost of losing the ability to do atomic read-modify-write loops), but fundamentally if you have more data that a single node can handle with reasonable latency (which, for most web workloads is roughly 3-5 times the available memory) you need to partition your data (usually via either range based partitioning or consistent hashing).

  2. Posted July 31, 2011 at 12:20 am | Permalink

    If you want to take shots at MongoDB and 10gen, at least have the guts to name them.

  3. Posted July 31, 2011 at 1:17 am | Permalink

    “No, you can’t have transactions across document boundaries.”

    Isn’t “No” – you shouldn’t – not “you can’t”. Because, as yet there’s no support built-in for transaction bundles.

    The NoSQL movement stopping many of us getting to “good enough”?

  4. Posted July 31, 2011 at 1:39 am | Permalink

    @Alex: That was intended as a side note and perhaps I could have gone into more detail. You’re absolutely right. However, I was thinking of read-heavy applications when I wrote that note. CouchDB’s replication won’t help you much with write-heavy applications. If you have a write-heavy application then you’ll need to look at something like BigCouch, Lounge, or Pillow (which all implement sharding). I cover all of this in my book, Scaling CouchDB.

  5. Posted July 31, 2011 at 1:46 am | Permalink

    @Luigi: I think it’s important for people to do their own research and not just take my word (or that of anyone else) at face value. If I just said, “MongoDB is a terrible database”, then what good would that do? The links I provided are good starting points for doing some of that research (yes, even the “MongoDB Is Web Scale” video—it’s mainly humorous, but does contain a few things worth noting). Besides, MongoDB isn’t the focus of this post.

  6. Posted July 31, 2011 at 1:50 am | Permalink

    @Clive: I’m not sure I understand your point. Are you suggesting that CouchDB should support transaction bundles, or are you saying that it’s a good thing that it does not?

  7. Posted July 31, 2011 at 2:08 am | Permalink

    @Bradley: I’d like to see support for transaction bundles from the NoSQL movement. Bundles in CouchDB would be great. Because using SQL is over engineering and prohibits horizontal scaling.

  8. Posted July 31, 2011 at 2:38 am | Permalink

    @Bradley

    It isn’t just about scaling reads vs. writes: it’s also the issue of maintaining reasonable throughput and latency when the amount of data is several times (usually 3-5) larger than a single machines memory. In this case, you need to partition the data (Google for “The Case for Shared Nothing” for a discussion of why you need to do this rather than just use a heavy machine and a SAN).

    Out of the systems you’ve mentioned, I am familiar with BigCouch (I’ve known one of its developers): it is indeed a horizontally scalable partitioned system, so it would fit the bill.

  9. Posted July 31, 2011 at 3:26 am | Permalink

    First, NoSQL is a marketing term, not a technology term. Once we acknowledge that, all the arguments about “what exactly is NoSQL” are moot.

    Second, non-relational data management does optimize better than relational data management — but only for a subset of usage of the data. The process of designing a non-relational database is the same as designing a denormalized relational database, e.g. a Data Warehouse. You need to itemize the queries you run against the non-relational data store, and design the data store with those queries in mind.

    It’s not surprising that NoSQL _can_ be faster and more scalable for those queries, if you do this right. So can DW be very scalable, but with similar limitations on the queries that are served by a given DW schema.

    It’s also very easy to get your NoSQL database design wrong if you skip your query analysis step, because one believes the marketing message of “just start putting data in.” This explains some of the disappointments over the past couple of years when companies tried to use NoSQL as a drop-in replacement for RDBMS.

    One workaround is to store data redundantly, in different document collections optimized to serve different query patterns. This is also similar to DW, materialized views, or other denormalizations.

    The problem with NoSQL is not the technology. It’s fine technology when used appropriately, and it fits an important specialty role in data management. The problem is the hype, the advocacy, and the claims that one can get optimization for free without doing analysis. TANSTAAFL is still true.

  10. Posted August 1, 2011 at 11:00 am | Permalink

    @Clive: I think limiting transaction boundaries to one document in CouchDB is a good thing. Having transactions across document boundaries would bring a ton of extra complications along with it that aren’t worth the benefits. If you think you need this feature then ask yourself this: why isn’t all of the related data in one document to begin with?

  11. Posted August 1, 2011 at 11:29 am | Permalink

    @Bill: Thank you for your comments—glad to have an SQL expert weigh-in on the subject! I’ll do my best to address each of your points…

    “First, NoSQL is a marketing term, not a technology term.”

    I completely agree. I personally hate the term “NoSQL” and try to avoid it whenever possible and use a more meaningful term instead (e.g. “document-oriented database”). For better or for worse, it is the word that people are using to describe the class of databases that don’t implement SQL.

    “Second, non-relational data management does optimize better than relational data management — but only for a subset of usage of the data.”

    I’d argue that this “subset” of data usage patterns tends to correlate with the problems encountered by many web applications. For example, content management systems and wikis are a great fit for document-oriented databases. However, I’d like to re-iterate my point that scalability, and to a lesser extent performance, shouldn’t be your only concerns when picking a database technology. Other concerns such as developer productivity, ease of deployments, and fitting your data model to your database are all very important concerns as well. In most situations, I think picking a NoSQL database because it’s “more scalable” or “faster” are warning signs that one hasn’t really considered all of the options thoroughly—and that one probably doesn’t understand what “scalable” and “fast” really mean, in context of one’s entire system.

    “It’s also very easy to get your NoSQL database design wrong if you skip your query analysis step, because one believes the marketing message of “just start putting data in.””

    Database vendors that imply you can “just start putting data in” are being quite irresponsible, in my opinion. I certainly don’t advocate skipping the query analysis step, even when using a NoSQL database. In fact, I think it’s even more critical that you think about your schema design when your database doesn’t enforce a schema. Without guard rails, it’s very easy to fly off the cliff. I recommend using the building block patterns outlined in domain-driven design and serializing each aggregate root (and the entire object graph associated with the aggregate root) to a single JSON document. Designing your aggregate roots involves a large amount of analysis.

    “The problem is the hype, the advocacy, and the claims that one can get optimization for free without doing analysis. TANSTAAFL is still true.”

    Again, I completely agree. I think the hype has been counterproductive to the NoSQL space. Speaking from personal experience, I can say it’s the hype that kept me from exploring NoSQL databases sooner than I did. If you can look past the hype, I think you’ll see that there are some really useful tools to be found in the NoSQL space.

  12. Posted August 4, 2011 at 4:09 am | Permalink

    Hi!

    no criticism of CouchDB but this here:

    “CouchDB is ACID compliant. Within a CouchDB server, for a single document update, CouchDB has the properties of atomicity, consistency, isolation, and durability (ACID). No, you can’t have transactions across document boundaries.”

    is a contradiction, or if it’s not a contradiction, a redefinition of terms that is very confusing.

    In the context of database operations, ACID refers to the properties of a database *transaction*. The concept of a transaction explicitly includes the possibility of multiple interactions with the database. This has been the case since the term ACID was first coined (http://cc.usst.edu.cn/Download/5b953407-339b-46c3-9909-66dfa9c3d52a.pdf).

    What you describe for Couchdb is what is known as an “atomic write” or an “atomic update”. In this terminology, the term atomic has the same meaning as the A in ACID, namely that the operation either completes as a whole or fails as a whole. But in the CouchDB case that operation is not a transaction but an individual write, whereas the term ACID is historically linked to operations that are transactions. Because the term ACID has always been used to describe the properties of *transactions* (and not individual writes), claiming that CouchDB is “ACID compliant” suggests that CouchDB supports ACID-transactions, which it clearly doesn’t.

    To be clear: I am not disputing whether or not the individual properties that make up the ACID acronym apply to CouchDB database writes/updates. I am simply observing that these writes/updates are not transactions. Given that the ACID acronym has such a long history of describing the properties of database transactions, a claim that CouchDB is “ACID-Compliant” will lead to at least some people drawing the incorrect conclusion that CouchDB supports ACID transactions.

    You can of course keep say something like “ACID properties are applicable to CouchDB database operations” (which would be correct although it still has the potential of being misunderstood as though CouchDB has ACID transactions). But let’s consider what ACID means in an environment that only supports atomic writes/updates.

    By definititon, an atomic write is atomic, so that accounts for the A in ACID. So how can we account for the CID part? I know CouchDB supports a validate_doc_update() function so let’s assume that gives you the C. Next up is the I – Isolation.

    I am having a very hard time to come up with a non-trivial meaning for the Isolation property if you’re only handling atomic writes/updates. In the case of ACID-transactions, Isolation describes to what extent sessions are aware of the state of transactions in other sessions. If you’re only handling atomic single operation writes/updates, there is no state, and thus, nothing to isolate.

    Finally, you address durability yourself, and you mention that true durability can be satisfied by disabling the delayed_commits setting, in which case the responsibility for durability is left to the fsync implementation of the underlying platform. At the same time, you seem to imply that this is not really recommended, or at least out of the ordinary.

    So, my conclusions are that:
    1) by default, CouchDB satisfies only the Atomicity in ACID.
    2) if the user chooses to implement validate_doc_update(), they can achieve Consistency (if they will in fact do that, depends on how they implement it).
    3) Isolation is meaningless since there are no transactions
    4) Durability is not enabled by default, and can be achieved only if you can afford it to be slow.

    Given all these ifs and buts, I think it would be more clear to simply not call CouchDB “ACID-Compliant”. It would at least avoid considerable confusion for those that are working with systems supporting ACID transactions.

  13. Posted August 5, 2011 at 12:06 pm | Permalink

    @Roland: Thank you for your thorough comment!

    Atomicity, Consistency, Isolation, and Durability (ACID) have very specific technical meanings. By definition, all transactions must be ACID compliant, but transactions are not the only way to achieve ACID compliance. The ACID properties are the building blocks from which transactions are defined. To say that “ACID” is a synonym of “transaction” is an oversimplification. From the paper you linked to:

    “These fours properties, atomicity, isolation, and durability (ACID), describe the major highlights of the transaction paradigm, which has influenced many aspects of development in database systems.”

    One interesting thing to note is that most, if not all, RDMSs have a concept of implicit transactions. By default, MySQL (and other RDMSs) run in what’s called “autocommit mode”. If you don’t explicitly begin a transaction, the database does a commit (if no error) or rollback (if there’s an error) after each statement. In an RDMS, it’s not just multiple statements run together in a transaction that are ACID compliant, but single statements in implicit transactions are also ACID compliant. In CouchDB, you are limited to the equivalent of this autocommit mode and implicit transactions. These “transactions” are ACID compliant, just like their counterparts in RDMSs.

    CouchDB implements more than just Atomicity. I think it would be helpful to look at what ACID means, in the context of CouchDB.

    Atomicity: Individual document updates succeed or fail (“all or nothing”). This includes the ability to run document update validation and reject the update if it fails validation.

    Consistency: On a single node, any view queries will show a consistent snapshot of the database. In other words, all of the writes that have succeeded will show up in view results. Even if you specify in your query that you’re OK with stale results you will still get a consistent, but old, view of the database.

    Isolation: CouchDB achieves isolation through its multi-version concurrency control (MVCC). Document updates are isolated to the previous version of the document. For example, I can update revision 3-a734 to revision 4-ca7e (revision numbers contrived and shortened). As you can see, these revision numbers represent the state of the document and reads and writes are isolated based on the document’s state.

    Durability: There is no such thing as complete durability. Hard drives crash, data centers explode, entire regions of the country lose power at the same time (so even having multiple data centers may not always help). True durability is a matter of looking at your entire system, not just your database. However, a baseline of durability for databases can be defined as handing off the data to the filesystem and being told that the file has been written to disk. An explicit fsync is not required to meet this definition of durability. Many servers have RAID controllers with battery backed caches. Even with external power cut, these controllers will still get the data written to disk without an explicit fsync. However, I would argue that databases that cache writes to RAM are not durable as these systems now rely on external power and the machine itself not crashing. This is why CouchDB will give a “202 Accepted” rather than a “201 Created” if you specify that batch mode is OK when writing. It’s also worth nothing that *any* database that requires an fsync will be slow to write data (assuming the operating system truly performs an fsync when told to). This is a matter of the laws of physics and spinning disks (SSDs are a different story). Again, if your database writes appear too fast to be true, they probably are.

  14. Posted August 5, 2011 at 8:33 pm | Permalink

    Hi Bradley,

    thanks for your kind words – I forgot to thank you for your original article, which I found highly useful (despite my longwinded comment).

    I just read your retort, and I’d like to comment on that some more:

    “In CouchDB, you are limited to the equivalent of this autocommit mode and implicit transactions. These ‘transactions’ are ACID compliant, just like their counterparts in RDMSs.”

    I understand why you compare CouchDB’s “atomic write/update” to autocomitting ACID transactions. However, I’d argue that they are not the same. I have written about this in some detail a few years ago here: http://rpbouman.blogspot.com/2007/02/mysql-transactions-and-autocommit.html

    The gist of it is that a transaction – autocommitting or not – can (and often will) encompass multiple individual row-level operations. I argue that what I called an “atomic write” in CouchDB should be compared to one such row-level operation in a RDBMS.

    Assuming we can agree on that, then in MySQL it is easy to demonstrates how an SQL statement that affects multiple rows in a table backed by the transactional InnoDB engine in autocommit mode differs from an equivalent SQL statement targeted at a table backed by the non-transactional MyISAM engine. In the InnoDB case, if one single row-level operation fails due to a violation of a database constraint, then all row-operations caused by the current statement up to that immanent violation will be rolled back. In the MyISAM case, the statement simply aborts without undoing any changes caused by the statement prior to the constraint violation.

    So basically, autocommit mode does not fundamentally change the nature of transactions, it merely automatically demarcates them for each statement. But that is independent of the fact that a statement may affect many individual writes that are either committed or rolled back atomically. As far as I can see, CouchDB has no such functionality. Batch operations in CouchDB are capable of executing multiple write operations, but as far as I understand, if such a batch operation is interrupted, changes up until the interrupt will have been effectuated (just like in the MySQL MyISAM example in my article).

    “Consistency: On a single node, any view queries will show a consistent snapshot of the database. In other words, all of the writes that have succeeded will show up in view results.”

    This is an interesting point. It depends on the definition of “consistent snapshot”. I just don’t know enough about CouchDB, so I wonder: suppose we have session 1 executing a view that would, in absence of any other, concurrent operations return documents A,B and C. Now suppose another session 2 is started immediately after session 1 was started, but before it returns document C. Now let’s assume session 2 deletes document C. Will session 1 return document C? I should probably come up with a few more tests for this before I can fully understand and compare this to ACID transactions.

    Concerning isolation: I believe I understand what isolation means in the CouchDB case. My point was mainly that because of the absence of a transaction concept, this kind of isolation is virtually not comparable to the concept of isolation in a transactional system. In the case of atomic writes, there doesn’t seem to be any opportunity to witness whether sessions are isolated from each other. Calling this kind of isolation “ACID-isolation” seems a major source of confusion to me.

    Regarding durability: I agree this is a thorny subject.

  15. Posted August 8, 2011 at 2:06 pm | Permalink

    @Roland: I’m glad you found the original post useful! For what it’s worth, I didn’t find your comment long-winded. It challenged me to really think about what ACID means in the context of CouchDB.

    You bring up some interesting points. However, I still maintain that even though CouchDB does not support transactions, it is ACID compliant. I agree that this can be confusing. Being ACID compliant is often thought of as being the same as supporting transactions.

    On a side-note, this got me wondering if MyISAM was ACID compliant or not. Everyone (I hope) knows that it doesn’t support transactions—but are individual statements ACID compliant? I’m almost completely sure that they are not. In which case, I think one could non-redundantly say that MyISAM is neither ACID compliant nor does it support transactions.

    To re-iterate your scenario in CouchDB terminology: Client 1 requests a view that returns data from Documents A, B, and C. I believe the view as it exists at the moment the response begins to be sent will be returned to Client 1. Effectively, this means that all writes that have succeeded, up until that moment, to Documents A, B, and C will be visible to Client 1 as part of the view’s response. Client 2 deletes Document C immediately after CouchDB started building the response for Client 1, but before that response has been sent to Client 1. In this case, I believe that CouchDB will include the data from the previous revision (before the delete) of Document C in the response to Client 1. This is the data that had been indexed, and at the point-in-time that the view was requested it was a valid and consistent view of the database.

    In CouchDB, reads don’t block writes, and writes don’t block reads. This is one of the benefits of its multi-version concurrency control (MVCC) and append-only files. To be ACID compliant, only writes that have fully completed will be read. Not sure if this helps, but an interesting conversation regardless!

One Trackback

  1. [...] Karwin’s comment is spot [...]