CouchApps at jQuery Conference Boston 2011

Here are the slides from today’s jQuery Conference presentation on CouchApps with CouchDB & jQuery:

If you were in this talk, please give me feedback on SpeakerRate.

Speaking at CouchConf NYC

I’m pleased to announce that I’ll be speaking at CouchConf New York City on October 24, 2011. This event is part of the CouchConf World Tour presented by Couchbase. My talk will be on CouchApps with CouchDB, JavaScript and HTML5. From the talk description:

In this talk we’ll see how to build CouchApps using CouchDB, JavaScript, and HTML5. We’ll look at related tools such as the couchapp command line tool, the Evently jQuery plugin, the CouchDB API jQuery plugin, the CouchApp Loader, Pathbinder, and the Mustache templating framework.

Defining a RESTful Framework

Web application frameworks have varying support for the concepts behind Representational State Transfer (REST). Most web application frameworks, if not all, allow you to create “fully” RESTful web applications. However, there does not seem to be a focus on explicitly applying RESTful principles. So, here are the key concepts that I’d like to see addressed (a rough sketch follows the list):

  • Embrace, and don’t abstract, the Hypertext Transfer Protocol (HTTP).
  • Focus on entities/resources—identified by full Uniform Resource Identifiers (URIs).
  • Use HTTP methods (GET, POST, PUT, DELETE, OPTIONS, HEAD) to perform operations on entities/resources.
  • Allow for self-described messages through the use of header fields, such as Accept and Content-Type.
  • Make hypermedia controls a core concept (perhaps using PHACTOR as a starting point), not just a byproduct of rendering.
  • Think of web applications as state transition systems. Representations of entities/resources are states, and hypermedia controls define the available state transitions.
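
Here’s a rough sketch in PHP of what a few of these concepts might look like when applied explicitly. This isn’t from any particular framework, and all of the names here are hypothetical; it simply dispatches on the HTTP method, includes hypermedia controls in the representation, and negotiates the response format via the Accept header:

<?php
class EntryResource
{
    // GET returns a representation of the entity, including hypermedia
    // controls (links) that define the available state transitions
    public function get($id)
    {
        return array(
            'id'    => $id,
            'title' => 'Hello, world.',
            'links' => array(
                array('rel' => 'self', 'href' => '/entries/' . $id),
                array('rel' => 'edit', 'href' => '/entries/' . $id),
            ),
        );
    }
}

$resource = new EntryResource();
$method   = strtolower($_SERVER['REQUEST_METHOD']);

// Use HTTP methods to perform operations on the resource
if (!method_exists($resource, $method)) {
    header('HTTP/1.1 405 Method Not Allowed');
    header('Allow: GET');
    exit;
}

$representation = $resource->$method(1);

// Self-described messages: honor the Accept header (only JSON is sketched)
$accept = isset($_SERVER['HTTP_ACCEPT']) ? $_SERVER['HTTP_ACCEPT'] : '';
if (false !== strpos($accept, 'application/json')) {
    header('Content-Type: application/json');
    echo json_encode($representation);
} else {
    header('HTTP/1.1 406 Not Acceptable');
}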

Benefits:

  • Interoperability: Focusing on open standards allows for easier integration with other systems.
  • Cacheability: Embracing HTTP gives you many caching options, almost for free.
  • Testability: Self-contained and self-describing messages are very testable. Decomposing a complex system into states and available state transitions greatly reduces the complexity of the system and its tests.

The Case For Rapid Release Cycles

There has been some discussion recently on the Zend Framework mailing list around release cycles. I proposed a release cycle of six months for major versions (someone else suggested eighteen months, which may be more reasonable for a framework). Rapid releases allow one to accelerate the cycle of building, measuring, and learning. Gathering data from actual usage (measuring) provides an opportunity for learning that can be applied to the next release (building).

Zend Framework 2.0 should be released soon, and it has been four years since the last major release (1.0). This is not to imply that Zend Framework has been stagnant—far from it. There has been a ton of development effort and many improvements to Zend Framework since 1.0. I have a great amount of trust in the team, and I have complete confidence that Zend Framework 2.0 will be an awesome framework. This post is intended to make the case for rapid release cycles for software in general, and is not meant to be a criticism of Zend Framework or the development processes behind it. However, the discussions around Zend Framework’s release cycle are what’s prompting me to make this post.

First, let me describe what I mean by a “rapid release cycle”. In this context, I mean rapid releases of major versions. Put simply, major versions are those that allow backwards compatibility breaks. This is somewhat controversial. I don’t think anyone really has any big concerns with the rapid release of minor (introduction of new features while maintaining backwards compatibility) or maintenance (bug and/or security fix) releases. “Rapid” depends on the context. Both Chrome and Firefox have adopted a six-week release cycle. As I mentioned before, six months could be considered “rapid” for a framework.

For a framework (and maybe for other software), I think the following rules are necessary in order for a rapid release cycle to work:

  • Minimize backwards compatibility changes between major releases. Targeted and strategic refactoring, rather than major overhauls, is preferable if you are releasing often. Small backwards compatibility changes make migrating from one major version to another much easier.
  • Mark some major releases as “Long Term Support” (LTS) releases. Provide bug fix updates and security patches to these releases for three to five years. This provides a “safe” option to those who value stability and don’t want to upgrade very often. In the context of Zend Framework, it is obviously Zend’s decision whether they want to take on this burden. If not, then I don’t think a rapid release cycle is viable.

What are the concerns with a rapid release cycle? I’ll paraphrase, and then address, the major concerns that I’ve heard.

“Rapid releases of major versions are just for psychological effect and have no effect on the delivery pace of new features.” This is both true and false. See my earlier post on iterative vs. incremental development. If development is incremental and driven entirely by a pre-determined roadmap, then there are no tangible differences between a “normal” and a rapid release cycle. The development of many consumer software packages is perceived as incremental, in which case major version bumps are mostly psychological. However, if you take an iterative development approach and build outside learning from end-users into your process, then a rapid release cycle gives you the chance to change course based on outside feedback. Learning opportunities are introduced that you would never have had if your software weren’t actually used by real people in the real world.

“Rapid release cycles are for consumer software where you don’t have to care for backward compatibility.” This is related to the previous concern. My response is that rapid release cycles are for any product where learning from real-world usage and outside input can be used to improve the product. To quote Steve Blank, “There are no answers inside the building.”

“It forces people to upgrade too often and rewrite their code, or get left behind.” See my earlier note about minimizing backwards compatibility changes in each major release. Additionally, it is much easier to automate upgrades if the backwards compatibility changes are small. With each major version upgrade of the framework, applications built on it should require little code rewriting.

“Having lots of end-of-life (EOL) versions being used could be a security risk.” See my earlier note about providing LTS releases. Each major release should come with a pre-determined EOL date. It is the responsibility of the end-user (in the case of a framework, the developer) to be aware of a release’s EOL date. Using EOLed software is always a security risk.

While not specifically a concern with rapid release cycles, there’s a general mentality that major releases are “our chance to get it right.” Hopefully you’re a better software developer than you were even six months ago. Chances are you know more than you did then, and would approach solving problems differently now. Think six months before that, and six months before that. Now project this into the future. Where will you be in six months? Will you know more than you do now? Will you approach solving problems differently than you do now? If you’re a good software developer, you will never get it “right”—you will always be better six months from now than you are today and know more than you know today. A rapid release cycle allows you to apply new learning, knowledge, and perspective as often as possible. Do your best today, and give yourself opportunities to do your best in the future as well.

Propose a Session for Vermont Code Camp 2011

Vermont is a beautiful place to visit—especially in the fall! We’re looking for Vermonters and non-Vermonters alike to speak at this year’s Vermont Code Camp. Vermont Code Camp is organized entirely by community volunteers, with the help of our great sponsors (we’re still accepting sponsorships, too). Vermont Code Camp is a polyglot event. We’re looking for sessions on .NET, PHP, Ruby, Python, Java, and more. Abstracts are due this Friday, August 12, and we’re going to try to have the session list available by August 19. Check out the 2010 schedule to get an idea of what we had for talks last year.

Personally, I’d like to encourage submissions on the following topics:

  • More talks related to free/open source software (PHP, Ruby, Python, etc.)
  • Arduino and/or other hardware programming
  • Software development patterns and practices
  • Web analytics and metrics
  • Front-end web development (JavaScript, CSS, etc.)
  • Node.js and other emerging technologies
  • Mobile application development
  • Anything you’re passionate about!

Propose a session or two now—you know you want to!

CouchDB and Domain-Driven Design

I’ve found CouchDB to be a great fit for domain-driven design (DDD). Specifically, CouchDB fits very well with the building block patterns and practices found within DDD. Two of these building blocks are Entities and Value Objects. Entities are objects defined by a thread of continuity and identity. A Value Object “is an object that describes some characteristic or attribute but carries no concept of identity.” Value Objects should be treated as immutable.

Aggregates are groupings of associated Entities and Value Objects. Within an Aggregate, one member is designated as the Aggregate Root. External references are limited to only the Aggregate Root. Aggregates should follow transaction, distribution, and concurrency boundaries. Guess what else is defined by transaction, distribution, and concurrency boundaries? That’s right, JSON documents in CouchDB.

Let’s take a look at an example Aggregate, one representing a blog entry and related metadata. Note that the following UML diagrams are for classes in PHP, but it should be easy enough to translate these examples to any object-oriented programming language. We’ll start with the Entry Entity, which will serve as our Aggregate Root:

-----------------------------------------
|                 Entry                 |
-----------------------------------------
|+ id : string                          |
|+ rev : string                         |
|+ title : Text                         |
|+ updated : Date                       |
|+ authors : Person[*]                  |
|+ content : Text                       |
-----------------------------------------
|+ __construct(entry : array) : void    |
|+ toArray() : array                    |
-----------------------------------------

The Text Value Object:

----------------------------------------------
|                    Text                    |
----------------------------------------------
|- type : string                             |
|- text : string                             |
----------------------------------------------
|+ __construct(type : string, text : string) |
|+ toArray() : array                         |
----------------------------------------------
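
A minimal PHP implementation of this Value Object might look something like the following sketch. The private properties are set only in the constructor, keeping the object immutable once created:

<?php
class Text
{
    private $type;
    private $text;

    public function __construct($type, $text)
    {
        $this->type = $type;
        $this->text = $text;
    }

    // Return an associative array representation for serialization
    public function toArray()
    {
        return array(
            'type' => $this->type,
            'text' => $this->text,
        );
    }
}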

The Date Value Object:

--------------------------------------
|                Date                |
--------------------------------------
|- timestamp : integer               |
--------------------------------------
|+ __construct(timestamp : integer)  |
|+ __toString() : string             |
--------------------------------------

The Person Value Object:

-------------------------------------------------------------
|                           Person                          |
-------------------------------------------------------------
|- name : string                                            |
|- uri : string                                             |
|- email : string                                           |
-------------------------------------------------------------
|+ __construct(name : string, uri : string, email : string) |
|+ toArray() : array                                        |
-------------------------------------------------------------

I recommend serializing each Aggregate, starting with the Aggregate Root, into a JSON document. Control access to Aggregate Roots through a Repository. The toArray() methods above return an associative array representation of each object. The Repository can then transform the array into JSON for storage in CouchDB. Let’s take a look at the EntryRepository:

---------------------------------
|        EntryRepository        |
---------------------------------
|                               |
---------------------------------
|+ get(id : string) : Entry     |
|+ post(entry : Entry) : void   |
|+ put(entry : Entry) : void    |
|+ delete(entry : Entry) : void |
---------------------------------
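
Here’s a rough sketch of how the put() method might serialize an Entry and store it in CouchDB over HTTP. PHP’s stream functions are used to keep the example self-contained; a real implementation would use a proper HTTP client and handle errors, including 409 Conflict responses on concurrent updates. This also assumes that Entry::toArray() includes the document’s _id (and _rev, when updating):

<?php
class EntryRepository
{
    private $databaseUrl;

    public function __construct($databaseUrl)
    {
        $this->databaseUrl = rtrim($databaseUrl, '/');
    }

    public function put(Entry $entry)
    {
        // Serialize the Aggregate's object graph to an associative array
        $document = $entry->toArray();
        $url      = $this->databaseUrl . '/' . rawurlencode($document['_id']);

        // PUT the JSON document to the document's URL in CouchDB
        $context = stream_context_create(array('http' => array(
            'method'  => 'PUT',
            'header'  => "Content-Type: application/json\r\n",
            'content' => json_encode($document),
        )));

        // CouchDB responds with the new revision on success
        return json_decode(file_get_contents($url, false, $context), true);
    }
}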

Here’s an example of what the Aggregate’s object graph might look like, serialized as a JSON document:

{
    "_id": "http://bradley-holt.com/?p=1251",
    "title": {
        "type": "text",
        "text": "CouchDB and Domain-Driven Design"
    },
    "updated": "2011-08-02T15:30:00+00:00",
    "authors": [
        {
             "name": "Bradley Holt",
             "uri": "http://bradley-holt.com/",
             "email": "bradley.holt@foundline.com"
        }
    ],
    "content": {
        "type": "html",
        "text": "<p>I've found CouchDB to be a great fit for…</p>"
    }
}

You can also provide access to CouchDB views through Repositories. In the above example, this could be through the addition of an index(skip : integer, limit : integer) : Entry[*] method to the EntryRepository, sketched below (note that this is a naive pagination implementation, especially on large data sets—but that’s beyond the scope of this blog post). For more complex views, you may want to create a separate Repository for each CouchDB view.
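
Here’s a naive sketch of that index() method, added to the EntryRepository from above. The design document and view names (_design/entries, index) are hypothetical, the view is assumed to emit one row per entry, and error handling is omitted:

<?php
// Inside EntryRepository
public function index($skip = 0, $limit = 10)
{
    $url = $this->databaseUrl
         . '/_design/entries/_view/index'
         . '?include_docs=true'
         . '&skip=' . (int) $skip
         . '&limit=' . (int) $limit;

    // Query the CouchDB view and rebuild Entry objects from the rows
    $result = json_decode(file_get_contents($url), true);

    $entries = array();
    foreach ($result['rows'] as $row) {
        $entries[] = new Entry($row['doc']);
    }

    return $entries;
}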

Addressing the NoSQL Criticism

There were quite a few NoSQL critics at OSCON this year. I imagine this was true of past years as well, but I don’t know that first hand. I think there are several reasons behind the general disdain for NoSQL databases.

First, NoSQL is a horrible name. It implies that there’s something wrong with SQL and that it needs to be replaced with a newer and better technology. If you have structured data that needs to be queried, you should probably use a database that enforces a schema and implements Structured Query Language (SQL). I’ve heard people start redefining NoSQL as “not only SQL”. This is a much better definition and doesn’t antagonize those who use existing SQL databases. An SQL database isn’t always the right tool for the job, and NoSQL databases give us some other options.

Second, there are way too many different types of databases that are categorized as NoSQL. There are document-oriented databases, key/value stores, graph databases, column-oriented databases, in-memory databases, and other database types. There are also databases that combine two or more of these properties. It’s easy to criticize something that is vague and loosely defined. As the NoSQL space matures, we’ll start to get some more specific definitions, which will be much more helpful.

Third, at least one very popular vendor in the NoSQL space has a history of making irresponsible claims about their database’s capabilities. Antony Falco of Basho (makers of Riak) has a great blog post on the topic: It’s Time to Drop the “F” Bomb – or “Lies, Damn Lies, and NoSQL.” If you care about your data, please read Tony’s blog post. It’s unfortunate that the specious claims of a few end up making everyone in the NoSQL space look bad.

I also want to address some of the specific criticisms that I’ve heard of NoSQL, as they apply (or don’t apply) to CouchDB (I’m not familiar enough with other NoSQL databases to talk about those).

SQL Databases Are More Mature

This is absolutely true. If you pick a NoSQL database, you should do your homework and make sure that your database of choice truly respects the fact that writing a reliable database is a very difficult task. Most of the NoSQL databases take the problem very seriously, and try to learn from those that have come before them. But why create a new type of database in the first place? Because an SQL database is not the right solution to every problem. When all you have is a schema, everything looks like a join. The data model in CouchDB (JSON documents) is a great fit for many web applications.

SQL Scales Just Fine

This is also true. If you’re picking a NoSQL database because it “scales”, you’re likely doing it wrong. Scaling is typically more aspiration than reality. There are many other factors to consider and questions to ask when choosing a database technology other than, “does it scale?” If you do actually have to scale, then your database isn’t going to magically do it for you. You can’t abstract scaling problems away to your database layer. However, I will say that many NoSQL databases have properties (such as eventual consistency) that make scaling easier and more intuitive. For example, it’s dead simple to replicate data between CouchDB databases.

Atomicity, Consistency, Isolation, and Durability (ACID)

CouchDB is ACID compliant. Within a CouchDB server, for a single document update, CouchDB has the properties of atomicity, consistency, isolation, and durability (ACID). No, you can’t have transactions across document boundaries. No, you can’t have transactions across multiple servers (although BigCouch does have quorum reads and writes). Not all NoSQL databases are durable (at least with default settings).

If you want the best possible guarantee of durability, you can change CouchDB’s delayed_commits configuration option from true (the default) to false. Basically, this will cause CouchDB to do an explicit fsync after each operation (which is very expensive and slow). Note that operating systems, virtual machines, and hard drives often lie about fsync, so you really need to research more about how your particular system works if you’re concerned about durability. If you think your write speeds are too good to be true, they probably are.
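
For example, delayed commits can be turned off in CouchDB’s local.ini configuration file:

[couchdb]
delayed_commits = false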

If you leave delayed commits on, CouchDB offers the option of setting a batch=ok parameter when creating or updating a document. This will queue up batches of documents in memory and write them to disk when a predetermined threshold has been reached (or when triggered by the user). In this case, CouchDB will respond with an HTTP response code of 202 Accepted, rather than the normal 201 Created, so that the client is informed about the reduced integrity guarantee.
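
For example, here’s what creating a document in batch mode might look like using curl (the database name is hypothetical); CouchDB should respond with 202 Accepted:

$ curl -X POST 'http://127.0.0.1:5984/blog?batch=ok' \
    -H "Content-Type: application/json" \
    -d '{"title": "Hello, world."}'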

Consistency Checks

At least one NoSQL database requires a consistency check after a crash (guess which one). This can be a very slow process, causing additional downtime. CouchDB’s crash-only design and append-only files mean that there is no need for consistency checks. There’s no shutdown process in CouchDB—shutting it down is the same as killing the process.

Compaction

CouchDB’s append-only files do come at a cost. That cost is disk space and the need for compaction. If you don’t compact your database, it will eventually fill up your hard drive. There is no automatic compaction in CouchDB. Compaction is triggered manually (it can easily be automated through a cron job) and should be done when the database’s write load is not at full capacity.
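
For example, compaction can be triggered with a single HTTP request (the database name here is hypothetical):

$ curl -X POST http://127.0.0.1:5984/blog/_compact \
    -H "Content-Type: application/json"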

MapReduce is Limiting and Hard to Understand

It can take some time to get up to speed with MapReduce views in CouchDB. However, it’s not a very difficult concept to understand, and most developers are already proficient with JavaScript (the default language for Map and Reduce functions in CouchDB). There’s a lot you can do with MapReduce, but there are some limitations. Views are one-dimensional, so full-text search and geospatial data are difficult (if not impossible) to index with MapReduce alone. However, there are plugins for integrating with Lucene and ElasticSearch. For geospatial data, you can use GeoCouch.
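
As a quick illustration, here’s a minimal design document (the names are hypothetical) containing a view that indexes entries by their updated date. The map function is JavaScript, and the optional reduce function is omitted:

{
    "_id": "_design/entries",
    "language": "javascript",
    "views": {
        "by_updated": {
            "map": "function (doc) { if (doc.updated) { emit(doc.updated, null); } }"
        }
    }
}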

No Ad Hoc Queries

This is a feature, not a bug. CouchDB only lets you query against indexes. This means that queries in CouchDB will be extremely fast, even on huge data sets. Most web applications have predefined usage patterns and don’t need ad hoc queries. If you need ad hoc queries, say for business intelligence reporting, you can replicate your data (using CouchDB’s changes feed) to an SQL database.
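
For example, here’s what reading a database’s changes feed looks like with curl (the database name is hypothetical); a consumer can resume from the last sequence number it has seen:

$ curl 'http://127.0.0.1:5984/blog/_changes?since=0'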

Building Indexes is Slow

If you have a large number of documents in CouchDB, the first build of an index will be very slow. However, each query after that will be very fast. CouchDB’s MapReduce is incremental, meaning new or updated documents can be processed without needing to rebuild the entire index. In most scenarios, this means that there will be a small performance hit to process documents that are new or updated since the last time the view was queried. You can optionally include the stale=ok parameter with your query. This will instruct CouchDB to not bother processing new or updated documents and just give you a stale result set (which will be faster than processing new or updated documents). As of CouchDB 1.1, you can include a stale=update_after parameter with your query. This will return a stale result set, but will trigger an update of the index (if necessary) after your query results are returned, bringing the index up-to-date for future queries by you or other clients.
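
For example, using the hypothetical view from earlier, a stale query with curl might look like this:

$ curl 'http://127.0.0.1:5984/blog/_design/entries/_view/by_updated?stale=ok'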

No Schema

Some say that not having a schema is a problem. Sure—if you have structured data, you probably want to enforce a schema. However, not all applications have highly structured data. Many web applications work with unstructured data. If you’ve encountered any of the following, you may want to consider a schema-free database:

  • You’ve found yourself denormalizing your database to optimize read performance.
  • You have rows with lots of NULL values because many columns only apply to a subset of your rows.
  • You find yourself using SQL antipatterns such as entity-attribute-value (EAV), but can’t find any good alternatives that fit with both your domain and SQL.
  • You’re experiencing problems related to the object-relational impedance mismatch. This is typically associated with use of an object-relational mapper (ORM), but can happen when using other data access patterns as well.

I’ll add that you can enforce schemas in CouchDB through the use of document update validation functions.
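
As a sketch (the design document name and the rule itself are hypothetical), here’s a validation function, stored in a design document, that rejects any non-deleted document missing a title:

{
    "_id": "_design/entries",
    "validate_doc_update": "function (newDoc, oldDoc, userCtx) { if (newDoc._deleted) { return; } if (!newDoc.title) { throw({forbidden: 'Documents must have a title.'}); } }"
}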

Anything Else?

Did I miss anything? What other criticisms exist of NoSQL databases? Please comment and I’ll do my best to address each.

CouchApps at OSCON 2011

Here are the slides from today’s OSCON presentation on CouchApps with CouchDB, JavaScript & HTML5:

Learning CouchDB at OSCON 2011

Here are the slides from today’s OSCON Data workshop on Learning CouchDB:

Exploring RabbitMQ and PHP

I’m exploring the possibility of using RabbitMQ for an upcoming project. RabbitMQ is a free/open source message broker platform. It uses the open Advanced Message Queuing Protocol (AMQP) standard and is written in Erlang using the Open Telecom Platform (OTP). It promises a high level of availability, throughput, scalability, and portability. Since it is built using open standards, it is interoperable with other messaging systems and can be accessed from any platform.

I’ll be using RabbitMQ first from PHP, but I plan on using it to send and receive messages to and from other systems. Following are the steps I used to get RabbitMQ and PHP’s AMQP extension set up on my development machine.

First, I installed RabbitMQ using MacPorts:

$ sudo port install rabbitmq-server

Then, I started RabbitMQ:

$ sudo rabbitmq-server -detached

Next, I installed the librabbitmq library using a slight variation of the instructions on PHP’s AMQP Installation page (you may need to install Mercurial first):

$ hg clone http://hg.rabbitmq.com/rabbitmq-c/rev/3c549bb09c16 rabbitmq-c
$ cd rabbitmq-c
$ hg clone http://hg.rabbitmq.com/rabbitmq-codegen/rev/f8b34141e6cb codegen
$ autoreconf -i && ./configure && make && sudo make install

Then, I installed the AMQP extension using PECL:

$ sudo pecl install amqp-beta

To test that everything works, I opened up two interactive PHP shells using php -a. I ran the following code in the first PHP shell:

$exchangeName = 'messages';
$routeKey = 'routeA';
$message = 'Hello, world.';

// Connect to the local RabbitMQ server using the default settings
$connection = new AMQPConnection();
$connection->connect();

// Declare the exchange that messages will be published to
$exchange = new AMQPExchange($connection);
$exchange->declare($exchangeName);

I then ran the following code in the second PHP shell:

$exchangeName = 'messages';
$routeKey = 'routeA';

$connection = new AMQPConnection();
$connection->connect();

// Declare a queue (named 'messages', the same as the exchange) and
// bind it to the exchange using the routing key
$queue = new AMQPQueue($connection);
$queue->declare($exchangeName);
$queue->bind($exchangeName, $routeKey);

Back in the first PHP shell:

// Publish the message to the exchange with the routing key
$exchange->publish($message, $routeKey);

Back in the second PHP shell:

// Fetch the next message from the queue
$message = $queue->get();
print_r($message);

Here is the output I got from the print_r statement:

Array
(
    [routing_key] => routeA
    [exchange] => messages
    [delivery_tag] => 1
    [Content-type] => text/plain
    [count] => 0
    [msg] => Hello, world.
)

There are several other options that can be set, and a lot more to learn about RabbitMQ and AMQP. Check out the documentation for PHP’s AMQP extension for details about working with AMQP servers from PHP.