When it comes to implementing a CouchDB installation, do you roll your own, or go with a service that provides a hosted version of the database? We’re going to look at some of the technologies present in CouchDB that can greatly influence that decision.
Knowing Your Technology
Before deciding which type of hosting to use, it’s vital that you clearly understand the technology in play. For an upper-management suit without technical experience, the situation might appear simple: with hosted plans, the company doing the hosting does most of the work. But that’s a gross over-simplification: the software developers have to create their code so that it properly scales, while you, the member of management, need a basic understanding of the technology so you can ask the right questions of the hosting companies. Let’s look at the technology first, and see if we can draw some conclusions.
In years past, smaller companies might have had an internal database server or two, and then an IT guy doing backups. Then the option came along to have software and databases hosted by an outside hosting company. So the business managers had a decision to make: Pay an outside company, which would provide the servers and staff, or maintain internal servers and pay an inside staff. If they decided to move to a Web host, they might move their database to the host—but there would likely have been a backup somewhere else.
The database systems were already powerful enough and would have error-checking and recovery; if something catastrophic happened, it was a matter of going to the backups. After the catastrophe occurred and before the database was restored, the system would be down. To prevent that downtime, the system could use replication, wherein data would be duplicated across more than one server; if one went down, the other server would pick up the extra work.
Of course, this added complexity, because now the two servers had to be identical: if data was added to one, it would have to be added to the other as well. The big database manufacturers included software to help manage the replication. And ultimately this was another decision for the suits: Did they want a single server and the risk of downtime, or would they pay extra for replicated servers? (Having lived through those years, I can tell you that the suits would often look at the bottom line of dollars and cents; in their opinion, furious customers and downed servers weren’t necessarily a bad thing if costs were down as well.) In other words: Either pay more money, or pay in the form of having to deal with angry customers and employees.
Whether to replicate servers wasn’t the only issue facing your typical IT shop: there was also the not-so-tiny problem of servers running out of space or simply overloading. While replicating servers can help with the overload problem, the problem of running out of space can be handled by dividing a single database among multiple servers. This is called sharding. It works well, but it also opens up many more technical issues. For example, a single query might need results from more than one shard, and the database engine needs to retrieve them and present them back as a single result set. The servers need to be able to handle all the different situations that sharding presents.
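To make the sharding idea concrete, here is a minimal Python sketch. It is not how CouchDB, BigCouch, or any particular product shards data; the `ShardedStore` class, the in-memory dictionaries standing in for servers, and the hash-modulo placement rule are all assumptions made for illustration. It shows the two halves of the problem described above: routing each document to one shard, and merging matches from every shard into a single result set.

```python
import hashlib


class ShardedStore:
    """Toy document store split across several in-memory 'servers' (plain dicts)."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, doc_id):
        # A stable hash means the same ID always lands on the same shard.
        digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, doc_id, doc):
        self._shard_for(doc_id)[doc_id] = doc

    def get(self, doc_id):
        # A lookup by ID touches exactly one shard.
        return self._shard_for(doc_id).get(doc_id)

    def query(self, predicate):
        # Scatter the query to every shard, then gather and merge the matches
        # into one result set sorted by document ID.
        matches = []
        for shard in self.shards:
            matches.extend(
                (doc_id, doc) for doc_id, doc in shard.items() if predicate(doc)
            )
        return sorted(matches, key=lambda pair: pair[0])


store = ShardedStore(shard_count=3)
for i in range(10):
    store.put(f"order-{i}", {"total": i * 10})

big_orders = store.query(lambda doc: doc["total"] >= 70)
print([doc_id for doc_id, _ in big_orders])
```

The caller never sees which shard held which order; that is the "single result set" behavior a sharded database engine has to provide. Real systems also have to cope with shards being added, removed, or unreachable, which this sketch ignores.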
In the database world, one well-known problem is that a distributed computer system can simultaneously offer only two of the following three items:
Consistency: Where you ensure that all your servers are identical.
Availability: Where you make sure your data is always available.
Partition tolerance: This one seems to have a couple of definitions, depending on who’s talking. The CouchDB people say that it means the data can be safely divided among multiple servers. But most other sources say it means that, while the data is distributed, a portion of the system can go down and the rest of the system will continue to operate. Either way, the point is that the data is distributed.
This so-called “CAP theorem” is based on ideas offered by Eric Brewer in 2000. If you demand consistency, then reads will have to wait until all nodes are consistent, which affects availability, and so on.
Think back to what I said earlier about data being replicated between servers. Suppose you have two servers that you intend to host identical data; as part of the consistency requirement, that data needs to be inserted into both servers. The CAP theorem says we can pick two out of three: In the case of CouchDB, you get partition tolerance and high availability; but on the consistency front, CouchDB’s designers can only offer what’s known as “eventual consistency,” where the servers are eventually synchronized.
Suppose your data is spread out across the planet, but its users need quick access to all of it at any time. If data is inserted into a CouchDB server in California, the same data will be replicated in a server in India—eventually. But CouchDB doesn’t guarantee it will be inserted in the second one immediately.
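The California and India scenario can be sketched in a few lines of Python. This is only an illustration of the idea of eventual consistency, not CouchDB's replication protocol: the `Replica` class and its `outbox` queue are hypothetical names, replication runs in one direction only, and conflict handling is left out entirely.

```python
from collections import deque


class Replica:
    """Toy server that accepts writes locally and ships them to a peer later."""

    def __init__(self, name):
        self.name = name
        self.docs = {}
        self.outbox = deque()  # changes not yet sent to the peer

    def write(self, doc_id, doc):
        # The write succeeds immediately on this server (availability)...
        self.docs[doc_id] = doc
        # ...and is queued for the other server instead of being sent right away.
        self.outbox.append((doc_id, doc))

    def replicate_to(self, peer):
        # Drain the queue; once this finishes, the peer has caught up.
        while self.outbox:
            doc_id, doc = self.outbox.popleft()
            peer.docs[doc_id] = doc


california = Replica("california")
india = Replica("india")

california.write("book-1", {"title": "Example"})
print(india.docs.get("book-1"))   # not there yet: the servers disagree

california.replicate_to(india)
print(india.docs.get("book-1"))   # now both servers hold the same document
```

The window between the write and the call to `replicate_to` is the "eventually" in eventual consistency. A reader in India during that window simply does not see the new document.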
The term “eventual consistency” seems to turn a lot of people off about databases like CouchDB. Indeed, there are a lot of cases where eventual consistency really is a bad thing, such as real-time financial data. But that’s why we choose the database based on the needs of the system, and don’t automatically go for a NoSQL system just because it’s the latest and greatest cool thing.
When is eventual consistency acceptable? Look at Amazon.com, for example. If a seller posts a book on Amazon.com, and the associated Webpage becomes immediately available to people in one geographic region, but not until several seconds later for people on the other side of the world, it doesn’t make much difference in the long run. In a real-world context, “eventually” usually means seconds or minutes—not hours or days.
Of course, you have the final say: Does your application require immediate consistency or not?
Today’s database systems handle replication, sharding, and other such features just fine. But a business manager will need to understand what the terms mean and why they’re important when making a decision on what to purchase: servers in-house, hosted servers in a cloud where the IT staff manage the databases, or a fully-managed cloud-based service.
In this CouchDB scenario, we’re mainly looking at the two hosted options, which I’ll call self-serve and fully-managed, and it comes down to this: With self-serve hosting, the business manager needs to know how many virtual servers in the cloud need to be purchased (thus the cost). With fully-managed, there’s usually a flat rate per hour or month.
Technological Differences between Cloudant and CouchDB
Let’s compare CouchDB with Cloudant from a technical perspective. Cloudant, which fully manages CouchDB databases, forked the CouchDB code and created their own cloud-ready form of CouchDB, which they called BigCouch. While based on CouchDB, and mostly compatible with CouchDB, it was a separate system.
Cloudant has served as something of a database testbed for the past five years, building advanced clustering features. Now those features are going to be integrated into the standard CouchDB distribution, which means that, from a clustering standpoint, the two products will be essentially the same.
(One big feature Cloudant offered was integration with Lucene for full-text searches. From my own experience, this is a great feature, because the Lucene searches can be extremely fast. And while it’s technically not a feature of CouchDB, there are open-source and free Lucene add-ins for CouchDB. So this really isn’t something that Cloudant offers that isn’t available elsewhere.)
CouchDB vs. Couchbase
While Cloudant was busily adding scalability features to CouchDB, another group was working on a related tangent, resulting in a product called Couchbase Server. Couchbase Server is the result of merging CouchDB with another platform called MemBase. The company that creates Couchbase Server is called Couchbase, Inc.
Couchbase Server, like CouchDB, is freely available as an open-source product. However, there’s also a commercially oriented Enterprise Edition, which some have argued is “not entirely open-source.”
Damien Katz, who originally created CouchDB, also started Couchbase. He termed Couchbase “the successor” to CouchDB, and it’s meant to compensate for the latter’s lack of modern-day scalability features. The same people who developed memcached also developed MemBase, which is why working with Couchbase is very much like working with memcached.
As a result, we have two “children” of the original CouchDB: Cloudant’s BigCouch additions to CouchDB, and Couchbase, with Couchbase being a child of not just CouchDB but also a marriage with Memcached.
Comparing the original CouchDB to Couchbase is simple: Couchbase offers higher performance and better scalability than the original CouchDB. But now performance and scalability are being integrated into CouchDB via Cloudant’s offerings; as a result, there are just two big differences between Couchbase and the newest CouchDB with BigCouch added in:
First: CouchDB has an HTTP interface, whereas Couchbase does not. Instead, Couchbase uses a protocol based on Memcached. (But you still have client libraries for developing for Couchbase, as you do with CouchDB.) There’s also a RESTful API available for Couchbase, but it can’t be used for data read and write operations; it’s mainly for overall management operations.
Second: Relating to the CAP theorem, Couchbase moves away from the A part (Availability) and towards the C part (Consistency).
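To show what the HTTP interface in the first difference looks like in practice, here is a small Python sketch that builds (but does not send) the two most basic CouchDB requests: a PUT to store a JSON document and a GET to read it back. The database name `books`, the document ID `book-1`, and the helper function names are made up for illustration, and the base URL assumes a CouchDB server on the same machine at the default port 5984.

```python
import json
import urllib.request

# Assumption: a CouchDB server running locally on its default port.
BASE_URL = "http://localhost:5984"


def build_put_document(db, doc_id, doc):
    """Build the request that stores a JSON document at /{db}/{doc_id}."""
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{db}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def build_get_document(db, doc_id):
    """Build the request that reads the same document back."""
    return urllib.request.Request(url=f"{BASE_URL}/{db}/{doc_id}", method="GET")


put_req = build_put_document("books", "book-1", {"title": "Example"})
get_req = build_get_document("books", "book-1")

print(put_req.get_method(), put_req.full_url)
print(get_req.get_method(), get_req.full_url)

# Sending a request needs a running server and an existing "books" database:
#     with urllib.request.urlopen(put_req) as response:
#         print(response.read().decode("utf-8"))
```

Because every operation is an ordinary HTTP request against a URL, any language or tool that can speak HTTP can talk to CouchDB without a special driver. With Couchbase, by contrast, data reads and writes go through a client library that speaks the memcached-based protocol.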
Managing it Yourself vs. Letting Others Manage it
Ultimately, the technical differences between what you can get out of a self-hosted plan and a managed plan are going away. Even before the big merge, the BigCouch code was available as an open-source package on GitHub, and you were free to install it; you could host BigCouch yourself, without Cloudant’s help. What it really comes down to is People Power: do you want to pay to have Cloudant host it or not? Similarly, you have the option to install Couchbase and do it yourself (or pay a company like IrisCouch to manage it all for you).
One other thing: Cloudant is hosting their BigCouch installations in the cloud, on any of several cloud hosts, including Amazon EC2. So you could pay Cloudant, or you could save a step and allocate the cloud servers yourself. (And if you have your own datacenter, with exclusive access to your own drives and backups, all this is moot.)
My Own Personal Feelings
Because we’re a small firm, we decided to go with Cloudant. We found no real cost savings in running our own servers on a host such as Amazon EC2 or Rackspace, and this way we know there’s a team of people monitoring the servers and making sure nothing goes wrong.
If you’re going to be self-hosting—unless you’re working on a really small system—don’t use the basic CouchDB for anything. If you want scalability, either go with Couchbase or BigCouch, or wait until Cloudant’s BigCouch merger into CouchDB is officially available.
If you’re self-hosting and using the scalability features of BigCouch or Couchbase, the process should go smoothly—but take the time to learn what you’re doing. If you’re not familiar with sharding and replication, learn it. This isn’t a matter of just installing a database and letting it rip.
If you’re self-hosting and want happy customers and few headaches, don’t just install a single server and a single instance. Use the scalability features.
If you’re planning to self-host, and don’t want to take the time to master the scalability features (or understand them but don’t want to deal with them), then seriously consider a managed hosting solution.
Regardless of which managed hosting service you go with, make sure they fully understand scalability, sharding, replication, and all that—and make sure they’re actually doing it. The last thing you need is to go with a host only to find that they’re simply running a single server, and you discover this after they have a catastrophe and your data is lost and your customers are angry.
Image: Duc Dao/Shutterstock.com