A few weeks ago, I wrote a post describing the drive behind the demand for a new form of database alternatives, often referred to as NOSQL. A few weeks ago during my Qcon presentation, I went through the patterns of building a scalable twitter application, and obviously one of the interesting challenges that we discussed is the database scalability challenge. To answer that question I tried to draw the common pattern behind the various NOSQL alternatives, and show how they address the database scalability challenge. In this post I’ll try to outline these common principles.
The Common Principles Behind the NOSQL Alternatives
Assume that Failure is Inevitable
Unlike the current approach where we try to prevent failure from happening through expensive HW, NOSQL alternatives were built with the assumption that disks, machines, and networks fail. We need to assume that we can’t prevent these failures, and instead, design our system to cope with these failures even under extreme scenarios. Amazon S3 is a good example in that regard. You can find a more detailed description in my recent post Why Existing Databases (RAC) are So Breakable!. There I outlined some of the lessons on how to architect for failures, from Jason McHugh‘s presentation. (Jason is a senior engineer at Amazon who works on S3.)
Partition the Data
By partitioning the data, we minimize the impact of a failure, and we distribute the load for both write and read operations. If only one node fails, the data belonging to that node is impacted, but not the entire data store.
Keep Multiple Replicas of the Same Data
Most of the NOSQL implementations rely on hot-backup copies of the data, to ensure continuous high availability. Some of the implementations provide you with a way to control it at the API level, i.e. when you store an object, you can specify how many copies of that data you want to maintain at the granularity of an object level. With GigaSpaces, we are also able to fork a a new replica to an alternate node immediately, and even start a new machine if it is required. This enables us to avoid the need to keep many replicas per node, which reduces the total amount of storage and therefore cost associated with it.
You can also control whether the replication should be synchronous or asynchronous, or a combination of the two. This determines the level of consistency, reliability and performance of your cluster. With synchronous replication, you get guaranteed consistency and availability at the cost of performance (a write operation followed by a read operation is guaranteed to return the same version of the data, even in the case of a failure). The most common configuration with GigaSpaces, is synchronous replication to the backup, and asynchronous to the backend storage.
In order to handle the continuous growth of data, most NOSQL alternatives provide a way of growing your data cluster, without bringing the cluster down or forcing a complete re-partitioning. One of the known algorithms that is used to deal with this, is called consistent hashing. There are various algorithms implementing consistent hashing.
One algorithm notifies the neighbors of a certain partition, that a node joined or failed. Only those neighbor nodes are impacted by that change, not the entire cluster. There is a protocol to handle the transitioning period while the re-distribution of the data between the existing cluster and the new node takes place.
Another (and significantly simpler) algorithm uses logical partitions. With logical partitions, the number of partitions is fixed, but the distribution of partitions between machines is dynamic. So for example, if you start with two machines and 1000 logical partitions, you have 500 logical partitions per machine. When you add a third machine, you have 333 partitions per machine. Since logical partitions are lightweight (they are basically a hash table in-memory), it is fairly easy to distribute them.
The advantage of the second approach is that it is fairly predictable and consistent, whereas with the consistent hashing approach, the distribution between partitions may not be even, and the transition period when a new node joins the network can take longer. A user may also get an exception if the data that he is looking for is under transition. The downside of the logical partitions approach, is that the scalability is limited to the number of logical partitions.
For more details in that regard, I recommend reading Ricky Ho’s post entitled NOSQL Patterns.
This is an area where there is a fairly substantial difference between the various implementations.The common denominator is a key/value matching, as in a hash table. Some implementations provide more advanced query support, such as the document-oriented approach, where data is stored as blobs, with an associated list of key/value attributes. In this model you get a schema-less storage that makes it easy to add/remove attributes from your documents, without going through schema evolution etc. With GigaSpaces we support a large portion of SQL. If the SQL query doesn’t point to a specific key, the query is mapped to a parallel query to all nodes, and aggregated at the client side. All this happens behind the scene and doesn’t involve user code.
Use Map/Reduce to Handle Aggregation
Map/Reduce is a model that is often used to perform complex analytics, that are often associated with Hadoop. Having said that, it is important to note that map/reduce is often referred to as a pattern for parallel aggregated queries. Most of the NOSQL alternatives do not provide built-in support for map/reduce, and require an external framework to handle these kind of queries. With GigaSpaces, we support map/reduce implicitly as part of our SQL query support, as well as explicitly through an API that is called executors. With this model, you can send the code to where the data is, and execute the complex query directly on that node.
For more details in that regard, I recommend reading Ricky Ho’s post entitled Query Processing for NOSQL DB.
Disk-Based vs. In-Memory Implementation
NOSQL alternatives are available as a file-based approach, or as an in-memory-based approach. Some provide a hybrid model that combines memory and disk for overflow. The main difference between the two approaches comes down mostly to cost/GB of data and read/write performance.
An analysis done recently by Stanford University, called “The Case for RAMCloud” provides an interesting comparison between the disk and memory-based approaches, in terms of cost performance. In general, it shows that cost is also a function of performance. For low performance, the cost of the disk is significantly lower the RAM-based approach, and with higher performance requirements, the RAM becomes significantly cheaper.
The most obvious drawbacks of RAMClouds are high cost per bit and high energy usage per bit. For both of these metrics RAMCloud storage will be 50-100x worse than a pure disk-based system and 5-10x worse than a storage system based on flash memory (see  for sample configura
tions and metrics). A RAMCloud system will also require more floor space in a datacenter than a system based on disk or flash memory. Thus, if an application needs to store a large amount of data inexpensively and has a relatively low access rate, RAMCloud is not the best solution.
However, RAMClouds become much more attractive for applications with high throughput requirements. When measured in terms of cost per operation or energy per operation, RAMClouds are 100-1000x more efficient than disk-based systems and 5-10x more efficient than systems based on flash memory. Thus for systems with high throughput requirements a RAM-Cloud can provide not just high performance but also energy efficiency. It may also be possible to reduce RAMCloud energy usage by taking advantage of the low-power mode offered by DRAM chips, particularly during periods of low activity. In addition to these disadvantages, some of RAM-Cloud’s advantages will be lost for applications that require data replication across datacenters. In such environments the latency of updates will be dominated by speed-of-light delays between datacenters, so RAM-Clouds will have little or no latency advantage. In addition, cross-datacenter replication makes it harder for RAMClouds to achieve stronger consistency as described in Section 4.5. However, RAMClouds can still offer exceptionally low latency for reads even with cross-datacenter replication.
Is it Just Hype?
One of the most common questions that I get this days is: “Is all this NOSQL just hype?”, or “Is it going to replace current databases?”
My answer to these questions is that the NOSQL alternatives didn’t really start today. Many of the known NOSQL alternatives have existed for more than a decade, with lots of successful references and deployments. I believe that there are several reasons why this model has become more popular today. This first is related to the fact that what used to be a niche problem that only a few fairly high-end organizations faced, became much more common with the introduction of social networking and cloud computing. Secondly, there was the realization that many of the current approaches could not scale to meet demand. Furthermore, cost pressure also forced many organizations to look at more cost-effective alternatives, and with that came research that showed that distributed storage based on commodity hardware can be even more reliable then many of the existing high end databases. (You can read more on that here.) All of this led to a demand for a cost effective “scale-first database”. I quote James Hamilton, Vice President and Distinguished Engineer on the AWS team, from one of his articles One Size Does Not Fit All:
“Scale-first applications are those that absolutely must scale without bound and being able to do this without restriction is much more important than more features. These applications are exemplified by very high scale web sites such as Facebook, MySpace, Gmail, Yahoo, and Amazon.com. Some of these sites actually do make use of relational databases but many do not. The common theme across all of these services is that scale is more important than features and none of them could possibly run on a single RDBMS”
So to sum up – I think that what we are seeing is more of a realization that existing SQL database alternatives are probably not going away any time soon, but at the same time they can’t solve all the problems of the world. Interestingly enough the term NOSQL has now been changed to Not Only SQL, to represent that line of thought.