Recap on CAP
The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (all nodes see the same data at the same time)
- Availability (node failures do not prevent survivors from continuing to operate)
- Partition Tolerance (the system continues to operate despite arbitrary message loss)
CAP and NoSQL
Many of the disk based NoSQL implementations was originated from the need to deal with write scalability. This was largely due to the changes in traffic behavior that was mainly a result of the social networking in which most of the content is generated by the users and not by the site owner.
In a traditional database approach achieving data consistency requires synchronous write to disk and distributed transactions (known as the ACID properties).
It was clear that the demand for write scalability would conflict with the traditional approaches for achieving consistency (synchronous write to a central disk and distributed transactions).
The solution to that was: 1) Breaking the centralized disk access through partitioning of the data into distributed nodes. 2) Achieve high availability through redundancy (replication of the data into multiple nodes) 3) Use asynchronous replication to reduce the write latency.
The assumptions behind point 3 above is going to be the center in this specific post.
Graphical representation of the Cap Theorem. (Source)
The Consistency Challenge
One of the common assumptions behind many of the NoSQL implementations is that to achieve write scalability we need to push as many operations on the write-path to a background process in order that we could minimize the time in which a user transaction is blocked on write.
The implication is that with asynchronous write we loose consistency between write and read operations i.e. read operation can return older version then that of write.
There are different algorithms that were developed to address this type of inconsistency challenges, often referred to as Eventual Consistency.
For those interested in more information on that regard i would recommend looking at Jeremiah Peschka post Consistency models in nonrelational dbs. Jeremiah provides a good (and short!) summary of the CAP theorem, Eventual Consistency model and other common principles that comes with it such as (BASE – Basically Available Soft-state Eventually, NRW, Vector clock,..).
Do we really need Eventual Consistency to achieve write scalability?
Before I’ll dive into this topic i wanted to start with quick introduction to the term “Scalability” which is often used interchangeably with throughput. Quoting Steve Haines:
The terms “performance” and “scalability” are commonly used interchangeably, but the two are distinct: performance measures the speed with which a single request can be executed, while scalability measures the ability of a request to maintain its performance under increasing load
(See previous post on that regard: The true meaning of linear scalability)
In our specific case that means that write scalability can be delivered primarily through point 1 and 2 above ( 1-Break the centralized disk access through partitioning of the data into distributed nodes. 2-Achieve high availability through redundancy and replication of the data into multiple nodes) where point 3 ( Use asynchronous replication to those replica’s to avoid the replication overhead on write) is mostly related with write throughput and latency and not scalability. Which bring me to the point behind this post:
Eventual consistency have little or no direct impact on write scalability .
To be more specific my argument is that it is quite often enough to break our data model into partitions (a.k.a shards) and break out from the centralized disk model to achieve write scalability. In many cases we may find that we can achieve sufficient throughput and latency just by doing that.
We should consider the use of asynchronous write algorithms to optimize the write performance and latency but due to the inherited complexity that comes with it we should consider that only after we tried simpler alternative such as using db-shards, FLASH disk or memory based devices.
Achieving write throughput without compromising consistency or scalability
The diagram below illustrates one of the examples by which we could achieve write scalability and throughput without compromising on consistency.
As with the previous examples we break our data into partitions to handle our write scaling between nodes. To achieve high throughput we use in-memory storage instead of disk. As in-memory device tend to be significantly faster and concurrent then disk and since network speed is no longer a bottleneck we can achieve high throughput and low latency even when we use synchronous write to the replica.
The only place in which we’ll use asynchronous write is the write to the long-term-storage (disk). As the user transaction doesn’t access the long-term storage directly through the read or write path, they are not exposed to the potential inconsistency between the memory storage and the long-term storage. The long-term storage can be any of the disk based alternatives starting from a standard SQL databases ending with any of the existing disk based NoSQL engines.
The other benefit behind this approach is that it is significantly simpler. Simpler not just in terms of development but simpler to maintain compared with the Eventual Consistency alternatives. In case of distributed system simplicity often correlate with reliability and deterministic behavior.
It is important to note that in this post i was referring mostly to the C in CAP and not CAP in its broad definition. My points was not to say don’t use solution that are based on CAP/EventualConsistency model but rather to say don’t jump on Eventual Consistency based solutions before you considered the implications and alternative approaches. There are potentially simpler approaches to deal with write scalability such as using database shards, or In-memory-data-grids.
As were reaching the age of Terra-Scale devices such as Cisco UCS where we can achieve huge capacity of memory, network and compute power in a single box the area’s in which we can consider to put our entire data in-memory get significantly broader as we can easily store Terra bytes of data in just few boxes. The case of Foursquare’s MongoDB Outage is interesting on that regard. 10gen’s CEO Dwight Merrimanargued that the entire set of data actually needs to be served completely in-memory:
For various reasons the entire DB is accessed frequently so the working set is basically its entire size Because of this, the memory requirements for this database were the total size of data in the database. If the database size exceeded the RAM on the machine, the machine would thrash, generating more I/O requests than the four disks could service.
It is a common misconception to think that putting part of the data in LRU based cache ontop of a disk based storage could yeild better performance as noted in the sanford research The Case for RAM Cloud
..even a 1% miss ratio for a DRAM cache costs a factor of 10x in performance. A caching approach makes the deceptive suggestion that “a few cache misses are OK” and lures programmers into con-figurations where system performance is poor..
In that case using pure In-Memory-Data-Grid as a front end and disk based storage as long term storage could potentially work better and with significantly lower maintenance overhead and higher determinism. The capacity of data in this specific case ( <100GB) shouldn’t be hard to fit into single UCS box or few of the EC2 boxes.