Disks are not reliable!
Recent analysis of disk failure patterns (see references below) break the disk reliability assumption completely. Interestingly enough, even though these studies were published in 2007, most people I speak to aren’t aware of them at all. In fact, I still get comments from people who are basing their reliability solution on disk-based persistency, insisting on the argument that whether or not you are reliable amounts to whether or not your data is synched to a disk.
Below are some of the points I gathered from the research, which should surprise most of us:
- Actual disk failure/year is 3% (vs. estimates of 0.5 – 0.9%) – this is a 600% difference on reported vs. actual disk failure. The main reason is that manufacturers can’t tests their disks for years, so their tests are an extrapolation of statistical behavior which proves to be completely inaccurate.
- There is NO correlation between failure rate and disk type – whether it is SCSI, SATA, or fiber channel. Most data centers are based on the assumption that investing in high-end disks and storage devices will increase their reliability – well, it turns out that high-end disks exhibit more or less the same failure patterns as regular disks! John Mitchell had an interesting comment on this matter during our Qcon talk, when someone pointed to RAID disks as their solution for reliability. John said that since RAID is based on an exact H/W replica that lives in the same box, there is a very high likelihood that if a particular disk fails, its replica will fail in the same way. This is because they all have the exact same model, handle the exact same load and share the same power/temperature.
- There is NO correlation between high disk temperature and failure rates – I must admit that this was a big surprise to me, but it turned out that there is no correlation between disk temperature and disk failure. In fact, most failures happen when disks were at low or average temperature.
Why are existing database clustering solutions so breakable?
Most existing database clustering solutions rely on a shared disk storage to maintain their cluster state, as can be seen in the diagram below.
Typical database clustering architecture based on shared storage
The core, implicit assumption behind Oracle RAC, IBM DB2, and many other database clustering solutions, is that failure can be avoided by purchasing high-end disk storage and using expensive hardware (fiber optics, etc). As can be seen from the research I mentioned earlier, this core assumption doesn’t correlate with the failure statistics. Hence I argue that the database clustering model is inherently breakable.
Failure is inevitable – cope with it
Jason McHugh, a senior engineer at Amazon who works on S3, gave a very interesting talk at the Qcon conference (Amazon S3: Architecting For Resieliency In The Face Of Failures), where he outlined the service failure handling architecture of Amazon’s Simple Storage. Jason’s main point was that rather than trying to prevent failure from happening, you should assume that failure is inevitable and instead, design your system to cope with it.
Failure can happen in various shapes and forms:
- Expect drives to fail
- Expect network connections to fail (independent of the redundancy in networking)
- Expect a single machine to go out
Human error example: Failure to use caution while pushing a rack of servers
How to design your application for resilience in the face of failure
Jason outlined some of the lessons from his experience at Amazon:
- Expect and tolerate failures – build your infrastructure to cope with catastrophic failure scenarios: “Distributed systems will experience failures, and we should design systems which tolerate rather than avoid failures. It ends up being less important to engineer components for low failure rates (MTBF) than it is to engineer components with rapid recovery rates (MTTR).” (from Architecture Summit 2008 – Write up)
- Code for large-scale failures – design your application to cope with the fact that services can come and go.
- Expect and handle data and message corruption – don’t rely only on TCP to ensure the consistency of your data. Add validation as part of your application code to make sure that messages are not corrupted.
- Code for elasticity – load/demand may grow unexpectedly. Elasticity is essential to dealing with unexpected demand without bringing the system to its knees. It is also important that when you exceed your capacity, the system won’t crash but instead report denial of service.
- Monitor, extrapolate and react – monitoring at the right level enables you to detect failures before they happen and take the appropriate corrective actions.
- Code for frequent single machine failures – rack failure can cause an immediate disappearance of multiple machines at the same time.
- Game days – simulate failures and learn how to continuously improve your operational procedure for handling failures and architecture.
Memory can be more reliable then disk
Many people assumes that memory is an unreliable data storage. That assumption holds true if your data “lives” on a single machine; in this case if the machine fails or crashes your application crashes. But what if you distribute the data across a cluster of nodes and maintain more than one copy of the data over the network? In this case, if a node crashes the data is not gone; it lives elsewhere and can be continuously served from one of its replicas.
Under these conditions, I can argue that under certain conditions, memory can be more reliable than disk. One of the reason is the hardware itself – unlike disk drives that rely on mechanical moving parts, RAM is just a silicon chip. As such the chances for failure can be significantly smaller. Unfortunately, I don’t have the data to back up this argument, but I think it’s very likely that this is the case. In an article published on InfoQ titled RAM is the new disk, Tim Bray brings an interesting observation on this matter:
Memory is the new disk! With disk speeds growing very slowly and memory chip capacities growing exponentially, in-memory software architectures offer the prospect of orders-of-magnitude improvements in the performance of all kinds of data-intensive applications. Small (1U, 2U) rack-mounted servers with a terabyte or more or memory will be available soon, and will change how we think about the balance between memory and disk in server architectures. Disk will become the new tape, and will be used in the same way, as a sequential storage medium (streaming from disk is reasonably fast) rather than as a random-access medium (very slow). Tons of opportunities there to develop new products that can offer 10x-100x performance improvements over the existing ones.
As a matter of fact we (GigaSpaces/Cisco) are working these very days on benchmarking Cisco UCS, where we are evaluating the possibility to manage x100GB and potentially even Terabytes of data in a single node. The results look extremely promising, so the future is not that far ahead.