In the past few weeks I’ve had many interesting discussions on our new Elastic Middleware and how it fits into the current enterprise world. One of the interesting comments that was made is that in order to make use of elasticity in general you need to have a spare pool of resources — while the reality is that most enterprises don’t have many spare resources in their data center. Moreover, taking a mission-critical application that needs to provide deterministic behavior and moving it around in a production environment can be a scary idea for the operations team.
I think that the best way to answer these concerns is through an example. Here I describe how a top firm managed to maintain the high availability of its production application under an extreme failure scenario. The purpose of the example is to illustrate that elasticity applies to many areas where complex procedures must be automated. It is used to ensure that our application meets a given SLA on a continuous basis, while providing enough control points so that you can control where and how those automated functions happen.
The Application: Real-Time Web Application
The application is a real-time web application that enables users to subscribe to different news items in real time. It targets millions of subscribers and has to manage hundreds of gigabytes of data. The application uses a fairly standard stack: a load balancer and web container (Tomcat in this specific case) at the front, and a GigaSpaces In-Memory-Data-Grid front-ending an Oracle database at the back. The application uses the In-Memory-Data-Grid for managing session high availability and as the system of record that enables fetching the data at memory speed for the company’s online subscribers.
The Challenge: Ensuring Application Continuous Availability During Planned and Unplanned Downtime
To ensure the system’s continuous high availability, it needs to tolerate both planned and unplanned downtime scenarios including the extreme case of a complete data center failure.
Planned downtime happens as part of an organization’s standard procedure for hardware refresh as well as a preemptive measure to clean their system. Unplanned downtime, as the name suggests, can happen anytime, anywhere. An extreme scenario would be a complete failure of the data center.
Ensuring High Availability During Planned Downtime
The following steps ensure high availability of the data during planned downtime:
- Bring the target machine down.
- Down-scale the system to the currently available set of machines to ensure that existing users connected to the system will continue to be served.
- Launch a new alternative machine.
- Synchronize the machine with the existing live cluster: While the machine was down, the system continued to serve user requests; so before we bring a new machine to the pool we need to ensure that it contains the current state of the cluster.
- Re-balance the resources so that the load is spread with the newly available machine.
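The steps above can be sketched in a few lines of Python. This is only an illustrative model, not the GigaSpaces implementation: the cluster is a plain dictionary of machine name to partition ids, and the names `spread`, `replace_machine`, `m1` to `m4` are hypothetical. Synchronization is represented only by a comment, since it depends on the actual data grid.

```python
def spread(partitions, machines):
    """Assign partitions round-robin so machine loads differ by at most one."""
    placement = {m: [] for m in machines}
    for i, p in enumerate(sorted(partitions)):
        placement[machines[i % len(machines)]].append(p)
    return placement


def replace_machine(placement, target, replacement):
    """Return the placement after down-scaling and after re-balancing."""
    partitions = [p for parts in placement.values() for p in parts]

    # Steps 1-2: take the target down and down-scale onto the survivors,
    # so every partition stays served while the machine is away.
    survivors = [m for m in placement if m != target]
    if not survivors:
        raise ValueError("cannot take down the only machine in the cluster")
    downscaled = spread(partitions, survivors)

    # Steps 3-4: launch the replacement. It joins empty; in a real grid it
    # would first copy the current state from the live cluster.
    # Step 5: re-balance across the full set, including the new machine.
    rebalanced = spread(partitions, survivors + [replacement])
    return downscaled, rebalanced


cluster = spread(range(6), ["m1", "m2", "m3"])
downscaled, rebalanced = replace_machine(cluster, "m2", "m4")
```

Even in this toy form, every phase has to leave all six partitions assigned to some live machine, which is exactly the invariant that is easy to break when the steps are carried out by hand.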
One thing that immediately pops up when I look at all those steps is that the fact that we’re dealing with planned downtime doesn’t make the actual recovery procedure any simpler. In fact, it is exactly the same procedure as with unplanned downtime. The only difference is that we can control when it happens and be there in case something goes wrong. It is clear that if we need to follow all those steps manually, the chances that something will go wrong are fairly high. In addition, if we have lots of applications in our data center that need to follow this procedure, then the cost and complexity of following it through will be extremely high, up to the point where it becomes unmanageable.
Automating Planned Downtime Recovery Using the Elastic Middleware
With elastic middleware, an agent runs on each machine and reports the machine’s availability and its state. In addition, we can interact with the machine and instruct it to do almost anything we want through an API. In our specific case, when we bring a new machine into the system we detect its availability and then relocate the relevant resources from the existing set of machines onto the new machine in order to bring the system back to its normal state.
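To make the detect-then-relocate flow concrete, here is a minimal sketch. It does not use the GigaSpaces Admin API; the `ClusterManager` class and its `on_machine_available` callback are hypothetical stand-ins for whatever hook the agent's availability report would trigger.

```python
class ClusterManager:
    """Toy model: tracks which resources run on which machine."""

    def __init__(self, placement=None):
        # machine name -> list of resource ids (hypothetical structure)
        self.placement = placement or {}

    def on_machine_available(self, machine):
        """Called when an agent reports a new machine.

        Moves resources from the busiest machines onto the newcomer until
        no machine holds more than one resource above the newcomer.
        Returns the list of (resource, source, destination) moves.
        """
        self.placement.setdefault(machine, [])
        moves = []
        while True:
            busiest = max(self.placement, key=lambda m: len(self.placement[m]))
            gap = len(self.placement[busiest]) - len(self.placement[machine])
            if gap <= 1:
                break
            resource = self.placement[busiest].pop()
            self.placement[machine].append(resource)
            moves.append((resource, busiest, machine))
        return moves


manager = ClusterManager({"m1": ["a", "b", "c"], "m2": ["d", "e", "f"]})
moves = manager.on_machine_available("m3")
```

Returning the list of moves, rather than only mutating state, is a deliberate choice in this sketch: it gives operations a record of what was relocated and where.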
Ensuring Continuous High Availability During a Disaster
In a post-9/11 world, every serious enterprise maintains a disaster recovery site to ensure its continuous availability in case of a disaster failure, where a complete data center could become out of reach for a long period of time.
The challenge for the enterprise in our specific case was not just to ensure complete recovery, but to do so while the application is running, with no — or with only minimal — hiccups.
In this case, we had the machines of both the disaster recovery site and the primary site connected through a 10 msec latency network as if they were a single cluster. To deal with the latency, we grouped each site into what we refer to as zones. With zones, we can split an entire cluster into sub-clusters and apply policies only to a specific group based on their zone affinity. In this case, we used the elastic middleware to ensure that all backup and primary nodes would be evenly distributed between the two zones. If one of the zones went down, the surviving site would turn its backup nodes into primaries and would continue to serve the entire load of currently active users, so the users would hardly feel any hiccups.
As soon as the failed site comes back online, the elastic middleware re-balances the resources and brings them back to their previous state.
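The zone behavior described above can be modeled roughly as follows. This is a sketch under simplifying assumptions: exactly two zones, one primary and one backup per partition, and hypothetical zone names `primary-site` and `dr-site`. The functions `assign` and `fail_zone` are illustrative and are not part of any GigaSpaces API.

```python
def assign(partitions, zones=("primary-site", "dr-site")):
    """Alternate primaries between two zones; each backup lives in the other."""
    a, b = zones
    return {
        p: {"primary": (a, b)[i % 2], "backup": (b, a)[i % 2]}
        for i, p in enumerate(partitions)
    }


def fail_zone(layout, failed):
    """Promote backups to primaries for partitions that lost their primary."""
    after = {}
    for p, roles in layout.items():
        if roles["primary"] == failed:
            # The surviving backup takes over; no backup remains for now.
            after[p] = {"primary": roles["backup"], "backup": None}
        elif roles["backup"] == failed:
            after[p] = {"primary": roles["primary"], "backup": None}
        else:
            after[p] = dict(roles)
    return after


layout = assign(["p0", "p1", "p2", "p3"])
degraded = fail_zone(layout, "primary-site")  # the DR site now serves everything
restored = assign(list(degraded))             # site is back: re-balance
```

Because primaries and backups are split evenly, losing either zone leaves a complete copy of every partition in the other one, at the price of running without backups until the failed zone returns.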
Failures are Inevitable. Cope with It!
Many enterprises’ high-availability architecture is based on the assumption that you can prevent failure from happening by putting all your critical data in a centralized database, backing it up with expensive storage, and replicating it somehow between the sites. As I argued in one of my previous posts (Why Existing Databases (RAC) are So Breakable!), many of those assumptions are broken at their core: storage is doomed to failure just like any other device, expensive hardware doesn’t make things any better, and database replication is often not enough.
One of the main lessons that we can take from the likes of Amazon and Google is that the right way to ensure continuous high availability is by designing our system to cope with failure. We need to assume that what we tend to think of as unthinkable will probably happen, as that’s the nature of failure. So rather than trying to prevent failures, we need to build a system that will tolerate them.
As we can learn from a recent outage event in one of Amazon’s cloud data centers, we can’t rely on the data center alone to solve this type of failure. The knowledge of how to manage failure must be built into our application:
“By launching instances in separate Availability Zones, you can protect your applications from failure of a single location,” Amazon notes in a FAQ on its Elastic Compute Cloud service.
According to IDC, the total cost of downtime in 2009 was estimated at $400 billion, an average of $8,000 per server per year. We don’t need academic research to know that the chances of downtime grow rapidly with the number of moving parts. This becomes even more challenging as we start to build larger and larger systems to deal with the demand for scale. Under these conditions, providing built-in automation becomes a necessity rather than a luxury, and indeed, in the poll that we conducted, 100% of the respondents thought the same way.
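A back-of-the-envelope way to see the effect of moving parts, assuming (purely for illustration) that each part fails independently with the same probability p in a given period: the chance that at least one of n parts fails is 1 - (1 - p)^n. The 1% figure below is hypothetical and does not come from the IDC data.

```python
def chance_of_any_failure(p, n):
    """Probability that at least one of n independent parts fails.

    p is the per-part failure probability for the period considered.
    Independence is an assumption; real failures are often correlated.
    """
    return 1 - (1 - p) ** n


# Hypothetical 1% per-part failure probability.
for parts in (1, 10, 100):
    print(parts, round(chance_of_any_failure(0.01, parts), 3))
```

Under these assumptions, a single part that fails 1% of the time turns into a system that sees some failure roughly 63% of the time once it has 100 such parts.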
Pawel Plaszczak wrote a post on The cost of High Availability (HA) with Oracle that provides an interesting insight into the actual cost associated with the current high availability models:
The original “DB v1” option was priced at $611K. After the license tuning exercise, the total for “DB v2” option came in at $518K – a saving of 93K.
For this type of project, the major cost considerations are hardware and software, while services and support are marginal. For “DB v1”, cost breakdown is: 36% hardware, 40% software, 13% services, 11% support.
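To put the quoted percentages into dollar terms, the arithmetic can be worked through as below. The figures are the ones quoted above; the variable names are only for illustration.

```python
db_v1 = 611_000  # quoted price of the "DB v1" option, in dollars
db_v2 = 518_000  # quoted price of the "DB v2" option, in dollars

# Quoted cost shares for "DB v1", in percent.
shares = {"hardware": 36, "software": 40, "services": 13, "support": 11}

# Integer arithmetic keeps the dollar amounts exact.
breakdown = {item: db_v1 * pct // 100 for item, pct in shares.items()}

saving = db_v1 - db_v2
saving_pct = 100 * saving / db_v1

for item, dollars in breakdown.items():
    print(f"{item:9s} ${dollars:,}")
print(f"saving    ${saving:,} ({saving_pct:.1f}% of DB v1)")
```

Hardware and software together account for 76% of the “DB v1” price, which is why the license tuning exercise is where the saving comes from.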
The GigaSpaces Elastic Middleware is built as an abstraction layer on top of the GigaSpaces Cluster Admin API, which has been GA since the 7.0 release and is already in use by many of our customers. The Elastic Middleware implementation is already available to beta users as part of the 7.1 release.
- The interactive cloud
- Why Existing Databases (RAC) are So Breakable
- Amazon S3: Architecting For Resiliency In The Face Of Failures By Jason McHugh
- Failure Trends in a Large Disk Drive Population – Google, Inc., February 2007
- Brief Power Outage for Amazon Data Center