Patterns, Guidelines and Best Practices Revisited
In my previous post I analyzed Amazon’s recent AWS outage and the patterns and best practices that enabled some of the businesses hosted on Amazon’s affected availability zones to survive the outage.
The patterns and best practices I presented are essential for robust and scalable architectures in general, and on the cloud in particular. Those who dismissed my last post as an exaggeration of an isolated incident received affirmation last week, when Amazon found itself apologizing once again after its Cloud Drive service was overwhelmed by unpredictable peak demand for Lady Gaga’s newly released album (99 cents, who wouldn’t buy it?!) and rendered non-responsive. This failure to scale up/out to accommodate fluctuating demand raises a scalability concern in the public cloud, in addition to the resilience concern raised by the AWS outage.
Surprisingly, obvious as the patterns I listed may seem, they are clearly not common practice, judging by the number of applications that went down when AWS did, and by how many other applications have similar issues on public cloud providers.
Why are such fundamental principles not prevalent in today’s architectures on the cloud?
One reason these patterns are not prevalent in today’s cloud applications is that designing such architectures requires an architect experienced and confident in distributed and scalable systems. Typical public cloud APIs also require developers to perform complex coding against various non-standard APIs that are not common knowledge. Similar difficulties arise in testing, operating, monitoring and maintaining such systems. All of this makes the above patterns quite difficult to implement for resilience and scalability, and diverts valuable development time and resources from the application’s business logic, which is its core value.
How can we make the introduction of these patterns and best practices smoother and simpler? Can we get these patterns as a service to our application? We are all used to traditional application servers that provide our enterprise applications with underlying services such as database connection pooling, transaction management and security, freeing us from worrying about these concerns so that we can focus on designing our business logic. Similarly, Elastic Application Platforms (EAP) allow your application to easily employ the patterns and best practices I enumerated in my previous post for high availability and elasticity, without your having to become an expert in the field, allowing you to focus on your business logic.
So what is an Elastic Application Platform? Forrester defines an elastic application platform as:
An application platform that automates elasticity of application transactions, services, and data, delivering high availability and performance using elastic resources.
Last month Forrester published a review under the title “Cloud Computing Brings Demand For Elastic Application Platforms”. The review is the result of comprehensive research and spans 17 pages (a blog post introducing it can be found on the Forrester blog). It analyzes the difficulties companies encounter in implementing their applications on top of cloud infrastructure, identifies elastic application platforms as the emerging solution for a smooth path into the cloud, and then maps the potential providers of such solutions. For its research, Forrester interviewed 17 vendor and user companies. Of all the reviewed vendors, Forrester identified only three that are “offering comprehensive EAPs today”: Microsoft, SalesForce.com and GigaSpaces.
As Forrester did an amazing job reviewing and comparing today’s EAP solutions, I won’t repeat that here. Instead, I’d like to review the GigaSpaces EAP solution in light of the patterns discussed in my previous post, and see how building your solution on top of GigaSpaces lets you introduce these patterns easily, without diverting your focus from your business logic.
Patterns, Guidelines and Best Practices Revisited
Design for failure
Well, that’s GigaSpaces’ bread and butter. Whereas thinking about failure diverts you from your core business, in our case it is our core business. The GigaSpaces platform provides the underlying services for high availability and elasticity, so that you don’t have to. Now that we’ve established that, let’s see how it’s done.
Stateless and autonomous services
The GigaSpaces architecture segregates your application into Processing Units. A Processing Unit (PU) is an autonomous unit of your application. It can be a pure business-logic (stateless) unit, hold data in-memory, provide a web application, or mix these and other functions together. You can define the required Service Level Agreement (SLA) for your Processing Unit, and the GigaSpaces platform will enforce it. When your Processing Unit’s SLA requires high availability, the platform deploys a (hot) backup instance (or multiple backups) of the Processing Unit, to which the PU fails over if the primary instance fails. When your application needs to scale out, the platform adds another instance of the Processing Unit (possibly on a newly provisioned virtual machine booted automatically by the platform). When your application needs to distribute data and/or data processing, the platform shards the data evenly across several instances of the Processing Unit, so that each instance handles a subset of the data independently of the other instances.
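To make the failover behavior concrete, here is a toy model in plain Python. This is an illustrative sketch only, not the GigaSpaces API; the `Partition` class and host names are hypothetical:

```python
# Toy model of primary/backup failover for a partitioned Processing Unit.
# Illustrative sketch only -- NOT the GigaSpaces API.

class Partition:
    def __init__(self, pid, primary, backup):
        self.pid = pid
        self.primary = primary   # host running the primary PU instance
        self.backup = backup     # host running the hot backup instance

    def fail_primary(self):
        """Simulate primary failure: the hot backup is promoted to primary.
        A real platform would then provision a fresh backup elsewhere."""
        self.primary, self.backup = self.backup, None

p = Partition(pid=1, primary="host-a", backup="host-b")
p.fail_primary()
print(p.primary)  # -> host-b (the backup took over, no data lost)
```

The point of the sketch is that the client-facing identity of the partition never changes; only the host behind it does, which is why the application code stays oblivious to the failure.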
Redundant hot copies spread across zones
You can divide your deployment environment into virtual zones. These zones can represent different data centers, different cloud infrastructure vendors, or any physical or logical division you see fit. You can then tell the platform (as part of the SLA) not to place a Processing Unit’s primary and backup instances in the same zone, thus ensuring the data stored within the Processing Unit is backed up across two different zones. This gives your application resilience across two data centers, two cloud vendors, or two regions, depending on your required resilience, all with a uniform development API. Want a higher level of resilience? Just define more zones and more backups for each PU.
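A minimal sketch of such a zone-aware placement rule, in plain Python (hypothetical function and zone names, not GigaSpaces configuration):

```python
# Illustrative sketch (not GigaSpaces code): place each partition's primary
# and backup so that they never share a zone.
from itertools import cycle

def place(partitions, zones):
    """Assign (primary_zone, backup_zone) per partition, round-robin,
    guaranteeing the primary and its backup land in different zones."""
    assert len(zones) >= 2, "need at least two zones for zone-safe backups"
    zone_cycle = cycle(range(len(zones)))
    placement = {}
    for pid in partitions:
        i = next(zone_cycle)
        placement[pid] = (zones[i], zones[(i + 1) % len(zones)])
    return placement

layout = place([1, 2, 3, 4], ["us-east", "us-west"])
# every partition's primary and backup are in different zones
assert all(primary != backup for primary, backup in layout.values())
```

In the platform this rule is declared in the SLA rather than coded by hand, but the invariant it enforces is the same: losing one entire zone never loses both copies of any partition.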
Spread across several public cloud vendors and/or private cloud
GigaSpaces abstracts the details of the underlying infrastructure from your application. GigaSpaces’ Multi-Cloud Adaptor technology provides built-in integration with several major cloud providers, including the JClouds open source abstraction layer, and thus supports any cloud vendor that conforms to the JClouds standard. All you need to do is plug your desired cloud providers into the platform; your application logic remains agnostic to the cloud infrastructure details. Plugging in two vendors to ensure resilience becomes just a matter of configuration. The JClouds integration is an open-source project under OpenSpaces.org, so feel free to review it and even pitch in to extend and enhance the integration with cloud vendors.
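The idea behind such an abstraction layer can be sketched in a few lines. This is a hypothetical illustration of the pattern, not JClouds or GigaSpaces code; the `CloudDriver` interface and provider classes are invented for the example:

```python
# Illustrative sketch of the abstraction idea behind a multi-cloud adaptor.
# Hypothetical interfaces -- not JClouds or GigaSpaces code.
from abc import ABC, abstractmethod

class CloudDriver(ABC):
    @abstractmethod
    def start_machine(self) -> str:
        """Provision a VM and return its identifier."""

class ProviderA(CloudDriver):
    def start_machine(self) -> str:
        return "a-vm-1"   # a real driver would call provider A's API here

class ProviderB(CloudDriver):
    def start_machine(self) -> str:
        return "b-vm-1"   # a real driver would call provider B's API here

def scale_out(driver: CloudDriver) -> str:
    # Platform code is written once against the abstraction;
    # swapping cloud vendors is then just configuration.
    return driver.start_machine()

assert scale_out(ProviderA()) == "a-vm-1"
assert scale_out(ProviderB()) == "b-vm-1"
```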
Automation and Monitoring
GigaSpaces offers a powerful set of tools that allow you to automate your system. First, it offers the Elastic Processing Unit, which can automatically monitor CPU and memory utilization and employ corrective actions based on your defined SLA. GigaSpaces also offers a rich Administration and Monitoring API that enables administration and monitoring of all GigaSpaces services and components, as well as the layers running beneath the platform, such as the transport layer, machine and operating system. GigaSpaces also offers a web-based dashboard and a management center client. Another powerful tool for monitoring and automation is the administrative alerts, which can be configured and then viewed through GigaSpaces or external tools (e.g. via SNMP traps).
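The kind of corrective action an SLA-driven elastic platform evaluates can be sketched as a simple threshold rule. A minimal illustration in plain Python, assuming invented thresholds and action names (not the GigaSpaces Admin API):

```python
# Illustrative sketch (not the GigaSpaces Admin API): a threshold-based
# scaling rule of the kind an SLA-driven elastic platform evaluates.

def scaling_action(cpu_pct, high=80.0, low=20.0):
    """Return a corrective action based on CPU utilization thresholds."""
    if cpu_pct > high:
        return "scale-out"   # e.g. add a PU instance / provision a machine
    if cpu_pct < low:
        return "scale-in"    # release surplus capacity to cut costs
    return "none"            # utilization within the SLA band

assert scaling_action(95.0) == "scale-out"
assert scaling_action(10.0) == "scale-in"
assert scaling_action(50.0) == "none"
```

The real platform monitors more signals than CPU (memory, for one) and acts on them continuously, but the shape of the decision, measure against the declared SLA and correct, is the same.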
Avoiding ACID services and leveraging on NoSQL solutions
GigaSpaces does not rule out SQL for querying your data. We believe that true NoSQL stands for “Not Only SQL”: SQL as a language is good for certain uses, whereas other uses call for other query semantics. GigaSpaces supports some of the SQL language through its SQLQuery API or through standard JDBC. However, GigaSpaces also provides a rich set of alternative standards and protocols for accessing your data, such as a Map API for key/value access, a Document API for dynamic schemas, object-oriented access (through the proprietary Space API or standard JPA), and the Memcached protocol.
Another challenge for traditional relational databases is scaling data storage in read-write environments. Distributed relational databases were sufficient for read-mostly environments, but Web 2.0 brought social concepts to the web, with users feeding data into websites. Several NoSQL solutions try to address distributed data storage and querying. GigaSpaces provides this via its support for clustered topologies of the in-memory data grid (the “space”), and for distributing queries and execution using patterns such as Map/Reduce and event-driven design.
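The Map/Reduce pattern mentioned above can be sketched in plain Python. This is an illustrative toy, not the GigaSpaces executor API; the partitioned page-view data is invented for the example:

```python
# Illustrative Map/Reduce sketch (plain Python, not the GigaSpaces
# executor API): each partition computes a partial result next to its
# own data, and the client merges the partial results.

partitions = [
    [("page1", 3), ("page2", 1)],   # data co-located on partition 0
    [("page1", 2), ("page3", 5)],   # data co-located on partition 1
]

def map_phase(partition):
    """Runs next to the data: partial view counts per page."""
    counts = {}
    for page, views in partition:
        counts[page] = counts.get(page, 0) + views
    return counts

def reduce_phase(partials):
    """Merge the per-partition partial counts into a global result."""
    total = {}
    for partial in partials:
        for page, views in partial.items():
            total[page] = total.get(page, 0) + views
    return total

result = reduce_phase(map_phase(p) for p in partitions)
assert result == {"page1": 5, "page2": 1, "page3": 5}
```

The scalability win is that only the small partial results cross the network; the raw data never leaves the partition that owns it.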
The elastic nature of the GigaSpaces platform allows it to automatically detect the CPU and memory capacity of the deployment environment and optimize the load dynamically based on your defined SLA, instead of employing an arbitrary division of the data into fixed zones. This dynamic nature also allows your system to adjust when an entire zone fails (as happened with Amazon’s availability zones), so that your system doesn’t go down even in such extreme cases and maintains optimal balance under the new conditions.
Furthermore, the GigaSpaces platform supports content-based routing, which allows for smart load balancing based on your business model and logic. Content-based routing lets your application route related data to the same host and then execute entire business flows within the same JVM, avoiding the network hops and complex distributed transaction locking that hinder your application’s scalability.
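A minimal sketch of the co-location idea behind content-based routing, in plain Python (the `partition_for` function and the order/payment records are invented for the example; GigaSpaces declares the routing field on the object instead):

```python
# Illustrative sketch of content-based routing (not GigaSpaces code):
# a routing field on each object decides which partition holds it, so
# all data for one business key is co-located and a whole flow can run
# on a single host without network hops.

def partition_for(routing_value, num_partitions):
    """Deterministically map a routing field to a partition."""
    return hash(routing_value) % num_partitions

NUM_PARTITIONS = 4
order = {"order_id": 17, "customer": "acme", "amount": 250}
payment = {"payment_id": 99, "customer": "acme", "amount": 250}

# Routing both objects by the same field ("customer") guarantees they
# land on the same partition, so the order+payment flow runs locally.
assert (partition_for(order["customer"], NUM_PARTITIONS)
        == partition_for(payment["customer"], NUM_PARTITIONS))
```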
Most significant advancements do not happen in slow, gradual steps but in leaps. These leaps happen when the predominant conception collapses in the face of a new reality, leaving behind chaos and uncertainty; out of that chaos emerges the next stage in the system’s maturity.
This is also the case with the maturity process of the young cloud realm: the AWS outage was a major reality check that opened the eyes of the IT world. Systems crashed with AWS because their owners counted on the cloud infrastructure provider to handle their applications’ high availability and elasticity using generic logic. That concept proved wrong. Now the IT world is in confusion, with much discussion of whether the faith in the cloud was mistaken, under titles like “EC2 Failure Feeds Worries About Cloud Services”.
The next step in the cloud’s maturity was the realization that cloud infrastructure is just infrastructure, and that you need to implement your application correctly, using patterns and best practices such as the ones I raised in my previous post, to leverage the cloud infrastructure and gain high availability and elasticity.
The next step in the evolution is to leverage dedicated application platforms that handle these concerns for you and abstract the cloud away from your application, so that you can simply define your application’s SLA for high availability and elasticity and leave it to the platform to manipulate the cloud infrastructure to enforce that SLA, while you concentrate on writing your application’s business logic. As Forrester said:
… A new generation of application platforms for elastic applications is arriving to help remove this barrier to realizing cloud’s benefits. Elastic application platforms will reduce the skill required to design, deliver, and manage elastic applications, making automatic scaling of cloud available to all shops …