Many web applications, including eBanking, Trading, eCommerce and Online Gaming, face large, fluctuating loads. Some of the load is predictable, for example, if you launch a new campaign you expect to see certain growth around that campaign; some of the load is seasonal, like in Christmas for most eCommerce sites or quarterly-based in other cases. But today, with social networks and the speed at which we can disseminate information around the globe, we’re seeing more and more unpredictable load. This effect was known in the financial industry as the “perfect storm,” which basically meant that a chain of events in the market led to a storm of events that brought many systems down. In social networks, this type of viral behavior is also referred to as the Digg effect. The Digg effect means that a site can generate a single news item that will generate more traffic than was ever anticipated to be handled by the site, in a fraction of a second. The barrier for generating such events has been significantly reduced, due to all the new tools available today such as twitter, Digg and Facebook, a trend is appearing across many industries where over time, online systems need to face ever-growing, unpredictable peak load events.
The effect of peak-load provisioning
A traditional way of dealing with loads is to provision the system for peak load. You try to estimate what’s going to be the peak load on the system, then double the number to have enough spare in case you were wrong, and buy the amount of hardware that will be able to handle this most extreme scenario. Interestingly enough, that number determines not just the amount of servers you’ll in the production environment, but also the staging and to a lesser degree the testing and development environments. It also determines the number of software licenses, etc. Now, when this number is low – say 2,3 servers – that’s not ideal but still manageable. But what happens when you need 10 servers or 100 servers? Also, what will the organization’s data center will look like if each application provisions its own spare capacity? Its really simple to imagine, right? On average, the data center will be extremely over-provisioned, which means the system will be extremely under-utilized most of the time, which means a huge waste of money.
The other side of being over-provisioned is being under-provisioned. Being under-provisioned means that you have too little resources to handle the traffic that the system is experiencing. Well, that’s just a way to put this in nice words. Remember the perfect storm?
Public failure due to under-provisioning
I’ve collected only a few examples that show what under provisioning can lead to – failure! The fact that the system is now online means that failure is also online, and quickly becomes a news item, resulting in a PR nightmare for your company.
“Right-sizing” – an oxymoron?
By right-sizing, I mean provisioning the system to fit its needs exactly. There’s a small caveat here – how do you know the right size? With the viral behavior of traffic these days, trying to predict the load is almost like trying to predict the weather – you can probably have a good guess for next month, but your guess is going to be less accurate the longer the period you’re trying to estimate. So the question is, can we do right-sizing?
Right-sizing under fluctuating load
A basic assumption is that you can’t do right-sizing for a long period of time – even a year. You need to estimate the load for short window of time, and then be able to up-scale or down-scale the system system easily when you see there is a need for it. I think there are primarily two ways we can achieve this type of scaling – flexible scaling and auto scaling:
- Flexible scaling – this means that scaling up or down will be a matter of plugging a machine into the network and wouldn’t require re-architecture, changes to our code or even configuration changes. The scaling operation would require a manual process for adding a machine, but wouldn’t require complex development and tuning and, can be done in matter of minutes once the machine is plugged into the network.
- Auto-scaling – means that you can monitor the behavior of the system, and based on these metrics, decide in real time whether it needs to grow or shrink. To enable true auto-scaling, we need to have a pool of available resources that we can pull or release in an on-demand basis. This is basically cloud computing – a cloud (whether a public cloud, like Amazon EC2, or a private one running on-premise) provides the flexibility to add these machines on demand and pay for only for what we consume.
The missing piece – middleware virtualization
If you take pretty much any existing application, add another machine into the network, and watch what happens, you wouldn’t be surprised that nothing much would happen at all.
Today’s applications aren’t able to dynamically take advantage of new computing resources that become available. That’s also true for cloud based environments like EC2 – the fact that you can now add machines easily is nice, but it doesn’t mean that the application can do anything with it. What is missing is a layer that helps the application take advantage of these new resources dynamically, as they are being added to the system. This is where middleware virtualization comes to the rescue.
Like in storage virtualization, where we can add physical disks easily without touching the application that is utilizing them, we can do the same for the web, business logic, messaging and data layers of our applications. The idea is very similar to storage: we decouple the application from the physical resources and then use a client-side proxy that maps between the application API calls and the appropriate resource that will be used to serve that call. On the web layer, you can use a load-balancer to get to this level of virtualization; on the messaging, business logic and data layers, you can use In-Memory-Data-Grids for the same purpose.
Where to start?
By now, you’re probably nodding your head with agreement, but at the same time you’re probably scratching your head thinking, “how the hell do I get there without re-writing my entire application?” Indeed, everything I’ve been saying is pure common sense, yet for years the industry has struggled with this challenge and hasn’t found a way to apply it effectively. So what’s different now?
The step-by-step approach
IMO, the key to successfully achieving flexible scaling or auto-scaling is by finding the right steps that will get you there with minimum risks. The first steps need to show value with very little effort. Once you get into a better comfort level, you can think of taking more serious steps.
If you take a simple web application and break it down to its components (web, business-logic, data) you can easily see that scaling the data layer would probably be harder then scaling the web layer. In this post I’ll focus on the steps required to scale the web layer.
How to add auto-scaling to your existing web tier
- Dynamically provisioning the web container
- Dynamically discovering the web containers and adding them to the load balancer
In a stateless application, these two should do the trick. However, in a more stateful application, you’ll also need to address the way session data is handled when adding or removing machines. For this, you need to be able to address high availability of session data and and data affinity between the load balancer and the web container.
1. Making the session reliable – to make session information reliable, you can back them in an In-Memory Data Grid, which is kept outside the web container. This way, if the web container changes location due to a scaling event, the session information is kept in that shared memory, and a new server can load it on-demand. To avoid the performance overhead of using remote memory, you can use a local cache that is loaded on-demand. For session information, this is fairly useful, because during a particular session it is very likely that you’ll need to read the same user data over and over again. The combination of local and remote Data Grid allows you to get to the great performance of local memory, while still keeping the memory reliable, and without having to overload each server will all the session data.
2. Maintaining data affinity between the load balancer and the web container: In many cases data affinity is done through “sticky sessions”. This basically means that the load balancer “remembers” the first server the user interacted with, and then during the entire session routes all requests from that particular user to the same server. This method is often based on identifying the server based on its IP – but in a dynamically-scalable environment, this assumption no longer holds. The same user session can be on one particular host at one point in time, and then on a different host in another point in time. To overcome this issue, you need to use a routing scheme that isn’t based on host names, but rather on logical IDs. This means that each web container needs to have a logical ID, and the load-balancer can then route the requests based on this ID. The ID will be consistent regardless of where this web container is running, so this solves the problem.
Further optimization – avoiding session stickiness
Imagine a scenario where a user is now associated with a particular server during a session, and that server becomes overloaded. This is where session stickiness becomes a scalability limitation. To avoid this limitation, you can keep the session data outside the web container on a shared In-Memory Data Grid. This means the same user session can be served at any point in time by any of the web containers – which gives you better scalability. The downside of this approach is performance, because you wouldn’t be exploiting the local memory of each web container. You’d also need to apply more stringent locking on the data, because in this scenario a concurrent update might happen under the same user session.
Try it now on the Amazon cloud
All of the above looks nice on paper – we all know how it works, right? :)
But does it really work? We’ve all gone through the experience of reading something we like, and then going and getting it and being disappointed. And in this case, “trying it out” can be fairly complex, because you need to download the software, read the manual, get several machines deployed and configure the environment, and only then we you really give it a try. If you’re like me, you’ll probably give up right there and go through these steps only if you don’t have a choice. This is another area where cloud-computing can be really handy – getting a production-ready test-drive environment.
We’ve just launched a new service on where we give you access to Amazon EC2, with GigaSpaces XAP running as part of the environment, allowing you to experience how this model works without needing to download any software, setup machines, or configure the clustering and load balancer. You can get a fully production-ready environment with all the capabilities I outlined above above in matter of minutes.
If you want to deploy your existing web application on the cloud, check out the deployment instructions in our Cloud Computing Framework documentation. This process is quite straightforward but nevertheless is still not as easy as it could be. If you want to try out the auto-scaling capabilities really quickly, using a demo application, follow the quick steps below:
1. Visit the mycloud page and get your free access code – this will give you access to the cloud.
2. Select the Trader-Stock-Demo – when you click the demo, the system will start provisioning the web application on the cloud. This will include a machine for managing the cloud, a machine for the load-balancer and machines for the web and Data Grid containers. After approximately 3 minutes, the application will be deployed completely. At this point you should see the “running” link on the load balancer machine. Click on this link to open your web client application.
3. Test auto-scaling – click multiple times on the “running” link to open more clients. This increases the load (requests/sec) on the system. As soon as the requests/sec grow beyond a certain threshold, you’ll see new machines being provisioned. After approximately 2 minutes, the machine will be running and a new web container will be auto-deployed into that machine. This new web container will be linked automatically with the load balancer, and the load balancer in turn will spread the load to this new machine. This will reduce the load on each of the existing servers.
4. Test self-healing – you can now kill one of the machines and see how your web client is behaving. You’ll see that even though the machine was killed, the client was hardly effected and the system scaled itself down automatically.
What’s going on behind the scenes?
All this may seem like magic to you. If you want to access the machines and watch the web containers, the Data Grid instances and the machines, as well as real-time statistics, click the Manage button. This opens up a management console that is running on the cloud, and can be accessed through the web. With this tool you can view all the components of our systems. You can even query the data using an SQL browser and view the data as it enters the system. In addition, you can choose to add more services, and even relocate services from one machine to another just by dragging and dropping them in the UI.
The next steps
As you can see,
the combination of middleware-virtualization and cloud computing can bring together the promise of right-sizing through auto-scaling. There are many things that can be done to make this entire process even simpler, not just for the demo application but for your own custom application.
At this point I’d really like to hear your feedback – in particular, I’d like to hear if you see areas that could be improved or areas that are missing in this approach.