Last week I had the pleasure of spending half a day with the senior architects of one of the leading infrastructure software vendors. They are looking into embedding our product as a caching solution (aka Data Grid) in their next-generation product lines. When we went through the list of requirements, the architects of each group described their needs as follows:
- We need a generic clustering model that will enable our servers to share state in-memory, so that they look and behave as a single cluster to the applications using them.
- We need hot fail-over of our servers without persisting the data to disk.
- We need to achieve true linear scalability.
- We need to scale over a WAN between multiple sites.
- We need the ability to receive notifications when changes occur in the cluster.
In addition, they said they need a flavor of the solution bundled as part of their stack to enable user applications written on their platform to leverage that same cluster to maintain application high availability and scalability in a consistent way with their servers.
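To make the notification requirement concrete, here is a minimal sketch of the pattern: a shared map that fans out every update to registered listeners. The names (`NotifyingCache`, `addListener`) are purely illustrative assumptions, not GigaSpaces APIs; a real data grid would deliver such events across the cluster rather than in-process.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Hypothetical sketch of change notifications on a shared cache.
// These names are illustrative only, not a real product API.
public class NotifyingCache {
    private final Map<String, Object> entries = new ConcurrentHashMap<>();
    private final List<BiConsumer<String, Object>> listeners = new CopyOnWriteArrayList<>();

    // Register a callback to be invoked on every change.
    public void addListener(BiConsumer<String, Object> listener) {
        listeners.add(listener);
    }

    // Store the value, then fan the change event out to every listener.
    public void put(String key, Object value) {
        entries.put(key, value);
        for (BiConsumer<String, Object> l : listeners) {
            l.accept(key, value);
        }
    }

    public Object get(String key) {
        return entries.get(key);
    }
}
```

In a clustered setting the interesting part is that the listener may run on a different node than the writer; the local version above only shows the programming model the architects were asking for.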
As the conversation progressed, it became clear that they were looking for something much bigger than what they originally asked for (a caching, or Data Grid, solution). They were actually looking for a solution that would address middleware and application scalability in a consistent way.
I had similar discussions last week with a different company, this one building a global order book application. When we started talking a couple of months ago, they initially described their requirements as a simple caching solution to improve performance.
A few weeks later they had another realization: It needs to work well over a WAN to reduce the network latency associated with accessing shared data between their geographically distributed applications.
And just recently they came to another realization: “We also need to make sure that our entire application is highly available and scalable, and is consistent with the high availability and scalability of our data”.
They were now open to moving away from the proprietary infrastructure they had developed in-house, because they had come to the following conclusions:
- Their existing implementation is a bottleneck because it doesn’t scale well. The original design was based on an active-passive approach and relied on database clustering to achieve high availability. Obviously, the performance of such an approach is limited by the underlying database, and the problem only gets worse when you add scaling requirements.
- It’s not part of their core business. They realized that they spent a disproportionately large part of their time and money building infrastructure rather than building unique features that will add value to their customers.
As with the software vendor I described above, they started with the assumption that the bottleneck was in their data-tier and that the solution was a caching/data grid product. As they developed a deeper understanding of the issue, it turned out to be an architectural bottleneck, which led them to look at a more complete middleware solution that would address those various needs.
This week in Las Vegas I had a chance to meet one of the biggest casino companies, which also follows a very similar pattern.
They started evaluating our software as a way to improve the latency of their online ordering system. In a benchmark with GigaSpaces they achieved a 20-fold improvement in the response time of their existing application, and, not surprisingly, they were very happy with those results.
When we got together to discuss the results of the PoC they described the challenges of their application in the following way:
“We want to move from a monolithic server approach to a service-oriented approach, but we’re facing a lot of complexity due to the number of moving parts. It is also very hard to achieve reliability in such an environment, where so many components are flying around the network.”
One of their big concerns was the reliability of their online systems and the complexity of maintaining them with their existing SOA approach. Once again, we started by solving the data bottleneck but very quickly got to the point where we were looking at their entire architecture and realized that the requirements were much broader: scalability, latency, reliability, and reducing the complexity of their entire application, not just the data-tier.
These requirements are closely related to each other, since in many cases what drives the complexity is the way one handles scalability. In their case, they realized that SOA is the right approach to achieve scalability, but as we all know, SOA is just a concept, not a solution on its own. Most existing SOA solutions tend to make things even more complex in the way they distribute services, because they ignore the fact that services are stateful. Breaking your application into network components may address the separation of concerns, but it introduces a whole set of other issues: how those service components maintain their state, and how they communicate without adding performance overhead, complexity, and reliability problems due to the increased number of moving parts in the network.
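One common way to keep stateful services from turning into a network of chatty remote calls is to partition the state by key, so that each request is routed deterministically to the instance that owns the relevant state. The sketch below is a minimal, hypothetical illustration of that routing idea; the names (`PartitionedService`, `partitionFor`) are my own assumptions, not any vendor’s API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: state is split into partitions, and a request
// for a given key always lands on the same partition, so the state
// never has to be fetched over the network.
public class PartitionedService {
    private final int partitionCount;
    private final Map<String, Long>[] partitions;

    @SuppressWarnings("unchecked")
    public PartitionedService(int partitionCount) {
        this.partitionCount = partitionCount;
        this.partitions = new Map[partitionCount];
        for (int i = 0; i < partitionCount; i++) {
            partitions[i] = new ConcurrentHashMap<>();
        }
    }

    // Deterministic routing: the same key always maps to the same partition.
    int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Example stateful operation, executed against the owning partition only.
    public long increment(String key) {
        Map<String, Long> shard = partitions[partitionFor(key)];
        return shard.merge(key, 1L, Long::sum);
    }
}
```

In a real deployment each partition would live in a separate process or node, with the processing collocated next to its shard of the data; that collocation is what removes the remote state access the paragraph above complains about.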
These are only three examples from the past week that I was involved with!
So, to answer the question “When do you need more than just a Data Grid?”
The answer is simple: once you realize that you need to deal with the scaling, performance, and reliability of your entire application, not just of your data.
While application servers should have provided a solution to those challenges, most of the existing J2EE-based implementations are still lagging behind. One of the challenges we are addressing today, with a combination of Spring and Space Based Architecture, is to create a lightweight and simple application server stack that will fulfill that promise.