There are a couple of basic rules that you need to follow if you want to achieve true linear scalability in a stateful environment:
- You need to reduce contention on the same data source
- You need to remove dependencies between the various units in your application. You can only achieve linear scalability if each unit of work is self-sufficient and shares nothing with the other units.
These are basic principles of scalability. A common pattern to achieve these two principles in a stateful environment is through partitioning, i.e., you break your application into units of work (Processing-Units in GigaSpaces terminology), with each unit handling a specific sub-set of your application data. You can then scale by simply adding more Processing Units.
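To make the idea concrete, here is a minimal sketch of hash-based partitioning, not the GigaSpaces API: each "processing unit" owns only the slice of data whose routing key hashes to it, and the client-facing space routes every operation deterministically. The class and method names are assumptions for illustration.

```python
# Minimal sketch of hash-based partitioning (illustrative, not the GigaSpaces API).
# Each "processing unit" owns the subset of records whose routing key hashes to it.

class ProcessingUnit:
    def __init__(self):
        self.store = {}  # this unit's private slice of the data ("share nothing")

    def put(self, key, value):
        self.store[key] = value

    def get(self, key):
        return self.store.get(key)

class PartitionedSpace:
    def __init__(self, num_partitions):
        self.units = [ProcessingUnit() for _ in range(num_partitions)]

    def _route(self, key):
        # Deterministic routing: the same key always lands on the same unit
        return self.units[hash(key) % len(self.units)]

    def put(self, key, value):
        self._route(key).put(key, value)

    def get(self, key):
        return self._route(key).get(key)

space = PartitionedSpace(num_partitions=4)
space.put("customer-42", {"balance": 100})
print(space.get("customer-42"))  # {'balance': 100}
```

Scaling out is then a matter of adding more processing units, since no unit ever touches another unit's store.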
It is a common misconception to think that partitioning the data rules out aggregation. Many still believe that aggregation is only possible when the data is stored in a centralized location. Google is actually a very good example of the opposite. You wouldn't expect all the data on the World Wide Web to be stored in one central place, right? Even without knowing the exact details of their underlying implementation, you can assume they don't store it centrally, and that if the data isn't centralized it is most likely partitioned one way or another. Yet the speed at which your Google queries return results shows how quickly those aggregated queries are being processed.
Let me try to explain why:
How can aggregated queries run faster on a partitioned topology?
- You parallelize the query so that it runs against every partition simultaneously. By doing so you leverage the CPU and memory capacity of each partition to truly parallelize your request. Note that the client issuing the query can be unaware of the physical separation of the partitions and gets the aggregated results as if it were running against a single gigantic data store – with one main difference: it will be much faster!
- The query can run collocated with data in each partition, and therefore, quickly perform very complex tasks that typically require a lot of data to traverse — without moving the data back and forth.
- Each partition contains a smaller data-set, which reduces contention on the data. As a result, queries against each partition are more efficient.
- The data is stored in-memory. In-memory data storage is far more efficient than disk, especially when it comes to concurrent access.
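The points above can be sketched as a scatter-gather query: the same aggregation runs against every partition in parallel, and the client merges the small partial results. This is a hypothetical illustration; the partition contents and function names are made up, and a real deployment would run the partial step inside each processing unit rather than in client threads.

```python
# Hypothetical scatter-gather aggregation: the same query runs against every
# partition in parallel, and the client merges the partial results.
from concurrent.futures import ThreadPoolExecutor

# Toy data: each inner list stands for one partition's in-memory slice.
partitions = [
    [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 5}],
    [{"customer": "a", "amount": 7}],
    [{"customer": "c", "amount": 3}],
]

def partial_sum(partition, customer):
    # Runs collocated with the partition's data: no records cross the
    # network, only the small partial result does.
    return sum(r["amount"] for r in partition if r["customer"] == customer)

def aggregated_sum(customer):
    # Scatter the query to all partitions at once, then gather and merge.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda p: partial_sum(p, customer), partitions)
    return sum(partials)  # merge step on the client side

print(aggregated_sum("a"))  # 17
```

The client sees a single answer, exactly as if it had queried one centralized store, while each partition only scanned its own small slice.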
It is true that the limiting factor of scalability in this model is the affinity-key. Data affinity determines the level of granularity at which data can be partitioned. It doesn't make sense to partition data at the finest granularity if every query then needs to go over the network to fetch the other data elements required to fulfill it. In some cases the affinity-key can be the customer-id, the session-id, or some other criterion. The definition of that key is application-specific and cannot be applied implicitly without prior design and analysis of how your application uses the data.
For example, work that we at GigaSpaces did with a large billing software company required that the application determine whether a customer has enough money in his account to fulfill a transaction, or identify whether this is a "bad" customer based on his/her profile. Some of the queries in this application mapped directly to a single partition based on the customer-id; others, such as "sum of the current balance in all the customer's accounts," had to map to aggregated queries, because a customer can have multiple accounts. We loaded close to 1 terabyte of data into memory. We easily showed in that case that we could run that query 10 times faster than the original implementation, and we were able to demonstrate that the results were consistent regardless of the size of the data, and even the number of concurrent users accessing the system.
One of the main benefits for this billing software company was that they didn't have to compromise the entire application architecture and its scalability just because some of their queries required aggregation. Interestingly enough, this architecture also proved to deliver significantly better performance than the previous centralized approach.