Benchmarks

*All result figures are in microseconds.

Benchmark                          20th Percentile    90th Percentile
Colocated Read/Write Latency       5                  15
Colocated Notification Latency     10                 230
Remote Execution Latency           200                340
Remote Notification Latency        300                930
Remote Replication Latency         180                320

*For further reading, download the GigaSpaces-Cisco Joint Solution white paper.

NOTE: The benchmark environment is intended to provide ballpark figures and is not fully optimized for all scenarios.

Possible Impossibility – The Race to Zero Latency

Published on December 6, 2010 by Shay Hassidim in Application Architecture, Application Performance, Benchmarks, Caching, Cloud, Data Grid, Development, Events, GigaSpaces, Java, JavaSpaces, OpenSpaces, sba, space-based architecture

I recently read the book "Physics of the Impossible" by the theoretical physicist Michio Kaku. Dr. Kaku lists "possible impossibilities" and classifies them into categories according to whether these "impossibilities" may happen in the near or distant future.

When talking about "zero latency", I consider this something we can achieve today. The term "zero latency" is commonly used when discussing the overhead of the infrastructure used to run enterprise mission critical systems.

Looking at today's software-based systems running on current commodity hardware, we can see that systems can achieve relatively "zero latency" overhead when running with a shared-memory-based data store. In this post I will describe why and how we can do that today, and present results from recent benchmarks.

 

Quick Results

Here are a few quick results for anyone looking for ballpark numbers without diving into the details presented below (see also the summary table at the top of this page):

 

The Application Building Blocks

Every system, whether in a distributed environment or in a stand-alone setup, performs the following basic activities:

  •  read or write data

  • calculate/process data

  • react to operations triggered by some other entity within the system

  • preserve state – to recover from some abnormal failure.

 

All of the above activities can be implemented with disk-based media (a database) or with a memory-based data store (an in-memory data grid, IMDG).

How would we implement this with a database?

With a database you will be using a JDBC driver, setting up a connection pool, generating the relevant SQL statements, reading data from the database over the wire, mapping results to objects, and probably using a transaction manager to coordinate the entire flow. In some situations you might also be using stored procedures, triggers, messaging systems (JMS), or parallel execution tools (Hadoop). This is the usual, classic multi-tier architecture – nothing you are not familiar with. Every data access routine drags along data transformation, serialization, and network calls. This is expensive and dramatically impacts the application's overall end-to-end latency.

How would we implement this with an IMDG?

Here is how the above basic activities are performed using an IMDG:

  • read or write data – With an IMDG, this means using the write operation to push data into a shared memory area so that other clients can read it (a minimal code sketch follows this list).

  • calculate/process data – With an IMDG, this means sending the business logic to the data (rather than sending the data to the business logic, which can be very costly). This is called a task executor.

  • react to operations – With an IMDG, this means using notifications that invoke a listener.

  • preserve state – With an IMDG, this means replicating the state of the primary IMDG instance to a backup copy running on a different physical machine, and starting new backup copies to ensure high availability even after multiple system failures.
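
A minimal sketch of the first activity – writing an entry into a shared (here embedded) space and reading it back – assuming the OpenSpaces Java API. The Trade class, its fields, and the space name "demoSpace" are illustrative and not taken from the original benchmark:

```java
import com.gigaspaces.annotation.pojo.SpaceClass;
import com.gigaspaces.annotation.pojo.SpaceId;
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

// Illustrative space class; null fields act as wildcards in template matching.
@SpaceClass
public class Trade {
    private String id;
    private Double amount;
    private Boolean processed;

    public Trade() {
    }

    public Trade(String id, Double amount) {
        this.id = id;
        this.amount = amount;
        this.processed = Boolean.FALSE;
    }

    @SpaceId
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public Double getAmount() { return amount; }
    public void setAmount(Double amount) { this.amount = amount; }

    public Boolean getProcessed() { return processed; }
    public void setProcessed(Boolean processed) { this.processed = processed; }
}

class QuickStart {
    public static void main(String[] args) {
        // Create a proxy to an embedded space and write/read a single entry.
        GigaSpace gigaSpace = new GigaSpaceConfigurer(
                new UrlSpaceConfigurer("/./demoSpace").space()).gigaSpace();

        gigaSpace.write(new Trade("T-1", 100.0));   // push the data into shared memory

        Trade template = new Trade();               // match-by-template read
        template.setId("T-1");
        Trade result = gigaSpace.read(template);
        System.out.println("read back: " + result.getId());
    }
}
```

The same illustrative Trade class and GigaSpace proxy are reused in the sketches later in this post.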

Space Operations

We will measure the latency of each of these operations (read/write, calculate, handle events, and preserve results) with the GigaSpaces IMDG (aka the space), and examine the results using latency histogram charts that show the exact data point distribution and outliers observed throughout the test.

Disk based vs. Shared Memory based Data Storage

The disk-based solution depends on disk speed and the usual physical limits of magnetic storage devices. These suffer from the inability to access different data items at the same time – a limitation that dramatically impacts their scalability. A typical Java application using a database (on commodity disk devices) would see response times of 20 ms with no load, and up to several seconds (3-5 seconds) under high load. This latency includes transforming objects into a relational model (ORM), transporting data over the wire, and replicating the transaction data from the primary database instance to a secondary hot-backup instance located within the same LAN.

Relational databases are designed and optimized for commodity disk drives. Even if you run a relational database in-memory, the code and internal structures are not optimized for operation in an in-memory distributed environment.

There has been some progress lately in leveraging solid-state drive (SSD) products as alternatives to commodity disk devices, but relational databases are still monolithic servers that cannot operate in a distributed environment, take advantage of the entire network's resources, or scale elastically.

An IMDG, on the other hand, is designed to serve more clients at the same time and to access more than a single data item at a time for a given client. This is possible thanks to the latest multi-core CPUs, blazing-fast memory chips, and advances in virtual machine garbage collection algorithms. An IMDG designed to run colocated business logic, within the same memory space as the application, allows applications to access data without any remote calls. This also avoids the serialization and temporary object creation that negatively impact latency. In addition, a good IMDG is elastic and designed to utilize network resources by expanding its capacity and executing operations simultaneously across all the IMDG nodes.

Well known Deterministic Behavior

Looking closely at financial, aviation, healthcare, telecom, and defense systems, among others, they all share the same fundamental requirement: immediate response time. There is zero tolerance for a delayed response. These systems require deterministic, predictable, precisely known response times. Sometimes the negative effect of a delay in these systems is lost money; in other cases it could lead to the loss of human lives.

The Very Basic Business Logic Flow

Remember, every system predominantly goes through the same basic steps: consume data (1), process it into some other meaningful output (2), and preserve this state for later use (3). In many cases, step 2 includes hundreds of calculations, sometimes conducted serially, where the entire process (1, 2, and 3) must be completed as one atomic operation – aka a transaction.

When looking at these fundamental operations with a traditional relational database, steps one and three are usually the more expensive ones, because communicating with the data store is slow. With an IMDG, however, step two is by far the most expensive, because the processes innately have the data available in their address space.

The expense then becomes a factor of the complexity of the business logic and of the different entities that comprise the system – the data model, the different types of users the system serves, or the need to interact with third-party systems. On average, such activity will consume 10-50 ms of the overall latency required to complete the flow. In some unique systems, high-frequency algorithmic trading for instance, it can actually go below the 3 ms range.
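
To make the consume/process/preserve flow concrete, here is a hedged sketch of an event-driven processor using the OpenSpaces polling-container annotations. TradeProcessor and the processed flag (from the Trade sketch above) are illustrative, this is not the author's benchmark code, and the Spring/pu.xml registration of the bean is omitted:

```java
import org.openspaces.events.EventDriven;
import org.openspaces.events.EventTemplate;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.polling.Polling;

// Consumes unprocessed Trade entries (step 1), applies business logic (step 2),
// and returns the result, which the container writes back to the space (step 3).
@EventDriven
@Polling
public class TradeProcessor {

    @EventTemplate
    public Trade unprocessedTrade() {
        Trade template = new Trade();
        template.setProcessed(Boolean.FALSE);   // match only unprocessed entries
        return template;
    }

    @SpaceDataEvent
    public Trade process(Trade trade) {
        // ... business logic goes here ...
        trade.setProcessed(Boolean.TRUE);
        return trade;                           // written back to the space
    }
}
```

When such a processor is deployed colocated with its partition, all three steps run against local memory, which is the colocated scenario discussed throughout this post.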

The Results

In the following sections we will review the results of a simple benchmark I created that measures all the basic elements needed to implement the application building blocks mentioned above: read, write, calculate/process data (execute), react to operations (notify), and preserve the transaction state (replication).
 

Read/Write Latency

As we can see from the latency histograms below, for write and read operations executed at a rate of 800 operations/second against an embedded space (a simple measurement sketch follows these results):

  •  20% of the measured latency data points for write and read operations were below 5 microseconds. In other words, a maximum of 200,000 operations/sec for a single client thread.

  •  50% of the measured latency data points for write and read operations were below 7 microseconds.

  • 88% of the measured latency data points for write and read operations were below 10 microseconds.

  • 90% of the measured latency data points for write and read operations were below 15 microseconds.

  • 99% of the measured latency data points for read operations were below 18 microseconds.

  • 99% of the measured latency data points for write operations were below 22 microseconds. 
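
The harness used to produce these histograms is not included in the post; the following is only an illustrative sketch of how per-operation latency percentiles could be sampled with System.nanoTime(), reusing the illustrative Trade class and GigaSpace proxy from above:

```java
import java.util.Arrays;
import org.openspaces.core.GigaSpace;

// Illustrative only: samples the latency of each write and prints a few percentiles.
public class WriteLatencySampler {

    public static void sample(GigaSpace gigaSpace, int iterations) {
        long[] micros = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            Trade trade = new Trade("T-" + i, 1.0);
            long start = System.nanoTime();
            gigaSpace.write(trade);                          // operation under test
            micros[i] = (System.nanoTime() - start) / 1000;  // nanoseconds -> microseconds
        }
        Arrays.sort(micros);
        System.out.println("50th percentile: " + micros[(int) (iterations * 0.50)] + " us");
        System.out.println("90th percentile: " + micros[(int) (iterations * 0.90)] + " us");
        System.out.println("99th percentile: " + micros[(int) (iterations * 0.99)] + " us");
    }
}
```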

[Histogram: write/read latency]

Execute Latency

A fundamental framework for distributed computing is the map/reduce pattern. It is a cornerstone for almost every system that needs to run activities in parallel, whether across multiple processes on the same machine or across different machines. The task executor IMDG operation, inspired by the Java executor service and its Future concept, allows a client application to submit a task to be invoked asynchronously. The result value can be consumed by the client application when desired.

The IMDG executor operation latency histogram presented below demonstrates the relatively low overhead observed when submitting an empty dummy task to a remote space. Together with the read/write operations above, we can achieve constant low-latency overhead when performing complex computations across a large number of IMDG partitions.
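
A hedged sketch of such an empty task, assuming the OpenSpaces Task/AsyncFuture API; DummyTask is an illustrative name and the routing value is arbitrary:

```java
import java.io.Serializable;
import org.openspaces.core.executor.Task;

// Does no work, so submitting it measures the pure overhead of shipping a task
// to the space and returning its result.
public class DummyTask implements Task<Integer>, Serializable {
    @Override
    public Integer execute() throws Exception {
        return 0;
    }
}

// Client-side submission ("1" is a routing value used to pick a partition):
//   AsyncFuture<Integer> future = gigaSpace.execute(new DummyTask(), 1);
//   Integer result = future.get();   // consume the result when desired
```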

[Histogram: remote execute latency]

  •  90% of the measured latency data points were below 340 microseconds. This works out to a maximum of roughly 3000 remote execute operations per second by a single thread.

  • 99% of the measured latency data points were below 600 microseconds. This yields a maximum of 1666 remote execute operations per second by a single thread.  

Notification Latency

Getting notified about specific events that occur within the IMDG is also a relatively fast activity. As we can see from the histogram below, most events are delivered from the IMDG to the event handlers within one millisecond. This one-ms duration includes the time it takes to identify the event within the IMDG, the network latency, and the time it takes to transfer the event data (including serialization and de-serialization) from the IMDG process to the application process. The event sent to the client includes the data object that triggered the notification, allowing the application to react and perform the necessary activities. (A listener-registration sketch follows the results below.)

[Histogram: remote notification latency]

  •  90% of the measured latency data points were below 930 microseconds.

  •  99% of the measured latency data points were below 1320 microseconds.

In some cases the notification is delivered from a colocated IMDG instance. In such a case there is no network overhead or data transformation involved, and as we can see from the histogram below, the latency is dramatically lower:

[Histogram: embedded (colocated) notification latency]

  •  90% of the measured latency data points were below 230 microseconds. This gives a rough maximum of 4300 notifications per second per client thread.

  •  99% of the measured latency data points were below 340 microseconds. With this, there’s a maximum of 3000 notifications per second per client thread.
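
For reference, here is a hedged sketch of how such a listener could be registered, assuming the OpenSpaces notify-container API; the class and method names are illustrative and this is not the benchmark's actual listener. Registering it in a separate client process corresponds to the remote case above, while registering it inside the same processing unit as the space corresponds to the colocated case:

```java
import org.openspaces.core.GigaSpace;
import org.openspaces.events.adapter.SpaceDataEvent;
import org.openspaces.events.notify.SimpleNotifyContainerConfigurer;
import org.openspaces.events.notify.SimpleNotifyEventListenerContainer;

public class TradeNotificationSetup {

    // Invokes the listener whenever a Trade entry is written or updated in the space.
    public static SimpleNotifyEventListenerContainer register(GigaSpace gigaSpace) {
        return new SimpleNotifyContainerConfigurer(gigaSpace)
                .template(new Trade())          // which entries to watch
                .notifyWrite(true)
                .notifyUpdate(true)
                .eventListenerAnnotation(new Object() {
                    @SpaceDataEvent
                    public void onTrade(Trade trade) {
                        // The event carries the data object that triggered it;
                        // react to the operation here.
                    }
                })
                .notifyContainer();
    }
}
```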

 

Synchronous Replication Latency

Saving the outcome of the calculation and processing – i.e., preserving the state of the calculation – is essential. Without it, the application will never be able to operate reliably. Replication of the data must happen as soon as the data is generated and committed, in order to survive a system failure; it cannot be performed as a background activity. This means that once the system acknowledges that it has completed a business transaction (a specific unit of work), its state must be preserved in at least two different physical locations. To achieve this, the IMDG must use a synchronous replication approach that mirrors the changes performed in the primary instance to a backup copy hosted within a different VM than the primary.

The overhead involved in replicating the state of an average transaction is relatively low (under 320 microseconds in most cases). With that kind of overhead, most business logic activities will easily complete with less than one millisecond of latency. Additional concurrent users or larger amounts of data will not impact these numbers, since we break the IMDG into logical partitions, each with its own dedicated replication engine. (A configuration sketch for this topology follows.)
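
As a rough illustration of such a topology, the sketch below starts one primary member of a partitioned space whose primaries replicate synchronously to a backup, using the classic space-URL cluster parameters. The space name and member counts are illustrative, and in a real deployment this topology is normally declared in the processing unit's SLA rather than in code:

```java
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

public class SyncBackupMember {

    // cluster_schema=partitioned-sync2backup: each primary replicates synchronously
    // to its backup; total_members=2,1: two partitions, one backup per partition;
    // id=1: this process hosts the first primary (a backup would add backup_id=1).
    public static GigaSpace startFirstPrimary() {
        return new GigaSpaceConfigurer(
                new UrlSpaceConfigurer(
                        "/./demoSpace?cluster_schema=partitioned-sync2backup"
                                + "&total_members=2,1&id=1")
                        .space())
                .gigaSpace();
    }
}
```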

 

[Histogram: synchronous replication latency]

  • 90% of the measured latency data points were below 320 microseconds. Maximum of 3100 replication events per second per thread.

  • 99% of the measured latency data points were below 480 microseconds. Maximum of 2100 replication events per second per thread. 

Conclusion 

When taking into consideration the average time it takes to consume data from the IMDG, the time it takes the business logic to process the incoming data and digest it into meaningful output, and the time it takes to preserve that output to survive system failures, we see that the actual impact of the IMDG on the overall business transaction latency is between 2 and 10 percent (around one millisecond). This is a constant overhead, regardless of the IMDG size, the application size, or the data model complexity.

Comparing this overhead to a disk-based system, where the overhead starts at 20 ms and can reach 5000 ms of latency (3 orders of magnitude larger than the IMDG!), we can say we have almost zero latency overhead with the IMDG. Thus, the zero-latency "impossibility" is now a reality!

Space Based Architecture (SBA) vs Tier Based Implementation

This benchmark compared a typical OLTP application implemented with the classic tier-based approach, using one of the existing JEE products, against a Space-Based Architecture implementation on top of GigaSpaces XAP. In both cases we used the exact same application code; the difference between the tests was the underlying implementation of the messaging and data layers and the way the application layers were assembled. Based on this experience, we summarized in this white paper the steps required to write a JEE application in a way that can exploit the benefits of SBA without locking your code into any specific implementation. The benchmark results of this test are outlined below.

Results

GigaSpaces XAP proved far superior to the traditional tier-based implementation in terms of both throughput and latency. On the same hardware (quad-core AMD machines with RH Linux), the GigaSpaces implementation delivered 6 times more throughput with up to 10 times lower latency, as depicted in the graphs below:

[Graphs: throughput and latency comparison]

Benchmark Summary

In many benchmarks we measure each tier individually – messaging throughput/latency, data-layer throughput and latency – yet when we bring all those pieces together in an application, we are often surprised that the end-to-end throughput and latency do not come close to those per-tier results. The reason is that with end-to-end measurement you are only as strong as your weakest link. In the tier-based approach, the overhead of integrating the data, business logic, and messaging tiers, keeping each of them highly available, and ensuring transaction consistency using two-phase commit can bring any system to its knees.

With Space-Based Architecture we were able to reduce this overhead by collocating the tiers and by breaking the entire application into partitions, each dealing with a certain segment of the data. We also used common in-memory clustering for the messaging, data, and business logic, which reduced the number of moving parts and avoided the need for two-phase-commit transactions without losing consistency. The results of this benchmark show the significant impact of measuring end-to-end performance and latency, and how much better throughput and latency Space-Based Architecture can deliver compared to the tier-based alternative. The test was conducted in a relatively small-scale environment; the difference between the two approaches is expected to grow even more significant with scale.

Web Application Scalability

The benchmark we conducted used a classic eCommerce application (Pet Clinic) on top of the web application support in the GigaSpaces XAP application server, and measured the number of pages/sec generated as we increased the number of concurrent users.

The goal of this benchmark was to determine a cost-effective hardware and software architecture that leverages commodity multi-core hardware to scale web applications efficiently while minimizing the number of physical machines deployed.

The diagram below shows the architecture we used for this test application.

[Architecture diagram: Apache load balancer, web containers, GigaSpaces data grid, MySQL database]

As can be seen in the above diagram, we used an Apache load balancer in front of up to 3 web containers, MySQL as the database, and the GigaSpaces data grid to front-end the database. We used a map/reduce pattern for querying the entire data set and getting aggregated results (a sketch of this style of aggregation follows).
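
A hedged sketch of this style of aggregation, assuming the OpenSpaces DistributedTask API and reusing the illustrative Trade class from earlier; each partition counts its local entries (the "map" step) and the client sums the partial results (the "reduce" step):

```java
import com.gigaspaces.async.AsyncResult;
import java.io.Serializable;
import java.util.List;
import org.openspaces.core.GigaSpace;
import org.openspaces.core.executor.DistributedTask;
import org.openspaces.core.executor.TaskGigaSpace;

public class CountTradesTask implements DistributedTask<Integer, Integer>, Serializable {

    @TaskGigaSpace
    private transient GigaSpace colocatedSpace;   // injected, colocated proxy

    @Override
    public Integer execute() throws Exception {
        // Runs inside each partition against its local data only.
        return colocatedSpace.count(new Trade());
    }

    @Override
    public Integer reduce(List<AsyncResult<Integer>> results) throws Exception {
        int total = 0;
        for (AsyncResult<Integer> partial : results) {
            if (partial.getException() != null) {
                throw partial.getException();
            }
            total += partial.getResult();
        }
        return total;
    }
}

// Client side: broadcast to all partitions and collect the aggregated result.
//   AsyncFuture<Integer> future = gigaSpace.execute(new CountTradesTask());
//   int totalTrades = future.get();
```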

The physical deployment in terms of the HW environment appeared as follows:
3 server boxes with 24 cores each

 

Results

  • Page views/sec - 1.4 billion page views per day (16,000 page views/sec)
  • Latency - 6 msec (in a LAN environment)

Benchmark Summary

This benchmark showed that, using relatively inexpensive hardware and a small number of machines, a combination of the scale-up and scale-out approaches provides a very cost-effective solution for scaling web applications. Adding more web servers dynamically (using the GigaSpaces dynamic web scaling integration) flattened the latency under load while simultaneously increasing the number of pages that could be served.

 

Raw Performance of Space Operations in Java and .Net

The benchmarks below deal with the raw performance of basic space operations, i.e. read, write, take, and notify. The first benchmark runs against a remote space, where the space and client run on two separate machines. The second benchmark runs against an embedded space, where the space and client co-exist within the same process. (A configuration sketch for the two scenarios follows.)
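
For context, here is a hedged sketch of how the two scenarios might be configured, assuming the classic space-URL syntax; the space name "benchmarkSpace" is illustrative:

```java
import org.openspaces.core.GigaSpace;
import org.openspaces.core.GigaSpaceConfigurer;
import org.openspaces.core.space.UrlSpaceConfigurer;

public class SpaceProxies {

    // Remote scenario: the client looks up a space running in another process
    // (typically on another machine) via the Jini lookup service.
    public static GigaSpace remote() {
        return new GigaSpaceConfigurer(
                new UrlSpaceConfigurer("jini://*/*/benchmarkSpace").space()).gigaSpace();
    }

    // Embedded scenario: the space is started inside the client process, so
    // operations avoid network hops and serialization.
    public static GigaSpace embedded() {
        return new GigaSpaceConfigurer(
                new UrlSpaceConfigurer("/./benchmarkSpace").space()).gigaSpace();
    }
}
```

The application code that uses the returned GigaSpace proxy is identical in both cases, which is what allows the same benchmark code to run in both the remote and embedded configurations.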

Results

Java Results

In these tests we used a Sun Fire X4450 (Intel Xeon 7460 CPUs), a 1 Gb network, a 1 KB payload, single operations, and no special JVM tuning.

                                                       Write              Read               Take
Remote - with replication - 50 client threads         45K/sec            90K/sec            45K/sec
Embedded - without replication - 20 client threads    1.1 million/sec    1.8 million/sec    1.1 million/sec

.NET Results

Benchmark Summary

The results show that even a single remote space instance can be fairly scalable and handle a large volume of requests/sec. The embedded space results show that collocation leads to a significant increase in throughput. This indicates that we can write the same application code for both remote and local operations and benefit from collocation implicitly, without the need for completely different implementations.

The .NET results in the remote scenario are fairly close to those of Java. In embedded mode we see a significant gain in performance, as expected; however, the gain in .NET is lower than the gain in Java, because read and write operations are not passed by reference as they are with Java. The .NET native local cache provides more than 1M reads/sec, which is relatively close to a native .NET operation. In a real-life situation, where each embedded update would need to go through replication for backup purposes, the performance of Java and .NET is expected to be very close.

Scale-up on Multi-Core Benchmark

Building applications in a way that exploits multi-core technology requires a specific parallel-programming skill set that is rarely available in typical organizations, and it is often considered a paradigm that conflicts with the scale-out model. The following test was conducted on a Sun T5240 SPARC machine; similar tests were also done on a Sun Fire X4450 (Intel). To exploit the power of multi-core technologies we used the built-in parallel-processing semantics available with the GigaSpaces middleware, which enabled scaling up and scaling out at the same time without forcing changes to the application code.

Figure 1: T5240 scale-up performance gain

Benchmark Summary

The GigaSpaces parallel-processing API provides a Java programming model equivalent to the Actor model available in languages such as Scala and Erlang. This combination makes it possible to exploit the full potential of multi-core power and provides unparalleled performance at significantly reduced power consumption, without exposing the complexity associated with such optimization to the application and without the need to introduce a completely new language for that purpose.

Seamless Transition from Scale-Up to Scale-Out

The user application connects to the space using a GigaSpaces smart proxy. The smart proxy detects whether the space implementation is remote or local, and uses a direct call if it is local or a network call otherwise. The application code is kept the same in both cases, which makes it possible to design the application for both dimensions of scalability simultaneously, or to switch between scale-up and scale-out models at any point in time through a simple configuration change.

To learn more about GigaSpaces parallel programming API and Actor model visit the following blog post: Actor Model and Data Grids

Intel Benchmark Shows XAP 7 is 300% Faster on Nehalem Processor

GigaSpaces XAP 7.0, running on Intel's Xeon 5500 Nehalem processor, has been shown to be 300% faster than XAP 6.0 on the previous best Intel processor. The benchmark was run by the fasterAPPS program, a joint initiative of Intel, MPI Europe, and Globant aimed at encouraging the migration of financial applications to the latest multi-core technology.

XAP 7.0 - Optimized for Multi-Core

GigaSpaces XAP 7.0 has been specifically optimized to take advantage of multi-core environments, allowing highly multithreaded applications to run as efficiently as possible with the least possible resource contention. Specifically, XAP's in-memory transaction and locking mechanisms have been refactored to use more lightweight locking and synchronization constructs supported by modern processors and later versions of the Java virtual machine. This in turn helped us achieve significant performance gains, as shown in this benchmark.

Benchmark Results Summary

GigaSpaces XAP 7.0 was benchmarked on Intel’s latest Xeon 5500 Nehalem processor and achieved the following results:

  • 1 million write operations/sec (embedded mode) on one machine - "write operations" are data updates.
  • 2.6 million read operations/sec (embedded mode) on one machine - "read operations" are data retrievals.
  • 360% boost in write performance and 570% boost in read performance (embedded mode) compared to XAP 6.0 running on the previous best Intel processor, which achieved 276K writes/sec and 453K reads/sec.
  • 90K read operations/sec and 40K write operations/sec (remote access) on a partitioned cluster of 3 Nehalem machines, scaling up to 16 threads with near-linear scalability.
  • 30% better scalability than XAP 6.0 on the best previous Intel processor.

Note: Running XAP in embedded (local) mode is an order of magnitude faster than accessing it remotely. Users of XAP have the option of running data collocated with business logic (Space-Based Architecture), to enable local access for all transactions, making the high performance figure relevant for most real-life applications.

Benchmark Figures for Embedded (1 machine) vs. Remote Partitioned Cluster (3 machines)

Benchmark Configuration

  • The single embedded space scenario was conducted on a single server
  • The partitioned remote space scenario was conducted on a cluster of 3 servers

Server Specifications

  • CPU: Two Nehalem 2.80GHz C0 step processors
  • Memory: 24GB DDR3 1333MHz memory
  • Speed Step and Hyper Threading: off
  • Network interface: Mellanox ConnectX QDR Infiniband, connected to Mellanox switch
  • Operating System: RHEL 5
  • Java Version: Sun JVM 1.6.0_12
  • GigaSpaces Version: XAP Premium 7.0.1 GA (build 3818)