Data Grid FAQ

Frequently Asked Questions:

  • How does an in-memory data grid improve the performance, scalability and reliability of a relational database (RDBMS)?
    The fundamental problem with both database replication and database partitioning is their reliance on file system and disk performance, along with the complexity of setting up database clusters. However you approach it, file systems handle concurrency and scaling poorly. Part of this is physical: every access to disk storage pays for serialization and de-serialization, as well as for mapping from a binary format to a usable one, and this puts a hard floor on latency. Latency tends to degrade further under load when the storage layer cannot scale. Together, these factors limit the performance and scalability of file systems and of the databases that rely heavily on them.

    These database patterns evolved under the assumption that memory is scarce and expensive, and that network bandwidth is a bottleneck. Today, memory resources are abundant and available at a relatively low cost. So is bandwidth. These two facts allow us to do things differently than we used to, when file systems were the only economically feasible option.

    GigaSpaces takes advantage of this by managing data (including transactional data) as objects in-memory and collocated with the application business logic (running within the same process). This significantly reduces latency. It also allows for better scalability, as the data can be easily partitioned across nodes that have no dependency on each other (each processes a sub-set of the data). Finally, reliability is achieved by maintaining active hot backups of each partition, which can take over instantly upon the failure of the primary node (fail-over).
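
    The partitioning idea described above can be sketched in a few lines. This is a minimal illustration, not the GigaSpaces API: the class `PartitionedGrid`, its methods, and the use of a CRC32 hash for routing are all assumptions made for the example. It shows that each key is owned by exactly one partition and that partitions share no state.

```python
import zlib


class PartitionedGrid:
    """Toy in-process grid (hypothetical): each partition is an independent dict."""

    def __init__(self, partition_count):
        self.partitions = [{} for _ in range(partition_count)]

    def partition_for(self, routing_key):
        # A stable hash of the routing key selects exactly one partition.
        digest = zlib.crc32(str(routing_key).encode("utf-8"))
        return digest % len(self.partitions)

    def write(self, routing_key, value):
        self.partitions[self.partition_for(routing_key)][routing_key] = value

    def read(self, routing_key):
        # Only the owning partition is consulted; the others are never touched.
        return self.partitions[self.partition_for(routing_key)].get(routing_key)


grid = PartitionedGrid(partition_count=4)
for account_id in range(1000):
    grid.write(account_id, {"balance": account_id * 10})

sizes = [len(partition) for partition in grid.partitions]
```

    Because a read or write touches only the owning partition, adding partitions spreads both the data and the load without introducing coordination between nodes.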

  • If GigaSpaces EDG synchronizes with a relational database, doesn't that mean that performance is limited?
    No, for the following reasons:

    • Data is sent from memory to the database asynchronously and in batches.
    • Updates to the database are performed in parallel by all partitions.
    • Updates to the database are executed on the same machine as the database, through the GigaSpaces Mirror Service. This reduces network overhead and enables optimizations such as batch operations.
    • The database is not used for high availability purposes. This means that in-flight transactions are not stored in the database, only the end result of the business transaction. This, in turn, reduces the number of updates sent to the underlying database. Also, queries don't hit the database, only updates and inserts. All of this combined means that the in-memory data grid (IMDG) acts as a smart buffer in front of the database. It is common for the IMDG to receive 10x more reads and updates than the underlying database.
    With GigaSpaces EDG, the database and the application are decoupled, enabling more options for optimization. For example, there are scenarios where writing to the database is required to ensure the durability of the data. In this scenario, the data is stored directly in a persistent log (to ensure durability). The log can be updated at a relatively high rate. Data is read from the persistent log back into the database as an off-line operation. With this approach, update rates can easily reach 30,000 to 40,000 per second with a single low-end database instance (such as MySQL). If this is insufficient, database instances can be clustered for faster database access.
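
    The "smart buffer" behavior can be illustrated with a small write-behind sketch. Everything here is hypothetical (`FakeDatabase`, `WriteBehindBuffer`, the batch size), and `flush()` is called explicitly to keep the example deterministic, whereas a real system would run it asynchronously in the background. The sketch shows two of the effects listed above: reads never reach the database, and repeated updates to the same key are coalesced so that only the end result is persisted, in batches.

```python
class FakeDatabase:
    """Stand-in for an RDBMS (hypothetical); counts round trips instead of doing I/O."""

    def __init__(self):
        self.rows = {}
        self.round_trips = 0

    def apply_batch(self, batch):
        self.round_trips += 1
        for key, value in batch:
            self.rows[key] = value


class WriteBehindBuffer:
    """In-memory store that persists changes later, in batches."""

    def __init__(self, database, batch_size):
        self.database = database
        self.batch_size = batch_size
        self.memory = {}
        self.pending = {}  # latest value per key that still awaits persistence

    def write(self, key, value):
        self.memory[key] = value
        self.pending[key] = value  # overwrites any earlier pending update

    def read(self, key):
        return self.memory.get(key)  # served from memory only

    def flush(self):
        # In a real deployment this would run on a background thread or service.
        items = list(self.pending.items())
        self.pending.clear()
        for start in range(0, len(items), self.batch_size):
            self.database.apply_batch(items[start:start + self.batch_size])


db = FakeDatabase()
buffer = WriteBehindBuffer(db, batch_size=50)
for i in range(1000):
    buffer.write(i % 100, i)  # 1000 updates spread over 100 keys
buffer.flush()
```

    In this run, 1000 in-memory updates reach the stand-in database as 2 batches covering 100 rows.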
  • Doesn't asynchronous replication to the database mean that data might be lost in case of failure?
    No. Asynchronous replication refers only to the transfer of data between the in-memory data grid (IMDG) and the database. Within the IMDG itself, each partition maintains in-memory backups that are updated synchronously. If one of the nodes in a partitioned cluster fails before its data has been replicated to the underlying database, its backup can instantly continue from that exact point.
  • What happens if one of the in-memory data grid partitions fails?
    The backup partition takes over and becomes the primary. GigaSpaces EDG re-directs the failed operation to the hot backup implicitly. This enables a smooth transition of the client application during failure -- as if nothing happened. Each primary node may have multiple backups to further reduce the chance of total failure. In addition, the cluster manager component detects failure and provisions a new backup instance on one of the available machines.
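
    A minimal sketch of this takeover, assuming a single partition with one synchronously updated backup. The class `ReplicatedPartition` and its methods are invented for illustration; real failure detection and provisioning are handled by the cluster manager, which is reduced here to a flag and a dictionary copy.

```python
class ReplicatedPartition:
    """One partition with a primary copy and a synchronously updated backup."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.primary_alive = True

    def write(self, key, value):
        if not self.primary_alive:
            self.promote_backup()
        self.primary[key] = value
        self.backup[key] = value  # synchronous: applied before write() returns

    def read(self, key):
        if not self.primary_alive:
            self.promote_backup()  # the caller is redirected implicitly
        return self.primary.get(key)

    def fail_primary(self):
        # Simulate losing the primary node and everything it held in memory.
        self.primary = None
        self.primary_alive = False

    def promote_backup(self):
        # The backup becomes the primary; a fresh backup is provisioned from it.
        self.primary = self.backup
        self.backup = dict(self.primary)
        self.primary_alive = True


partition = ReplicatedPartition()
partition.write("trade-1", "NEW")
partition.write("trade-1", "FILLED")
partition.fail_primary()
value_after_failure = partition.read("trade-1")
partition.write("trade-2", "NEW")
```

    The caller issues the same `read` and `write` calls before and after the failure, which is the point of the implicit redirection described above.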
  • What happens if the database fails?
    The in-memory data grid (IMDG) maintains a log of all updates and replays them as soon as the database becomes available again. During the outage the system continues to operate from memory, so the end user does not notice the failure.
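
    The replay behavior can be sketched as an ordered log that is drained whenever the database is reachable. The names `FlakyDatabase` and `ReplayingGrid` are hypothetical, and `drain()` is invoked inline for determinism rather than asynchronously as in the real product.

```python
from collections import deque


class FlakyDatabase:
    """Stand-in database (hypothetical) that can be switched off."""

    def __init__(self):
        self.rows = {}
        self.available = True

    def apply(self, key, value):
        if not self.available:
            raise ConnectionError("database is down")
        self.rows[key] = value


class ReplayingGrid:
    """Serves data from memory and keeps an ordered log of unpersisted updates."""

    def __init__(self, database):
        self.database = database
        self.memory = {}
        self.redo_log = deque()

    def write(self, key, value):
        self.memory[key] = value           # the write always succeeds in memory
        self.redo_log.append((key, value))
        self.drain()

    def drain(self):
        """Replay logged updates in order; stop at the first failure."""
        while self.redo_log:
            key, value = self.redo_log[0]
            try:
                self.database.apply(key, value)
            except ConnectionError:
                return False               # keep the entry for a later retry
            self.redo_log.popleft()
        return True


db = FlakyDatabase()
grid = ReplayingGrid(db)
grid.write("a", 1)
db.available = False
grid.write("b", 2)
grid.write("a", 3)
backlog_during_outage = len(grid.redo_log)
rows_during_outage = dict(db.rows)
db.available = True
caught_up = grid.drain()
```

    An entry is removed from the log only after the database has accepted it, so an outage delays persistence but does not drop or reorder updates.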
  • How do I maintain transactional integrity with GigaSpaces EDG?
    GigaSpaces EDG supports the standard two-phase commit protocol and XA transactions. That said, this model should be avoided where possible, because it introduces dependencies among partitions and creates a single point of distributed synchronization in the system. A classic distributed transaction model does not take full advantage of the linear scalability of the partitioned topology offered by GigaSpaces EDG. The recommended approach is to break transactions into small, loosely coupled services, each of which can be resolved within a single partition. Each partition can then maintain transactional integrity using local transactions. This model keeps the system in a consistent state in partial failure scenarios.
  • How is transactional integrity with the database maintained?
    As noted above, distributed transactions can introduce a severe performance and scalability bottleneck, especially when the database is the system of record. In addition, executing transactions against the database violates one of the core principles behind the GigaSpaces Persistence as a Service (PaaS) approach: asynchronous updates to the database. To avoid this overhead, the GigaSpaces in-memory data grid (IMDG) ensures that transactions are resolved purely in memory and are sent to the database in a single batch. If the update to the database fails, the system retries the operation until it succeeds.
  • What types of queries are supported in GigaSpaces EDG?
    • Template matching - matches objects based on class name, class hierarchy, and attribute values
    • SQL - supports range queries, 'like' semantics, etc.
    • Continuous queries - through a combination of notification and SQL
    • Parallel query (a.k.a. Map/Reduce) - queries that are not directed to a specific partition are automatically broadcast to all partitions, and the results are implicitly aggregated on the client side
    • Iterator - iterates through a large result set
    Code snippets of the different query APIs are available here
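
    As an illustration of the first query type, the sketch below implements template matching over a plain list. It assumes that an unset (`None`) attribute in the template acts as a wildcard and that a template also matches subclasses of its own class; the classes `Order` and `LimitOrder` and the functions `matches` and `read_multiple` are invented for the example and are not the GigaSpaces API.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Order:
    symbol: Optional[str] = None
    side: Optional[str] = None
    quantity: Optional[int] = None


@dataclass
class LimitOrder(Order):
    limit_price: Optional[float] = None


def matches(template, candidate):
    """True if candidate is of the template's class (or a subclass) and
    agrees with every attribute the template actually sets."""
    if not isinstance(candidate, type(template)):
        return False
    for field in fields(template):
        wanted = getattr(template, field.name)
        if wanted is not None and getattr(candidate, field.name) != wanted:
            return False
    return True


def read_multiple(space, template):
    return [entry for entry in space if matches(template, entry)]


space = [
    Order("IBM", "BUY", 100),
    LimitOrder("IBM", "SELL", 50, 120.5),
    Order("MSFT", "BUY", 10),
]

ibm_orders = read_multiple(space, Order(symbol="IBM"))
limit_orders = read_multiple(space, LimitOrder())
buy_orders = read_multiple(space, Order(side="BUY"))
```

    The template `Order(symbol="IBM")` also returns the `LimitOrder` entry, which is the class-hierarchy part of the matching rule, while `LimitOrder()` returns only entries of that subclass.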

  • This model relies heavily on partitioning. How do I handle queries that need to span multiple partitions?
    Aggregated queries are executed in parallel on all partitions. You can combine this model with stored procedure-like queries to perform more advanced manipulations, such as sum and max. See more details below.
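
    The scatter-gather pattern behind such aggregated queries can be sketched as follows. The partitions are plain lists, a thread pool stands in for the remote nodes, and the function names are hypothetical. Each partition computes a partial sum, max, and count over its own data, and the client reduces the partial results. The sketch assumes that no partition is empty.

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions, each holding a disjoint subset of the accounts.
partitions = [
    [{"account": 1, "balance": 120}, {"account": 5, "balance": 40}],
    [{"account": 2, "balance": 300}],
    [{"account": 3, "balance": 75}, {"account": 7, "balance": 15}],
]


def partial_aggregate(partition):
    """Runs next to the data: summarizes one partition only."""
    balances = [row["balance"] for row in partition]
    return {"sum": sum(balances), "max": max(balances), "count": len(balances)}


def reduce_partials(partials):
    """Runs on the client: combines the per-partition summaries."""
    return {
        "sum": sum(p["sum"] for p in partials),
        "max": max(p["max"] for p in partials),
        "count": sum(p["count"] for p in partials),
    }


def aggregate(all_partitions):
    # Fan the query out to every partition in parallel, then reduce.
    with ThreadPoolExecutor(max_workers=len(all_partitions)) as pool:
        partials = list(pool.map(partial_aggregate, all_partitions))
    return reduce_partials(partials)


result = aggregate(partitions)
```

    Only the small per-partition summaries travel back to the client, not the rows themselves, which is what keeps this kind of query cheap as the data set grows.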
  • What about stored procedures and prepared statements?
    Because the data is stored in memory, we avoid the use of a proprietary language for stored procedures. Instead, we can use either native Java/.NET/C++ or dynamic languages, such as Groovy and JRuby, to manipulate the data in memory. The IMDG provides native support for executing dynamic languages, routes the query to where the data resides, and aggregates the results back to the client. A reducer can be invoked on the client side to perform second-level aggregation. A code example that illustrates how this works can be found here. You may also review the Scripting documentation.
  • Can these prepared statements and stored procedure equivalents be changed without bringing down the data?
    Yes. When you change a script, it is reloaded on the server while the server is running, without bringing down the data. The same capability is available when you need to refresh the code of collocated services on the server side.
  • Do I need to change my application code to use GigaSpaces EDG?
    In some cases, introducing the GigaSpaces EDG in-memory data grid is completely seamless; in others, a rewrite is required. It depends on the programming model:

    • Hibernate 2nd level cache
      Nature of integration with GigaSpaces EDG: Seamless
      Comments/limitations: Best fit for read-mostly applications. Limited performance gain, as it still relies heavily on the underlying database.

    • JDBC
      Nature of integration with GigaSpaces EDG: Seamless, but limited
      Comments/limitations: SQL commands written to the in-memory data grid are guaranteed to run with other JDBC resources. Doesn't support full SQL-92, so existing applications may require code changes. Recommended for monitoring and administration. Not recommended for application development, as it introduces unnecessary O/R mapping complexity.

    • HashMap
      Nature of integration with GigaSpaces EDG: Seamless
      Comments/limitations: Extensions such as timeout and transaction support are available.

    • GigaSpaces Spring DAO
      Nature of integration with GigaSpaces EDG: Partially seamless
      Comments/limitations: Abstracts transaction handling from the code. The domain model is based on POJOs and therefore doesn't require explicit changes, only annotations (which can also be provided through an external XML file). If the application already uses a DAO pattern, only the DAO needs to change, which narrows the scope of the changes required to use an IMDG-specific interface. This option is highly recommended for best performance and scalability.

  • What topologies are supported by GigaSpaces EDG?
    Replicated (synchronous or asynchronous), partitioned, partitioned-with-backup.
    See details here
  • Does code need to be changed when switching from one topology to another?
    No. The topology is abstracted from the application code. The only caveat is that your code needs to be implemented with partitioning in mind: moving from a central server or a replicated topology to a partitioned one requires no code changes, as long as your data includes an attribute that acts as a space routing index.
  • How are in-memory data grids (IMDG) and Persistence-as-a-Service (PaaS) different from in-memory databases (IMDB)?
    An IMDG stores objects in memory while still maintaining a relational model. Because storage is in memory, there is no need for an object-relational mapping (ORM) layer. In addition, no separate language is needed for data manipulation: native application code or dynamic languages can be used instead.

    Moreover, one of the fundamental problems with in-memory databases is that relational SQL semantics are not geared toward distributed data models. For example, an application that was written for a central server and relies on Join statements (which often depend on references among tables) or on aggregate queries such as Sum and Max does not map well to a distributed data model. This is why many IMDB implementations support only very basic topologies and often require significant changes to the data schema and application code. This lack of transparency reduces the motivation for using in-memory relational databases.

    The GigaSpaces in-memory data grid implementation exposes a JDBC interface and provides SQL query support. Applications can therefore benefit from the best of both worlds: you can read and write objects directly through the GigaSpaces API, query those same objects using SQL semantics, and view and manipulate the entire data set using regular database viewers.

  • Can I use existing Hibernate mapping to map data from the database to the GigaSpaces in-memory data grid (IMDG)?
    Yes. In addition, with GigaSpaces' Persistence-as-a-Service (PaaS) feature, Hibernate mapping overhead is significantly reduced, as most of it happens in the background, during initial load or during the asynchronous update to the database.

    Further information about Hibernate support is available here