Enterprise Data Grid Tutorial A - Basic Topologies




This tutorial contains a client application that runs on GigaSpaces 6.0. You must have GigaSpaces 6.0 installed before proceeding. You can download the product here.
Both Java and .NET implementations are provided. Instructions that apply to only one of them are marked (Java) or (.NET).

Overview


Different applications might have different caching requirements. Some applications require on-demand loading from a remote cache, due to limited memory; others use the cache for read-mostly purposes; transactional applications need a cache that handles both write and read operations and maintains consistency.

In order to address these different requirements, GigaSpaces provides an In-Memory Data Grid that is policy-driven. Most of the policies do not affect the actual application code, but rather affect the way each Data Grid instance interacts with other instances. The policies allow the Data Grid to be configured in almost any topology; most common topologies are predefined in the GigaSpaces product and do not require editing policies.

In this tutorial, you will use GigaSpaces to implement a simple application that writes and retrieves user accounts from the GigaSpaces In-Memory Data Grid, clustered in the most common topologies - replicated, partitioned, master-local and local-view. The application will either actively read data or ask to be notified when data is written to or modified in the Data Grid.

GigaSpaces Data Grid - Basic Terms

  • Data Grid instance - an independent data storage unit, also called a cache. The Data Grid is comprised of all the Data Grid instances running on the network.

  • Space - a distributed, shared, memory-based repository for objects. A space runs in a space container - this is usually transparent to the developer. In GigaSpaces each Data Grid instance is implemented as a space, and the Data Grid is implemented as a cluster of spaces organized in one of several predefined topologies.

  • Grid Service Container - a generic container that can run one or more space instances (together with their space containers) and other services. This container is launched on each machine that participates in the Data Grid, and hosts the Data Grid instances.

  • Replication - a relationship in which data is copied between two or more Data Grid instances, with the aim of having the same data in some or all of them.

  • Synchronous replication - replication in which applications using the Data Grid are blocked until their changes are propagated to all Data Grid instances. This guarantees that everyone sees the same data, but reduces performance.
  • Asynchronous replication - replication in which changes are propagated to Data Grid instances in the background; applications do not have to wait for their changes to be propagated. Asynchronous replication does not slow the application down, but changes are not instantly available to everyone.
  • Partitioning - new data or operations on data are routed to one of several Data Grid instances (partitions). Each Data Grid instance holds a subset of the data, with no overlap. Partitioning is done according to an index field in the data - operations are routed to partitions based on the value of this field.

  • Topology - a specific configuration of Data Grid instances. For example, a replicated topology is a configuration in which some or all Data Grid instances replicate data between them. In GigaSpaces, Data Grid topologies are defined by cluster policies (explained in the following section).
  • Reading - one way to retrieve data from the Data Grid, which will be used in this tutorial, is to call the space read operation, supplying a read template object which specifies what needs to be read.
  • Notifications - GigaSpaces allows applications to be notified when changes are made to objects in the Data Grid. Applications register in advance to be notified about specific events. When these events occur, a notification is triggered on the application, which delivers the actual data that triggered the event.
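
The difference between synchronous and asynchronous replication, as defined above, can be illustrated with a minimal plain-Java sketch. It does not use the GigaSpaces API: GridInstance, ReplicationGroup and drainPending are hypothetical names, and an in-memory queue stands in for the background replication channel.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for a Data Grid instance: a plain in-memory map.
class GridInstance {
    final Map<Integer, String> data = new HashMap<>();
}

// Illustrative replication group; not a GigaSpaces class.
class ReplicationGroup {
    private final GridInstance primary = new GridInstance();
    private final List<GridInstance> replicas = new ArrayList<>();
    private final Queue<Runnable> pending = new ArrayDeque<>();

    ReplicationGroup(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            replicas.add(new GridInstance());
        }
    }

    // Synchronous: the caller returns only after every replica holds the change.
    void writeSync(int key, String value) {
        primary.data.put(key, value);
        for (GridInstance replica : replicas) {
            replica.data.put(key, value);
        }
    }

    // Asynchronous: the caller returns at once; the change is queued for later.
    void writeAsync(int key, String value) {
        primary.data.put(key, value);
        for (GridInstance replica : replicas) {
            pending.add(() -> replica.data.put(key, value));
        }
    }

    // Stands in for the background work that propagates queued changes.
    void drainPending() {
        while (!pending.isEmpty()) {
            pending.poll().run();
        }
    }

    String readFromReplica(int replicaIndex, int key) {
        return replicas.get(replicaIndex).data.get(key);
    }
}

class ReplicationDemo {
    public static void main(String[] args) {
        ReplicationGroup group = new ReplicationGroup(2);
        group.writeAsync(1, "pending");
        System.out.println("before drain: " + group.readFromReplica(0, 1));
        group.drainPending();
        System.out.println("after drain: " + group.readFromReplica(0, 1));
    }
}
```

In the sketch, a reader of a replica sees a synchronous write immediately, but sees an asynchronous write only after the queued changes have been applied.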

GigaSpaces Clustering Concepts

In GigaSpaces, a cluster is a grouping of several spaces running in one or more containers. For an application trying to access data, the cluster appears as one space, but in fact consists of several spaces which may be distributed across several physical machines. The spaces in the cluster are also called cluster members.

A cluster group is a logical collection of cluster members, which defines how these members interact. The only way to define relationships between clustered spaces in GigaSpaces is to add them to a group and define policies for that group. A cluster can contain several, possibly overlapping groups, each of which defines some relations between some cluster members. This makes cluster configuration very flexible.

A GigaSpaces cluster group can have one or more of the following policies:

  • Replication Policy - defines replication between two or more spaces in the cluster, and replication options such as synchronous/asynchronous and replication direction.
  • Load Balancing Policy - because user requests are submitted to the entire cluster, there is a need to distribute the requests between cluster members. The load balancing policy defines an algorithm according to which requests are routed to different members. For example, in a replicated topology, requests are divided evenly between cluster members; in a partitioned topology they are routed according to the partitioning key.
  • Failover Policy - defines what happens when a cluster member fails. Operations on the cluster member can be transparently routed to another member in the group, or to another cluster group.

A cluster schema is an XML file which defines a cluster - the cluster name, which spaces are included in the cluster, which groups are defined on them, and which policies are defined for each group. GigaSpaces provides predefined cluster schemas for all common cluster topologies. Each topology is a certain combination of replication, load balancing and failover policies.

Data Grid Topologies Shown in this Tutorial

Replicated (view diagram)
  Description: Two or more space instances with replication between them.
  Common use: Allowing two or more applications to work with their own dedicated data store, while working on the same data as the other applications.
  Options:
  • Replication can be synchronous (slower but guarantees consistency) or asynchronous (fast but less reliable, as it does not guarantee identical content).
  • Space instances can run within the application (embedded - allows faster read access) or as a separate process (remote - allows multiple applications to use the space, easier management).
  • In this tutorial: two remote spaces, synchronous replication.

Partitioned (view diagram)
  Description: Data and operations are split between two spaces (partitions) according to an index field defined in the data. An algorithm, defined in the Load-Balancing Policy, maps values of the index field to specific partitions.
  Common use: Allows the In-Memory Data Grid to hold a large volume of data, even if it is larger than the memory of a single machine, by splitting the data into several partitions.
  Options:
  • Several routing algorithms to choose from.
  • With/without backup space for each partition.
  • In this tutorial: two spaces, hash-based routing, with backup.

Master-Local (view diagram)
  Description: Each application has a lightweight, embedded cache, which is initially empty. The first time data is read, it is loaded from a master cache to the local cache (lazy load); the next time the same data is read, it is loaded quickly from the local cache. Later on, data is either updated from the master or evicted from the cache.
  Common use: Boosting read performance for frequently used data. A useful rule of thumb is to use a local cache when over 80% of all operations are read operations.
  Options:
  • The master cache can be clustered in any of the other topologies: replicated, partitioned, etc.
  • In this tutorial: the master cache comprises two spaces in a partitioned topology.

Local-View (view diagram)
  Description: Similar to master-local, except that data is pushed to the local cache. The application defines a filter, using a spaces read template or an SQL query, and data matching the filter is streamed to the cache from the master cache.
  Common use: Achieving maximal read performance for a predetermined subset of data.
  Options:
  • The master cache can be clustered in any of the other topologies: replicated, partitioned, etc.
  • In this tutorial: the master cache comprises two spaces in a partitioned topology.

The topologies above are provided in the GigaSpaces product as predefined cluster schemas. Schemas can be found in <GigaSpaces Root>\config\schemas. The schema names are:

  • Synchronous replication - sync_replicated-cluster-schema.xsl
  • Partitioned with backup - partitioned-sync2backup-cluster-schema.xsl
    The master-local and local-view topologies do not need their own schemas, because the local cache is defined on the client side.
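
The lazy-load behavior of the master-local topology described above can be sketched in plain Java. This is an illustration only and does not use the GigaSpaces API: MasterCache, LocalCache and the remoteReads counter are hypothetical names, and a HashMap stands in for each cache.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical master cache that counts how often it is read.
class MasterCache {
    private final Map<Integer, String> data = new HashMap<>();
    int remoteReads = 0;

    void write(int key, String value) {
        data.put(key, value);
    }

    String read(int key) {
        remoteReads++;
        return data.get(key);
    }
}

// Illustrative local cache: starts empty and loads entries on first read.
class LocalCache {
    private final MasterCache master;
    private final Map<Integer, String> local = new HashMap<>();

    LocalCache(MasterCache master) {
        this.master = master;
    }

    String read(int key) {
        String value = local.get(key);
        if (value == null) {
            value = master.read(key);      // cache miss: lazy load from the master
            if (value != null) {
                local.put(key, value);     // later reads are served locally
            }
        }
        return value;
    }

    // Removes an entry, for example when it is stale or memory is short.
    void evict(int key) {
        local.remove(key);
    }
}

class LocalCacheDemo {
    public static void main(String[] args) {
        MasterCache master = new MasterCache();
        master.write(1, "account-1");
        LocalCache cache = new LocalCache(master);
        cache.read(1);
        cache.read(1);
        System.out.println("reads that reached the master: " + master.remoteReads);
    }
}
```

Only the first read of a given key reaches the master in this sketch; an evicted entry is loaded from the master again on its next read.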

Deploying the Data Grid


Now that you have a little background about the GigaSpaces Data Grid and the topologies used in this tutorial, the first step is to deploy the Data Grid.

To deploy the Data Grid instances, you will first launch two GigaSpaces Grid Service Containers (generic containers that can run Data Grid instances) on the same machine. Each container will host one cluster node. In real life, each cluster node usually runs on a different physical machine.

Then, using the GigaSpaces Management Center (GS-UI), you will launch two spaces, clustered together according to one of the Data Grid topologies discussed above.

Start by choosing the Data Grid topology that interests you most, and launching it using the instructions below. After you start the client application and test this topology (as described in the following sections), you can return to this section, deploy another topology, and try it out as well.


To run the Grid Service Containers:

  1. Start a Grid Service Manager, which manages the containers, by executing <GigaSpaces Root>\bin\gsm.bat (or .sh).
  2. Start a Grid Service Container by executing <GigaSpaces Root>\bin\gsc.bat (or .sh).
  3. Start another Grid Service Container by executing gsc.bat (or .sh) again.

To deploy the Data Grid:

  1. Wait until the two Grid Service Containers finish loading and register with the Grid Service Manager. When this has happened, the execution window of each container (GSC) shows the following message:
    [time]
    INFO: Registered to a ProvisionManager
    [time]
    CONFIG [com.gigaspaces.grid.gsc]: Loading [0] initialServices
    

  2. Start the GS-UI, by executing <GigaSpaces Root>\bin\gs-ui.bat (or .sh).
  3. From the tabs on the left, select Deployments, Details.
  4. On the toolbar at the top, click the New Deployment button. (This button is enabled only if a Grid Service Manager is detected, and at least one Grid Service Container is detected and registered to the Grid Service Manager.) This is how you deploy a new space or cluster on the GigaSpaces containers.



    The Deployment Wizard is displayed:



    Select Enterprise Data Grid as shown above and click the Next button.
    The following page showing the Data Grid attribute fields is displayed:



  5. In the Data Grid Name field, type the name myDataGrid as shown above. This name represents the Data Grid you are deploying in the GS-UI. This name will be given to all spaces in the cluster. Remember this space name - you will use it when running the client application and connecting to the Data Grid.
  6. In the Space Schema field, leave the space schema as default. This field allows you to specify whether the space instances in the cluster should be persistent (data automatically persisted to a database) or not. You will not use persistency in this tutorial.
  7. On this page of the wizard, define the Data Grid topology by filling in the Cluster Info area. Do one of the following:
    • If you want to deploy the Data Grid in a replicated topology, select the sync_replicated option from the Cluster schema drop-down menu. This option uses the sync_replicated-cluster-schema, which has synchronous replication between all cluster members. This option refers to a single space or a cluster of spaces (in one of several common topologies) with no backup.
      • Select the number of spaces (Data Grid instances) in your replicated cluster. Deploy a cluster with 2 spaces, by typing the number 2 into Number of Instances field.
        The following shows the settings for the replicated topology:



    • If you want one of the other topologies (partitioned, master-local or local-view), select the partitioned option from the Cluster schema drop-down menu. This option refers to a single space with a backup, or a partitioned cluster of spaces with backups.
      • Select the number of partitions: specify two partitions by typing 2 into the Number of Instances field. On its own, this option uses the partitioned-cluster-schema. Then specify one backup for each partition by typing 1 into the Number of backups field. When a partitioned cluster has backups, the cluster schema used is the partitioned-sync2backup-cluster-schema.
        The following shows the settings for the partitioned (with backup) topology:



    • For both topologies, select a Grid Service Manager (GSM) for deployment from the table at the bottom of the page.
      The table might include more than one Grid Service Manager. If so, look for the specific manager you launched - you can find it according to the Machine field (look for the machine on which you ran the Grid Service Manager). Click your Grid Service Manager to select it.



  8. Click Deploy to deploy the cluster. The deployment status is displayed (shown here for the two replicated Data Grid instances):



    In the master-local and local-view topologies, the master cache can in principle be clustered in any topology - partitioned, replicated, etc. (or can be a single space). The master-local/local-view aspect of the topology is specified on the client side: when the client connects to the cluster or space (the master cache), it specifies if it wants to start a local cache and how this cache should operate.



    Depending on the type of deployment you performed, you should see that either two spaces (two replicated Data Grid instances) or four spaces (two Data Grid partitions with one backup each) were provisioned to the host running the Grid Service Containers.
  9. If this is not the first topology you are deploying, and you are already familiar with the client application, skip to Running Client, Testing Notifications and Verifying Topologies.

    You deployed the Data Grid using the GS-UI and its Deployment Wizard. An alternative way to deploy is to start the cluster manually, by executing the gsInstance script (<GigaSpaces Root>\bin\gsInstance.bat or .sh). Manual deployment requires the use of Space URLs, which might take different arguments for different topologies.

    For more details on deploying a cluster manually, refer to Space URL.


The Client Application


In this tutorial, we provide a sample application that consists of the following components:

  • A Data Loader that writes data to the Data Grid.
  • A Simple Reader that reads data directly from the Data Grid (using spaces read).
  • A Notified Reader that registers for notifications on the Data Grid and is notified when data is written by the Data Loader.
    You can run one or more readers of either or both types.
  • An Account object, defined as a POJO (Java) or PONO (.NET), which represents the data in the Data Grid. It has the following fields: userName, accountID and balance.

Getting Source Code and Full Client Package

The source code of all three components, and the scripts used to run them, remains the same for all Data Grid topologies described above. To view the source code, use the links below:

The full Java client package including execution scripts is included, together with other GigaSpaces examples and tutorials, in the GigaSpaces examples ZIP. Find the client package for this tutorial at <GigaSpaces Root>\examples\Tutorials\Data_Grid\Topologies.

The full .NET client package can be found at the following path: <GigaSpaces Root>\dotnet\examples\Data_Grid. If you don't see this path, it is because the dotnet directory is initially zipped when you download the product. Extract the ZIP file in the dotnet directory into <GigaSpaces Root>\dotnet, then look for this tutorial's client package under <GigaSpaces Root>\dotnet\examples\Data_Grid.

Client Operating Process (In Brief)

  1. When you run the Data Loader, it:
    • Connects to the Data Grid and clears all data from it.
    • Creates a new Account object, with a certain userName and accountID. The Account also has a balance (Java) or Balance (.NET) field, which is obtained by calculating accountID*10 (Java) or AccountID*10 (.NET).
    • Writes 100 Account instances with IDs 1 through 100 to the Data Grid, using JavaSpaces write.
  2. When you run a Simple Reader, it reads all the Account instances in the Data Grid, then reads them again every few seconds, until you close it.
  3. When you run a Notified Reader, it registers for notification on the Account class, and starts listening for notifications. When Account objects are written to the Data Grid, the Notified Reader immediately receives notifications from the Data Grid. The notifications include the Account objects themselves.
  4. If you run more 'Simple Readers' or 'Notified Readers', they repeat step 2 or 3 above, respectively.
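
The Data Loader steps above can be sketched in plain Java. This is not the tutorial's source code: a List stands in for the Data Grid, the field types are assumptions, and the "user" + id naming scheme is invented for illustration. Only the ID range (1 through 100) and the balance rule (accountID*10) come from the description above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative data object; the field types are assumptions.
class Account {
    private final String userName;
    private final int accountID;
    private final int balance;

    Account(String userName, int accountID, int balance) {
        this.userName = userName;
        this.accountID = accountID;
        this.balance = balance;
    }

    String getUserName() { return userName; }
    int getAccountID() { return accountID; }
    int getBalance() { return balance; }
}

// Sketch of the Data Loader steps, with a List standing in for the Data Grid.
class DataLoaderSketch {
    static void load(List<Account> grid) {
        grid.clear();                                  // clear any existing data
        for (int id = 1; id <= 100; id++) {
            // "user" + id is a made-up name; the balance follows accountID*10.
            grid.add(new Account("user" + id, id, id * 10));
        }
    }

    public static void main(String[] args) {
        List<Account> grid = new ArrayList<>();
        load(grid);
        System.out.println(grid.size() + " accounts written");
    }
}
```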

How the Client Application Connects to the Data Grid

The application connects to the space using the GigaSpaces SpaceFinder.find() (Java) or SpaceProxyProviderFactory.Instance.FindSpace(spaceUrl) (.NET) method. This is a method that accepts a space URL, discovers the space, and returns a proxy that allows the application to work with the space. The URL is usually not defined in the client application itself, but is supplied to it as an argument when it is started.

In this tutorial, we will use a space connection URL similar to the following:

jini://*/*/myDataGrid

  • This URL uses the Jini protocol, which enables dynamic discovery of the space (the client does not need to know which machines are participating in the Data Grid).
  • */*/myDataGrid specifies that the client wants to connect to a cluster in which all the spaces are called myDataGrid, regardless of which physical machines participate in the cluster.
  • useLocalCache is an additional parameter, not shown above, which launches a local cache in the connecting application. This is necessary for the master-local and local-view topologies.

The URL above is used by the application to connect to the space (a cluster of spaces in this case), so it is called a space connection URL. This should not be confused with a space start URL, a similar form of URL which can be used to start a space. In this tutorial, you will not use a space start URL; instead, you start the spaces using the GS-UI, as described above.
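
To make the parts of the connection URL concrete, here is a minimal plain-Java sketch that splits such a URL into its components. SpaceUrlParts is a hypothetical helper, not part of the GigaSpaces API. Two assumptions are made: that the two wildcards stand for the host and the container name, and that parameters such as useLocalCache are appended after a `?` and separated by `&`.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative parser; not part of the GigaSpaces API.
class SpaceUrlParts {
    final String protocol;
    final String host;        // assumed meaning of the first path segment
    final String container;   // assumed meaning of the second path segment
    final String space;
    final List<String> parameters;

    private SpaceUrlParts(String protocol, String host, String container,
                          String space, List<String> parameters) {
        this.protocol = protocol;
        this.host = host;
        this.container = container;
        this.space = space;
        this.parameters = parameters;
    }

    static SpaceUrlParts parse(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("Missing protocol: " + url);
        }
        String protocol = url.substring(0, schemeEnd);
        String rest = url.substring(schemeEnd + 3);

        // Split off optional parameters (assumed to follow a '?').
        List<String> parameters = Collections.emptyList();
        int queryStart = rest.indexOf('?');
        if (queryStart >= 0) {
            parameters = Arrays.asList(rest.substring(queryStart + 1).split("&"));
            rest = rest.substring(0, queryStart);
        }

        String[] path = rest.split("/");
        if (path.length != 3) {
            throw new IllegalArgumentException("Expected three path segments: " + url);
        }
        return new SpaceUrlParts(protocol, path[0], path[1], path[2], parameters);
    }
}

class SpaceUrlDemo {
    public static void main(String[] args) {
        SpaceUrlParts parts = SpaceUrlParts.parse("jini://*/*/myDataGrid?useLocalCache");
        System.out.println(parts.protocol + " -> " + parts.space + " " + parts.parameters);
    }
}
```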

How Notifications Work

In a GigaSpaces Data Grid, applications can ask to be notified when changes are made to objects in the Data Grid. A request for notification has two components: a template and a mask:

  • The template specifies the class type and attribute values the application is interested in.
  • The mask (also called NotifyActionType in Java or DataEventType in .NET) specifies which events the application wants to be notified about - new data written to the Data Grid, data taken from the Data Grid, and so on.

GigaSpaces provides a mechanism that handles this process without requiring remote calls. The mechanism works as follows:

  1. The application instantiates an EventSessionFactory (Java) or IDataEventSession (.NET) and connects it to the space.
  2. The application creates an EventSessionConfig object, which can specify different options for the notification.
  3. The application passes the configuration object to the EventSessionFactory (Java) or CreateDataEventSession (.NET) method, and gets a DataEventSession (Java) or IDataEventSession (.NET).
  4. The application uses the DataEventSession.addListener (Java) or IDataEventSession.addListener (.NET) method to generate an EventRegistration (Java) or IEventRegistration (.NET) - the object that actually receives the notifications from the space. The addListener method accepts the notification template, the NotifyActionType (Java) or DataEventType (.NET) and other parameters, but most importantly the listener object.
    (Java) The listener object has a user-defined notify method, which is fired when a notification is received.
    (.NET) The listener object is an EventHandler which is fired when a notification is received.
  5. When the Data Grid detects relevant operations, it contacts the EventRegistration (Java) or IEventRegistration (.NET).
    (Java) EventRegistration then fires the notify() method on the listener.
    (.NET) IEventRegistration then fires the listener.

For every relevant space event, the Data Grid provides an object of type EntryArrivedRemoteEvent (Java) or SpaceDataEventArgs (.NET), which contains information about the event that occurred (e.g. new data was written), and also the actual object that was involved (e.g. the object that was written). The notify() method (Java) or the code (.NET) implemented in the listener can extract the object and perform operations on it.
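
The template-and-mask idea described above can be illustrated with a minimal plain-Java sketch. It does not use the GigaSpaces event session classes: NotifyingGrid, EventKind and AccountListener are hypothetical names, and the grid is a simple in-memory list. The sketch treats a null template field as a wildcard, which is the matching rule used by JavaSpaces templates.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the notification mask.
enum EventKind { WRITE, TAKE }

// Illustrative data object; null fields act as wildcards in a template.
class Account {
    final String userName;
    final Integer accountID;

    Account(String userName, Integer accountID) {
        this.userName = userName;
        this.accountID = accountID;
    }
}

// Hypothetical listener interface; the callback receives the object itself.
interface AccountListener {
    void onEvent(EventKind kind, Account account);
}

// In-memory stand-in for a Data Grid that supports notifications.
class NotifyingGrid {
    private static final class Registration {
        final Account template;
        final EventKind mask;
        final AccountListener listener;

        Registration(Account template, EventKind mask, AccountListener listener) {
            this.template = template;
            this.mask = mask;
            this.listener = listener;
        }
    }

    private final List<Account> data = new ArrayList<>();
    private final List<Registration> registrations = new ArrayList<>();

    // Registers interest in events of one kind on objects matching the template.
    void addListener(Account template, EventKind mask, AccountListener listener) {
        registrations.add(new Registration(template, mask, listener));
    }

    void write(Account account) {
        data.add(account);
        fire(EventKind.WRITE, account);
    }

    private void fire(EventKind kind, Account account) {
        for (Registration r : registrations) {
            if (r.mask == kind && matches(r.template, account)) {
                r.listener.onEvent(kind, account);
            }
        }
    }

    // A candidate matches when every non-null template field is equal.
    static boolean matches(Account template, Account candidate) {
        return (template.userName == null || template.userName.equals(candidate.userName))
            && (template.accountID == null || template.accountID.equals(candidate.accountID));
    }
}

class NotificationDemo {
    public static void main(String[] args) {
        NotifyingGrid grid = new NotifyingGrid();
        grid.addListener(new Account(null, null), EventKind.WRITE,
            (kind, account) -> System.out.println(kind + " " + account.accountID));
        grid.write(new Account("user1", 1));
    }
}
```

A template with all fields set to null registers for every Account, which corresponds to registering for notifications on the Account class as the Notified Reader does.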

Here is how the Notified Reader registers for notifications:

And here is the callback method invoked when the application is notified:

The Data - Defined as a POJO (Java) or PONO (.NET)

In this tutorial all the objects written to the space instances, which make up the Data Grid, are Plain Old Java Objects - POJOs (Java) or Plain Old .NET Objects - PONOs (.NET). This is in contrast to the tutorials in the Parallel Processing Track of this Quick Start Guide, in which objects written to the space implement the Entry class, as in the JavaSpaces standard.

To demonstrate use of POJOs (Java) or PONOs (.NET), the Account class is implemented with private fields, and with set/get methods (Java) or Properties (.NET) for each field, which enable the space to read and write the field value. For example:
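
A minimal sketch of such a class in Java follows. It is not the tutorial's actual source: the field names come from the description above, but the field types (String and Integer) and the presence of a public no-argument constructor are assumptions.

```java
// Illustrative POJO; field types are assumptions, not the tutorial's source.
class Account {
    private String userName;
    private Integer accountID;
    private Integer balance;

    // No-argument constructor, so the object can be created reflectively.
    public Account() {
    }

    public String getUserName() {
        return userName;
    }

    public void setUserName(String userName) {
        this.userName = userName;
    }

    public Integer getAccountID() {
        return accountID;
    }

    public void setAccountID(Integer accountID) {
        this.accountID = accountID;
    }

    public Integer getBalance() {
        return balance;
    }

    public void setBalance(Integer balance) {
        this.balance = balance;
    }
}

class AccountPojoDemo {
    public static void main(String[] args) {
        Account account = new Account();
        account.setUserName("user7");
        account.setAccountID(7);
        account.setBalance(account.getAccountID() * 10);
        System.out.println(account.getUserName() + " balance=" + account.getBalance());
    }
}
```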

Index Field for Partitioning

Inside the Account object, one of the data fields is defined as a routing index field for the purposes of partitioning. If this object is used in a Data Grid deployed in a partitioned topology, the routing index field is used to distribute data between the Data Grid instances, and to retrieve data from the relevant Data Grid instance when it is read.

In this tutorial, the routing index field is AccountID, and the partitioning algorithm is a hash. This means the operations on accounts are distributed evenly, based on the AccountID, between the Data Grid instances. You deployed two spaces (Data Grid instances), so all the operations on half the accounts - those with even IDs - go to one space, and all operations on the other half - with the odd IDs - go to the other space.
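
The even/odd split described above can be reproduced with a small plain-Java sketch. PartitionRouter is a hypothetical helper, and taking the hash code modulo the number of partitions is an assumption about how hash-based routing might work, not a statement of the exact GigaSpaces algorithm. It relies on the fact that the hash code of a Java Integer is the integer value itself.

```java
// Illustrative router; the modulo rule is an assumption, not the product's code.
class PartitionRouter {
    // Maps a routing value to a partition index in the range [0, partitionCount).
    static int partitionFor(Object routingValue, int partitionCount) {
        return Math.floorMod(routingValue.hashCode(), partitionCount);
    }
}

class PartitionRouterDemo {
    public static void main(String[] args) {
        int partitions = 2;
        int[] counts = new int[partitions];
        for (int accountID = 1; accountID <= 100; accountID++) {
            counts[PartitionRouter.partitionFor(accountID, partitions)]++;
        }
        // Even IDs land in partition 0 and odd IDs in partition 1.
        System.out.println("partition 0: " + counts[0] + ", partition 1: " + counts[1]);
    }
}
```

With two partitions and account IDs 1 through 100, this rule places 50 accounts in each partition, and a read for a given AccountID can be sent directly to the partition that holds it.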

Here is how AccountID is defined as the index field (Java) or property (.NET), inside the Account object - in Java, using annotations before the get/set methods; in .NET, using attributes before the property:

When using JDK 1.4, instead of using annotations, an Account.gs.xml file should be placed in a folder named config\mapping. The file should contain the following:

persist="false" replicate="false" fifo="false" >

For more information on using a gs.xml file instead of annotations (in Java), refer to gs.xml Mapping Elements.


Running Client, Testing Notifications and Verifying Data Grid Topologies


Now that you have started the Data Grid topology of your choice, you can run the client application, described in the previous section, verify that the Notified Reader receives notifications, and then test that the Data Grid topology is functioning as expected (for example, that data is really being replicated between the spaces).

Before you begin - download and compile the client application:

  1. If you haven't done so already, extract the client application.
    (.NET) If your <GigaSpaces Root>\dotnet folder contains a ZIP file, extract it.
  2. The client application package should appear at the following path:
    (Java) <GigaSpaces Root>\examples\Tutorials\Data_Grid\Topologies
    (.NET) <GigaSpaces Root>\dotnet\examples\Data_Grid
  3. Compile the client's source files by executing \bin\compile.bat (or .sh) from the example folder.

Select the topology you deployed from the tabs below.

What's Next?


Next tutorial:
How to query the space using SQL Query
Enterprise Data Grid Tutorial B - Aggregate Queries

Try Another Tutorial
GigaSpaces XAP Help Portal
GigaSpaces EDG Help Portal

IMPORTANT: This is an old version of GigaSpaces XAP. Click here for the latest version.