Failover Group - Cluster Schema


Overview

Failover is the mechanism used to route user operations to alternate spaces in case the target space (of the original operation) fails. Several space members can belong to a failover group, which then defines their failover policy.

The component responsible for failover in GigaSpaces is the clustered proxy. This component maintains a list of the spaces that belong to the failover group. When an operation on a space fails because the space member is unavailable, the clustered proxy tries to locate a live and accessible space member. If it finds such a space, it re-invokes the operation on that space member. If it doesn't find any live space member, it throws the original exception back to the user.
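
The retry behavior described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the GigaSpaces clustered proxy itself: SpaceMember, SimpleMember, SpaceUnavailableException and FailoverProxySketch are hypothetical names introduced here.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical types for illustration only; not part of the GigaSpaces API.
class SpaceUnavailableException extends RuntimeException {
    SpaceUnavailableException(String message) { super(message); }
}

interface SpaceMember {
    String name();
    boolean isAlive();
}

class SimpleMember implements SpaceMember {
    private final String name;
    private final boolean alive;

    SimpleMember(String name, boolean alive) {
        this.name = name;
        this.alive = alive;
    }

    public String name() { return name; }
    public boolean isAlive() { return alive; }
}

class FailoverProxySketch {
    private final List<SpaceMember> failoverGroup;

    FailoverProxySketch(List<SpaceMember> failoverGroup) {
        this.failoverGroup = failoverGroup;
    }

    // Invokes the operation on the target. If the target is unavailable,
    // re-invokes it on the first live member of the failover group.
    <R> R invoke(SpaceMember target, Function<SpaceMember, R> operation) {
        try {
            return operation.apply(target);
        } catch (SpaceUnavailableException original) {
            for (SpaceMember candidate : failoverGroup) {
                if (candidate != target && candidate.isAlive()) {
                    return operation.apply(candidate);
                }
            }
            // No live member was found: rethrow the original exception.
            throw original;
        }
    }
}
```

In this sketch the caller never sees the failure as long as one other member of the group is alive; only when the whole group is down does the original exception surface.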

A space cannot reside in different load-balancing and failover groups. In other words, the only way to apply both load-balancing and failover to a space is to define both policies for the one group to which the space belongs.

The status of clustered spaces can be viewed using different logging levels. For more details, refer to the Viewing Clustered Space Status section.

Failover with Blocking and Transactions

Even if a client invokes a blocking operation, such as take with a timeout, and the server fails after receiving the take arguments but before returning a result, the client gets an exception and the failover process commences. This is because in a clustered engine, unlike a non-clustered one, the server thread servicing the request waits until a final answer is given (unless the request times out).
If an operation is performed under a transaction and the target space that serviced the transaction has failed, the clustered proxy automatically aborts the transaction and throws a transaction exception back to the caller. (There is no point in re-invoking the operation on a different space, because the failed space member is a transaction participant. The transaction will ultimately be aborted by the transaction manager.) The caller catching the exception can start a new transaction and continue execution; later calls on the proxy are directed to an available space in the group.
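
On the caller side, the "catch the exception and start a new transaction" pattern might look like the following sketch. It does not use the real transaction API: TxAbortedException stands in for the transaction exception thrown by the proxy, and TransactionalWork and TransactionRetrySketch are hypothetical helpers introduced here.

```java
// Hypothetical stand-in for the transaction exception thrown by the proxy.
class TxAbortedException extends RuntimeException {
    TxAbortedException(String message) { super(message); }
}

// Hypothetical unit of work; each call is expected to begin its own
// new transaction, perform the space operations, and commit.
interface TransactionalWork<R> {
    R runInNewTransaction();
}

class TransactionRetrySketch {
    // Runs the work. If the transaction is aborted because its space failed,
    // starts over with a new transaction, up to maxAttempts times in total.
    static <R> R runWithRetry(TransactionalWork<R> work, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        TxAbortedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.runInNewTransaction();
            } catch (TxAbortedException e) {
                // The old transaction cannot be salvaged; the next attempt
                // begins a new one against whichever space is now available.
                last = e;
            }
        }
        throw last;
    }
}
```

The whole unit of work is repeated rather than the single failed operation, because the aborted transaction takes all of its earlier operations with it.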

Selective Failover Within the Group

Like a replication policy, a failover policy can also be selective, i.e. some spaces may not fail over to all other spaces. For each space, and for each operation (write, read, take and notify), you can define backup members to which space operations may fail over. The backup members are always included in the failover group.
Some of these backup members can be designated as backup only. This means that the space is not made available to clients, but waits for the master space to fail. This provides a stronger guarantee of availability.
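
One way to picture per-operation backup members is a lookup keyed by operation, with a default list for operations that have no explicit entry. The sketch below is an assumption-laden illustration: SpaceOperation and SelectiveBackupSketch are hypothetical names, and the member names are made up in a container:space style.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Hypothetical enumeration of the operations that can have their own backups.
enum SpaceOperation { WRITE, READ, TAKE, NOTIFY }

// Hypothetical holder for the backup members of a single source space.
class SelectiveBackupSketch {
    private final List<String> defaultBackups;
    private final Map<SpaceOperation, List<String>> perOperation =
            new EnumMap<>(SpaceOperation.class);

    SelectiveBackupSketch(List<String> defaultBackups) {
        this.defaultBackups = defaultBackups;
    }

    // Registers backup members for one specific operation.
    void setBackups(SpaceOperation operation, List<String> backups) {
        perOperation.put(operation, backups);
    }

    // Returns the backups for the operation, or the default list
    // if none were defined for it explicitly.
    List<String> backupsFor(SpaceOperation operation) {
        return perOperation.getOrDefault(operation, defaultBackups);
    }
}
```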

Failover to Alternate Groups

A failover group can name alternate groups as backups. This means that if the failover process commences and no live space is found from the original failover group, the clustered proxy searches for live spaces in the alternate groups.
If there is more than one alternate group, the proxy searches them according to a specific order defined in the failover policy. This allows you to set priorities between failover groups in the cluster.
The following figure depicts a cluster of four spaces, two of which reside in the "East" site and two in the "West" site:

One failover group contains the space members in the "East" and another contains the space members in the "West". If an operation is made on an "Eastern" space that is down, the clustered proxy in the "East" tries to fail over to another space member in the "East". If all spaces in the "East" are inaccessible, the proxy tries to find a live space member in the "West" group and, upon finding one, fails the operation over to that space. The reverse also applies: if the "Western" clustered proxy cannot fail over to a space in the "West", it searches for an available space in the "East".
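
The search order in this example (own group first, then the alternate groups in their configured priority order) can be sketched as follows. AlternateGroupSketch and the member names are hypothetical; the set of live members stands in for whatever liveness check the proxy actually performs.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical illustration of the search order; not GigaSpaces code.
class AlternateGroupSketch {
    // Returns the first live member of the primary group. If there is none,
    // searches the alternate groups in the given priority order.
    static Optional<String> findLiveMember(List<String> primaryGroup,
                                           List<List<String>> alternateGroupsInOrder,
                                           Set<String> liveMembers) {
        Optional<String> found = firstLive(primaryGroup, liveMembers);
        if (found.isPresent()) {
            return found;
        }
        for (List<String> group : alternateGroupsInOrder) {
            found = firstLive(group, liveMembers);
            if (found.isPresent()) {
                return found;
            }
        }
        return Optional.empty(); // nothing alive anywhere
    }

    private static Optional<String> firstLive(List<String> group, Set<String> live) {
        return group.stream().filter(live::contains).findFirst();
    }
}
```

With an "East" group of two members and "West" as its only alternate, the sketch returns an "East" member while one is alive and falls back to "West" only when both "East" members are down.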

Backup and Backup-Only

For each space in a failover group, you can define one or more "backup members." These are other spaces, in the same failover group, to which operations will be routed if the space fails. To activate this option, you must set the space's failover policy to Fail to Backup.
It is also possible to create dedicated backup spaces. If you define a backup space as Backup Only, it will not be directly accessible to users, and will stand ready to receive failed operations from other spaces in the failover group.
To learn more about backup settings and how to define them, see Creating a cluster.

Failover with Replication

While you do not have to define replication with failover, this is often necessary. The failover mechanism routes operations transparently to an available space. However, it does not ensure that the Entries the user wanted to operate on exist in that other space. If the user's operation is read or take, for example, it will not have the desired effect unless all the Entries from the master (target) space are precisely replicated in the slave space.
This can be achieved by defining both a replication policy and a failover policy for a certain group. This ensures that as long as the master space is live, its updates are replicated to the slave space, which is then ready to assume the role of failover slave.

Failover Schema Options

Each tag is listed with its description and, where one is defined, its default value.

<policy-type>
Description: The failover policy. Can be one of the following:
  • fail-in-group – specifies that if a space is down, the operation should be routed to a live space in the group, according to the load-balancing policy.
  • fail-to-backup – specifies that if a space is down, the operation will only be routed to one of the defined backup spaces. You may define a backup space per operation – Default, Write, Take and Notify (Default applies to operations you did not explicitly define).
  • fail-to-alternate – specifies that if a space is down, the operation should be routed to an available space in one of the alternate groups.
Default value: fail-in-group

<disable-alternate-group>
Description: Boolean value. Failover can happen into a space located in the same group as the source space or into a space located in another group. For example, a space that is part of a hash-based load-balancing group can fail over into a round-robin group. When this property is set to true, failover will not occur into an alternate group, but only to spaces located in the same group as the source space.
Default value: false

<backup-members>
Description: The backup members list.

<source-member>
Description: Source member in the format of container:space.

<backup-member>
Description: Backup member in the format of container:space.

<backup-members-only>
Description: Boolean value. This option allows you to specify that a space that serves as a backup member is dedicated to this function. "Backup only" spaces are not used for anything other than routed failover operations.

<alternate-groups>
Description: Allows you to define one or more alternate groups. These groups serve as backups for the focal group (the group you are defining). In other words, if all spaces in the focal group fail, the cluster proxy routes the operation to an available space in one of the alternate groups. The alternate groups are relevant not only when all of the focal group's spaces fail. They are also used if you specify a policy type of Immediately Fail to Alternate for a certain operation.

<fail-back>
Description: Boolean value. Controls whether a master moves into active mode after recovering from a failure. If this parameter is set to false, a recovered master space moves into standby mode after restarting and completing the recovery phase, and the existing backup space continues to serve the application. The master moves into active mode only when it is its turn to do so, i.e. when all other backups have failed. If this parameter is set to true, the master copy moves to active mode immediately after completing the recovery phase.
Default value: false

<fail-over-find-timeout>
Description: The amount of time [ms] the cluster proxy waits, after receiving no reply from an alternate group, before deciding it is unavailable and trying another one (or giving up, if there are no more alternate groups).
Default value: 2

<active-election>
Description: This block holds the <connection-retries>, <yield-time>, and <fault-detector> elements (below), and allows you to configure the active election mechanism. For more details on active election, refer to the Active Election and Avoiding Split-Brain Scenarios section.

<connection-retries>
Description: The number of connection retries to the Jini Lookup Service.
Default value: 60

<yield-time>
Description: Time to yield to other participants between every election phase (total 3 phases), before acquiring a primary or a backup state. See related warning below.
Default value: 1000 [ms] – each attempt lasts 1 second.

<fault-detector>
Description: The fault-detector exists in backup spaces, and is responsible for constantly checking (pinging) whether the primary space is alive. This block contains the <invocation-delay>, <retry-count>, and <retry-timeout> elements, which are described below.

<invocation-delay>
Description: The amount of time (in milliseconds) the fault-detector waits between each ping to the primary space.
Default value: 1000 [ms]

<retry-count>
Description: If the fault-detector suspects that the primary space failed, it first confirms the failure before beginning the whole active election procedure. retry-count is the number of times it performs these confirmation checks.
Default value: 3

<retry-timeout>
Description: The amount of time (in milliseconds) between the checks the fault-detector performs, if it suspects that the primary space failed.
Default value: 100 [ms]

Changing Default Active Election Configuration

By default, the maximum amount of time it takes for a space to perform failover is ~9 seconds.

If you change the default settings, you can calculate the maximum failover time using the formula below:

~100 [ms] + (yield-time * 7) + invocation-delay + (retry-count * retry-timeout) = failover

The initial ~100 [ms] accounts for network latency. Substitute the value you defined for each element into the formula above to calculate the overall failover time for your settings.
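
As a worked example, the formula can be written as a small Java helper. FailoverTimeSketch is a hypothetical name introduced here; the arithmetic is exactly the formula above. With the default values listed in the schema options (yield-time 1000, invocation-delay 1000, retry-count 3, retry-timeout 100), it gives 100 + 7000 + 1000 + 300 = 8400 [ms], i.e. about 8.4 seconds, just under the approximate default quoted above.

```java
// Hypothetical helper that evaluates the maximum failover time formula.
class FailoverTimeSketch {
    // Approximate network latency term from the formula, in milliseconds.
    static final long NETWORK_LATENCY_MS = 100;

    // All time arguments are in milliseconds.
    static long maxFailoverMillis(long yieldTimeMs, long invocationDelayMs,
                                  int retryCount, long retryTimeoutMs) {
        return NETWORK_LATENCY_MS
                + yieldTimeMs * 7
                + invocationDelayMs
                + retryCount * retryTimeoutMs;
    }
}
```

Calling FailoverTimeSketch.maxFailoverMillis(1000, 1000, 3, 100) evaluates the formula for the default settings; the yield-time term dominates the result because it is multiplied by 7.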

It is not recommended to change default values of elements in the active-election block (yield-time, retry-timeout, invocation-delay, retry-count), since this might cause split-brain scenarios.

Change the default settings only if you have a special need that requires failover to complete in 1 second or less. In this case:

  • Do not reduce the yield-time element to less than 200 [ms].
  • Take into account that reducing the invocation-delay and retry-timeout values, or increasing the retry-count value, might increase network load.
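
To see why these values affect network load, it helps to picture the confirmation step the fault-detector performs once a ping to the primary has failed. The following is a simplified sketch, not the actual GigaSpaces fault-detector: FaultDetectorSketch is a hypothetical class, and the ping is passed in as a plain BooleanSupplier that returns true when the primary answers.

```java
import java.util.function.BooleanSupplier;

// Hypothetical model of the confirmation checks; not GigaSpaces code.
class FaultDetectorSketch {
    private final int retryCount;
    private final long retryTimeoutMs;

    FaultDetectorSketch(int retryCount, long retryTimeoutMs) {
        this.retryCount = retryCount;
        this.retryTimeoutMs = retryTimeoutMs;
    }

    // Called once a regular ping has failed. Returns true only if every
    // confirmation check also fails, i.e. the primary is considered down.
    boolean confirmPrimaryFailure(BooleanSupplier pingPrimary) {
        for (int check = 0; check < retryCount; check++) {
            try {
                Thread.sleep(retryTimeoutMs); // wait between confirmation checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false; // interrupted: do not trigger an election
            }
            if (pingPrimary.getAsBoolean()) {
                return false; // primary answered; the suspicion was a false alarm
            }
        }
        return true;
    }
}
```

Each confirmation check is one more remote call to the primary, so a larger retry-count or a shorter retry-timeout means more pings in a given period, on top of the regular ping sent every invocation-delay.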

For more details on active election, refer to the Active Election and Avoiding Split-Brain Scenarios section.

