Summary: About failure detection, reducing failure detection time, and relevant parameters.
Overview
Failure detection time is the time it takes for the space and the client to detect that a failure has occurred. Failure detection consists of two main phases:
One of two main failure scenarios might occur:
It takes GigaSpaces a few seconds to recover from process failure or a machine crash. In case of network cable disconnection, the client first has to detect that it has been disconnected from the machine running the space. Therefore, recovery time in this case is longer. For details on how network failure is detected and handled, see the About Network Failure Detection section.
Reducing Failure Detection Time
Configuring failure detection time can help you handle extreme failure scenarios more effectively. For example, in extreme cases of network disconnection, you might want the failover process to take 2-3 seconds.
Failure Detection Parameters
Space Side Parameters
Active Election Parameters
The following parameters in the active election block of the cluster schema relate to failure detection and recovery (XPath overrides can be used instead of cluster schema values):
Watchdog Parameters
Client Side Parameters
Liveness Detection Properties
The liveness detection mechanism defines how frequently the system checks the liveness of members, that is, whether members that were available have become unavailable:
Watchdog Parameters
Service Grid Parameters
The Service Grid uses two complementary mechanisms for service detection: the Lookup Service and fault-detection handlers.
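A fault-detection handler of this kind amounts to a periodic liveness check with a bounded number of retries. The following is a minimal sketch of that idea, not the GigaSpaces implementation: the class, the method names, and the parameters `invocationDelayMillis`, `retryCount`, and `retryTimeoutMillis` are hypothetical names chosen for illustration, and the worst-case formula assumes that the liveness check itself takes negligible time.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of a fault-detection check; not the GigaSpaces implementation. */
final class FaultDetectionSketch {

    /**
     * Returns true if the service answers a liveness check. The check is tried
     * once and then retried up to retryCount more times, sleeping
     * retryTimeoutMillis between attempts. All names here are assumptions.
     */
    static boolean isServiceAlive(BooleanSupplier ping, int retryCount, long retryTimeoutMillis) {
        for (int attempt = 0; attempt <= retryCount; attempt++) {
            if (ping.getAsBoolean()) {
                return true; // the service responded
            }
            if (attempt < retryCount) {
                try {
                    Thread.sleep(retryTimeoutMillis); // wait before the next retry
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return false;
                }
            }
        }
        return false; // every attempt failed: report the service as failed
    }

    /**
     * Worst-case time to declare a failure under this model: the service dies
     * right after a successful check, so one full check interval passes before
     * the first failed check, followed by retryCount retry waits.
     */
    static long worstCaseDetectionMillis(long invocationDelayMillis, int retryCount, long retryTimeoutMillis) {
        return invocationDelayMillis + (long) retryCount * retryTimeoutMillis;
    }
}
```

Under this model, lowering the check interval, the retry count, or the retry wait shortens failure detection time, at the cost of more frequent checks and a higher chance of declaring a slow but healthy service as failed.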
The fault-detection handlers periodically check whether a service is alive. Their settings determine how many times to retry, and how often, when a check fails. The GSM and GSC fault-detection handler settings are located in the services.config file. The PUFaultDetectionHandler is configurable using the SLA member alive indicator. To monitor service failures in the logs, it is advised to set the logging level of the following loggers to Level.FINE (they are shown here at INFO):
# ServiceGrid FaultDetectionHandler logging
com.gigaspaces.grid.gsc.GSCFaultDetectionHandler.level = INFO
com.gigaspaces.grid.gsm.GSMFaultDetectionHandler.level = INFO
org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler.level = INFO
Jini Parameters
The LeaseRenewalManager in the advanced-space.config file is also related to failure detection and recovery:
Unicast Discovery Parameters
When a Jini Lookup Service fails and is brought back online, a client (such as a GSC, a space, or a client with a space proxy) needs to rediscover it. The client uses Jini unicast discovery, retrying the connection to the failed remote lookup service. The default unicast retry protocol takes a graduated approach: after each attempt, it increases the time to wait before the next discovery attempt, until it reaches a maximum interval at which discovery is retried. In this way, the network is not flooded with unicast discovery requests referencing a lookup service that may not be available for quite some time (if ever). The downside is that the discovery of services may be delayed if they are not brought back up quickly. Discovery can be delayed by as much as 15 minutes. For example, if you have two GSMs and one fails and is brought back up only an hour later, its discovery can take ~15 minutes after it has loaded. These settings can be configured; see How to Configure Unicast Discovery.
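The graduated retry behavior described above can be sketched as a fixed schedule of wait intervals that grows and then stays at a maximum. This is an illustrative sketch only: the class and method names are hypothetical, and the individual interval values are assumptions. Only the 15-minute upper bound is taken from the text above.

```java
/** Hypothetical sketch of a graduated unicast retry schedule. */
final class UnicastRetrySketch {

    // Wait intervals in milliseconds. The values are assumptions chosen for
    // illustration; only the final 15-minute cap comes from the documentation.
    static final long[] RETRY_INTERVALS_MILLIS = {
        5_000, 10_000, 20_000, 30_000, 60_000,
        2 * 60_000, 4 * 60_000, 8 * 60_000, 15 * 60_000
    };

    /** Wait time before discovery attempt number `attempt` (0-based). */
    static long delayBeforeAttempt(int attempt) {
        // Clamp the index so that later attempts keep using the maximum interval.
        int index = Math.max(0, Math.min(attempt, RETRY_INTERVALS_MILLIS.length - 1));
        return RETRY_INTERVALS_MILLIS[index];
    }

    /** Total time spent waiting across the first `attempts` attempts. */
    static long totalWaitMillis(int attempts) {
        long total = 0;
        for (int i = 0; i < attempts; i++) {
            total += delayBeforeAttempt(i);
        }
        return total;
    }
}
```

Once the schedule has reached its last entry, a lookup service that comes back online just after a failed attempt is not contacted again until the full maximum interval has elapsed, which is why rediscovery can lag by up to 15 minutes.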
Failure Detection
IMPORTANT: This is an old version of GigaSpaces XAP.

For more details about how failure scenarios are handled, refer to the