Production Readiness for Failure
“Failing to plan is planning to fail.” Many of us know this quote, attributed to Benjamin Franklin, who is often credited as a pioneer of time management. While planning is a basic building block behind many production system architectures and workflow designs, we need to remember that system and application failures are unavoidable. According to the CAP theorem, when a network partition occurs (for example, as a result of a software or hardware failure), a choice has to be made between availability and consistency.
The key questions that have to be asked are:
- Is it possible to beat the CAP theorem, or is it like pushing a boulder up a never-ending hill?
- Can production systems meet the challenges identified by Failure Mode and Effects Analysis (FMEA)?
- Is it possible to reduce the potential loss of data due to partial or complete application failure?
At GigaSpaces, we believe that production readiness can be addressed by using FMEA to evaluate the severity and likelihood of failures and to prioritize their mitigation. It can be achieved with InsightEdge, a distributed multi-model computation and analytics platform powered by GigaSpaces Core, the company's widely used in-memory computing engine for real-time, low-latency scenarios.
This post addresses the above questions in order to better understand the potential drawbacks of deploying a self-managed distributed platform instead of a production-grade platform.
CAP Theorem and Redundancy
In general, the CAP theorem states that Consistency, Availability and Partition Tolerance cannot all be guaranteed at the same time. Before discussing the tradeoffs, it should be noted that in today's distributed systems, Partition Tolerance is a must, which leaves the discussion to focus on Consistency and Availability:
- Consistency + Partition Tolerance (CP): Every client sees the same data at the same time. This combination is mostly used in storage-based, big-cache scenarios that favor consistency over availability.
- Availability + Partition Tolerance (AP): Every client receives a response for all read and write requests at all times, regardless of the state of any individual node in the system. This combination can be found in distributed transactional systems favoring Availability over Consistency.
Figure 1: Simple Representation of the CAP Theorem and Placement of Solutions in the Market
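To make the trade-off more concrete, the minimal sketch below (plain Java, not GigaSpaces code) contrasts a CP-style write, which refuses to complete unless every replica acknowledges it, with an AP-style write, which succeeds as long as any replica accepts the value. The replica and method names are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal CP-vs-AP illustration: replicas are plain in-process maps standing in
// for nodes; a real system would replace them with network calls.
public class CapTradeoffSketch {

    interface Replica {
        boolean write(String key, String value);   // returns false if the "node" is unreachable
        String read(String key);
    }

    static class InMemoryReplica implements Replica {
        private final Map<String, String> data = new ConcurrentHashMap<>();
        volatile boolean reachable = true;          // toggled to simulate a partition

        public boolean write(String key, String value) {
            if (!reachable) return false;
            data.put(key, value);
            return true;
        }
        public String read(String key) { return reachable ? data.get(key) : null; }
    }

    // CP-style write: refuse the write unless every replica acknowledges, sacrificing
    // availability during a partition. (A real implementation would also roll back
    // or reject the write before applying it anywhere.)
    static boolean cpWrite(List<Replica> replicas, String key, String value) {
        return replicas.stream().allMatch(r -> r.write(key, value));
    }

    // AP-style write: succeed as long as any replica accepts the value,
    // accepting temporary inconsistency between replicas during a partition.
    static boolean apWrite(List<Replica> replicas, String key, String value) {
        boolean accepted = false;
        for (Replica r : replicas) {
            accepted |= r.write(key, value);        // best-effort: any ack is enough
        }
        return accepted;
    }

    public static void main(String[] args) {
        InMemoryReplica a = new InMemoryReplica();
        InMemoryReplica b = new InMemoryReplica();
        List<Replica> cluster = List.of(a, b);

        b.reachable = false;  // simulate a network partition isolating replica b
        System.out.println("CP write during partition: " + cpWrite(cluster, "k", "v1")); // false: unavailable
        System.out.println("AP write during partition: " + apWrite(cluster, "k", "v2")); // true: inconsistent
    }
}
```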
Addressing the CAP Theorem Limitations
The CAP theorem limitations can be addressed by planning an appropriate distributed architecture for production-grade platforms. The GigaSpaces in-memory computing platform has been designed and developed to ensure that applications, whether running on top of the platform or deployed to it, are powered by a robust infrastructure that copes with the challenges of distributed architectures.
Four main High Availability and Disaster Recovery aspects will be discussed:
- Optimized Data Replication
- Automatic Self-Healing
- Multi-Site Deployment
- Disaster Recovery
Optimized Data Replication
GigaSpaces InsightEdge includes several levels of field-proven, reliable and high-performance replication mechanisms:
Level 1: Synchronous replication between peer nodes
A reliable and high-performance in-memory replication mechanism replicates data between peer nodes within a cluster, supporting a primary-backup topology by default. Consequently, each object has at least one backup in memory.
Figure 2: Optimized Data Replication
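As a rough illustration of Level 1, the following sketch shows a synchronous primary-backup write path. The class and method names are hypothetical, not the GigaSpaces API, but the key point holds: the write returns to the client only after the backup has applied it, so a second in-memory copy always exists.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a synchronous primary-backup write path (not the GigaSpaces API):
// the write is acknowledged to the client only after the backup has applied it,
// so at least one in-memory copy exists on a peer node at all times.
public class PrimaryBackupSketch {

    static class Node {
        final Map<String, Object> store = new ConcurrentHashMap<>();
        void apply(String key, Object value) { store.put(key, value); }
    }

    static class Primary {
        private final Node local = new Node();
        private final Node backup;            // peer node holding the synchronous replica

        Primary(Node backup) { this.backup = backup; }

        // Synchronous replication: apply locally, then block until the backup applies too.
        void write(String key, Object value) {
            local.apply(key, value);
            backup.apply(key, value);         // in a real cluster this is a blocking network call
        }

        Object read(String key) { return local.store.get(key); }
    }

    public static void main(String[] args) {
        Node backup = new Node();
        Primary primary = new Primary(backup);
        primary.write("order-42", "NEW");
        // Both copies are guaranteed to exist once write() returns.
        System.out.println("primary=" + primary.read("order-42") + ", backup=" + backup.store.get("order-42"));
    }
}
```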
Level 2: Synchronous SSD/PMEM persistence
Synchronous persistence is used with a persistent store that can hold large amounts of data on SSD and other persistent storage layers. This ensures that data is immediately persisted to disk; when the system starts, the data can be bootstrapped from the device and the indexes reloaded into memory for near-instant recovery.
Figure 3: Synchronous Persistence to SSD/Persistent Memory
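The sketch below illustrates the write-through idea behind Level 2 under simplifying assumptions (a single append-only log file standing in for the SSD tier); it is not MemoryXtend itself. Note how only a small key-to-offset index is kept in RAM and rebuilt at startup, while the values themselves stay on disk until they are read.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of write-through persistence: every put() is synchronously
// appended to a log file on SSD; only a small key -> file-offset index is kept in RAM,
// and on restart the index is rebuilt by scanning the log instead of reloading all values.
public class WriteThroughStoreSketch {
    private final RandomAccessFile log;
    private final Map<String, Long> index = new HashMap<>();   // key -> offset of the record

    WriteThroughStoreSketch(Path file) throws IOException {
        log = new RandomAccessFile(file.toFile(), "rw");
        recoverIndex();                                          // "instant" recovery: index only
    }

    // Synchronous persistence: the write is on durable storage before put() returns.
    synchronized void put(String key, String value) throws IOException {
        long offset = log.length();
        log.seek(offset);
        log.writeUTF(key);
        log.writeUTF(value);
        log.getFD().sync();                                      // force the write to the device
        index.put(key, offset);
    }

    synchronized String get(String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) return null;
        log.seek(offset);
        log.readUTF();                                           // skip the key
        return log.readUTF();                                    // value is read from SSD on demand
    }

    // Rebuild only the index on startup; values stay on SSD until they are read.
    private void recoverIndex() throws IOException {
        log.seek(0);
        while (log.getFilePointer() < log.length()) {
            long offset = log.getFilePointer();
            String key = log.readUTF();
            log.readUTF();                                       // skip the value
            index.put(key, offset);
        }
    }

    public static void main(String[] args) throws IOException {
        WriteThroughStoreSketch store = new WriteThroughStoreSketch(Path.of("space.log"));
        store.put("user-1", "alice");
        System.out.println(store.get("user-1"));
    }
}
```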
Level 3: Asynchronous persistence replication
Asynchronous replication is used to replicate data to persistent storage. This enables the storage of massive amounts of data for future usage in a range of solutions, such as databases, data warehouses, Hadoop and cloud object storage.
Figure 4: Asynchronous Persistence to Any Type of DB/NoSQL
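A minimal sketch of Level 3 follows, assuming a simple bounded redo log drained by a background thread; the names are illustrative and the real mirror service is considerably more elaborate. The write path only enqueues the change, so the slow external store never sits on the hot path.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of asynchronous persistence: writes are acknowledged immediately
// after the in-memory update, while a bounded redo log is drained to the external store
// (database, data warehouse, Hadoop, object storage) by a background thread.
public class AsyncMirrorSketch {

    interface ExternalStore { void save(String key, String value); }

    private final BlockingQueue<String[]> redoLog = new LinkedBlockingQueue<>(10_000); // bounded buffer
    private final ExternalStore store;

    AsyncMirrorSketch(ExternalStore store) {
        this.store = store;
        Thread drainer = new Thread(this::drain, "redo-log-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    // Called on the write path: enqueue and return without waiting for the external store.
    void replicateAsync(String key, String value) throws InterruptedException {
        redoLog.put(new String[] {key, value});    // blocks only if the redo log is full (backpressure)
    }

    private void drain() {
        try {
            while (true) {
                String[] entry = redoLog.take();
                store.save(entry[0], entry[1]);    // slow external write happens off the hot path
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AsyncMirrorSketch mirror = new AsyncMirrorSketch((k, v) -> System.out.println("persisted " + k + "=" + v));
        mirror.replicateAsync("trade-7", "FILLED");
        Thread.sleep(100);                         // give the daemon thread time to drain in this demo
    }
}
```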
Synchronous and asynchronous replication – the full picture
GigaSpaces InsightEdge provides multiple replication levels for robust availability according to organizational requirements, ensuring that the platform and associated applications are always on through:
- Real-time in-memory primary/backup
- Instant, fully synchronous replication to SSD and persistent memory
- Reliable asynchronous replication to any data store type
Figure 5: Synchronous In-memory Replication, Synchronous Replication to SSD and Asynchronous Persistence to Any Type of DB/NoSQL
Automatic Self-Healing
Behind the scenes, GigaSpaces InsightEdge is a distributed, SLA-driven platform that leverages containers. Consequently, it has self-healing characteristics: when a container crashes, the failed instances are automatically relocated to the available containers, providing the application with continuous high availability.
This enterprise-level approach can be modeled on three available topologies – Partitioned, Partitioned with Backup and Replicated – as illustrated in the following example:
Figure 6: Three Partition Strategies Before “Container 4” Failure
If Container 4 fails due to a software or hardware failure, an SLA-driven self-healing workflow is activated behind the scenes to ensure enterprise-grade availability and zero downtime, as illustrated below:
Figure 7: Automatic Failover (Backup Becomes Primary) and Automatic Self-Healing (Reallocation of Partitions to Assure Service Stability)
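The following sketch captures the essence of this workflow under simplified assumptions (a single healing pass over in-process objects, with hypothetical names rather than the actual GigaSpaces manager): when a container dies, each affected partition promotes its backup to primary and allocates a new backup on a surviving container.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an SLA-driven self-healing loop: when a container fails,
// each partition that lost its primary promotes its backup, and a replacement
// instance is provisioned on one of the remaining containers.
public class SelfHealingSketch {

    static class Container {
        final String name;
        volatile boolean alive = true;
        Container(String name) { this.name = name; }
    }

    static class Partition {
        final int id;
        Container primary;
        Container backup;
        Partition(int id, Container primary, Container backup) {
            this.id = id; this.primary = primary; this.backup = backup;
        }
    }

    // One pass of the healing loop; a real system would run this continuously against SLA rules.
    static void heal(List<Partition> partitions, List<Container> containers) {
        for (Partition p : partitions) {
            if (!p.primary.alive) {
                System.out.println("partition " + p.id + ": promoting backup on " + p.backup.name);
                p.primary = p.backup;                       // automatic failover
                p.backup = pickSpare(containers, p.primary);
                System.out.println("partition " + p.id + ": new backup allocated on " + p.backup.name);
            }
        }
    }

    static Container pickSpare(List<Container> containers, Container exclude) {
        return containers.stream()
                .filter(c -> c.alive && c != exclude)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no spare container available"));
    }

    public static void main(String[] args) {
        Container c1 = new Container("container-1"), c2 = new Container("container-2"),
                  c3 = new Container("container-3"), c4 = new Container("container-4");
        List<Container> containers = new ArrayList<>(List.of(c1, c2, c3, c4));
        List<Partition> partitions = List.of(new Partition(1, c4, c2), new Partition(2, c1, c3));

        c4.alive = false;              // "Container 4" crashes
        heal(partitions, containers);  // partition 1 fails over to c2 and gets a new backup
    }
}
```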
Multi-Site Deployment
Global enterprises leverage multi-site deployments for:
- High availability
- Local performance boosts
- Guaranteed data synchronization
In multi-site deployments, replicating and sharing data between multiple geographically distributed active clusters to support global business operations is a challenge.
The architecture presented below supports complete high availability for a multi-site deployment, without requiring third-party replication software or brokers, which can be expensive and can introduce additional latency.
Intelligent data replication mechanisms – including filtering, encryption, and compression – ensure optimized bandwidth utilization and data consistency at ultra-low latency.
Figure 8: Multi-region, Hybrid and Multi-cloud Reliable Asynchronous Replication
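A simplified sketch of such a replication pipeline is shown below; it is not the GigaSpaces gateway API, and the filter and encryption hooks are placeholders, but it shows the order of operations: filter out irrelevant changes, compress what remains, and encrypt the payload before it leaves the site.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch of a WAN replication pipeline: each outgoing change is filtered,
// compressed, and then handed to an encryption hook before being shipped to the remote
// site, to keep cross-site bandwidth usage low.
public class WanReplicationSketch {

    private final Predicate<String> filter;          // which changes are relevant for the remote site
    private final UnaryOperator<byte[]> encrypt;     // e.g. AES or a TLS channel in a real deployment

    WanReplicationSketch(Predicate<String> filter, UnaryOperator<byte[]> encrypt) {
        this.filter = filter;
        this.encrypt = encrypt;
    }

    // Returns the payload to send over the WAN, or null if the change is filtered out.
    byte[] prepare(String change) throws IOException {
        if (!filter.test(change)) return null;                              // 1. filtering
        byte[] compressed = gzip(change.getBytes(StandardCharsets.UTF_8));  // 2. compression
        return encrypt.apply(compressed);                                   // 3. encryption
    }

    private static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        WanReplicationSketch pipeline = new WanReplicationSketch(
                change -> change.startsWith("orders."),   // only replicate the orders table
                bytes -> bytes);                           // identity hook; replace with real encryption
        byte[] payload = pipeline.prepare("orders.update id=42 status=SHIPPED");
        System.out.println("bytes on the wire: " + (payload == null ? 0 : payload.length));
    }
}
```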
Multi-region, hybrid cloud and multi-cloud examples
GigaSpaces InsightEdge addresses the replication flow issues required for global deployments, as shown in the following use cases:
1. Multi-region tier-based applications
GigaSpaces InsightEdge shares data between tier-based applications running across data centers/geographies as follows:
- Integrates data access across applications
- Provides data replication for handling disaster recovery requirements (Active-Active and Active-Passive)
- Provides data replication to development and test/staging environments
Figure 9: Data Replication to Development and Test/Staging Environments
2. Multi-region smart client-side caching over WAN
GigaSpaces InsightEdge delivers client-side caching for global reference data, which is used by applications such as user-profile services and other multi-data center applications (a minimal sketch of this pattern appears after these use cases).
Figure 10: Conceptual Relationship Between the Local View/Cache and the Master Space (purple arrows)
3. Hybrid cloud and cloud bursting
GigaSpaces InsightEdge ensures that on-premises applications and data are replicated on demand to the cloud.
Figure 11: On-demand Replication of Data to the Cloud
4. Multi-cloud data replication
GigaSpaces InsightEdge ensures the replication of data across different cloud vendors for maximum availability and other operational considerations.
Figure 12: Multi-Cloud Data Replication Across Vendors
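The read-through pattern referenced in use case 2 can be sketched as follows, assuming a hypothetical local-view class (this is not the GigaSpaces local-cache API): reads are served from an in-process map and fall through to the remote master space only on a miss.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of the client-side "local view" pattern: reads are served from
// an in-process map and fall through to the remote master space only on a miss.
public class LocalViewSketch<K, V> {

    private final Map<K, V> localView = new ConcurrentHashMap<>();
    private final Function<K, V> masterSpace;   // remote lookup; a network call in a real deployment

    LocalViewSketch(Function<K, V> masterSpace) { this.masterSpace = masterSpace; }

    // Read-through: a local hit avoids the WAN round trip entirely.
    V get(K key) {
        return localView.computeIfAbsent(key, masterSpace);
    }

    // Called when the master pushes an update for data this client has cached.
    void onMasterUpdate(K key, V value) {
        localView.put(key, value);
    }

    public static void main(String[] args) {
        LocalViewSketch<String, String> profiles = new LocalViewSketch<>(key -> {
            System.out.println("fetching " + key + " from the master space");
            return "profile-for-" + key;
        });
        System.out.println(profiles.get("user-17"));   // miss: goes to the master
        System.out.println(profiles.get("user-17"));   // hit: served locally
    }
}
```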
Disaster Recovery
Disaster recovery is another important challenge for production-grade platforms. The only way to futureproof a platform is to plan for failure. This means understanding the significance of any failure and planning the recovery path, under the assumption that loss of data is not an option.
As discussed above, for mission-critical systems, the platform architecture must be highly available, i.e., it must support N+1 primary/backup for real-time continuity of the system. In addition, multi-cluster availability for multi-region support and failover strengthens the overall high availability of the system.
Multi-site disaster recovery for a two-cluster topology
In this case, one cluster is for standard clients and the other for mission-critical clients.
In case of failure, standard clients will be cut off by the load balancer, while mission-critical clients will be directed to the active cluster.
Figure 13: Two Cluster Topology
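A possible sketch of the load-balancer policy just described is shown below; the tier names and routing rule are illustrative assumptions, not a specific product feature. When the topology is degraded, remaining capacity is reserved for mission-critical clients and standard clients are rejected until the failed cluster recovers.

```java
import java.util.Optional;

// Hypothetical sketch of priority-based routing during a cluster failure: with one
// cluster down, remaining capacity is reserved for mission-critical clients, while
// standard clients are rejected until the failed cluster recovers.
public class FailoverRoutingSketch {

    enum ClientTier { STANDARD, MISSION_CRITICAL }

    static class Cluster {
        final String name;
        volatile boolean healthy = true;
        Cluster(String name) { this.name = name; }
    }

    static Optional<Cluster> route(ClientTier tier, Cluster primary, Cluster secondary) {
        boolean degraded = !primary.healthy || !secondary.healthy;
        if (degraded && tier == ClientTier.STANDARD) {
            return Optional.empty();                       // standard clients are cut off
        }
        if (primary.healthy) return Optional.of(primary);
        if (secondary.healthy) return Optional.of(secondary);
        return Optional.empty();                           // total outage
    }

    public static void main(String[] args) {
        Cluster a = new Cluster("cluster-a"), b = new Cluster("cluster-b");
        a.healthy = false;                                  // one cluster fails
        System.out.println("standard -> " + route(ClientTier.STANDARD, a, b));
        System.out.println("mission-critical -> " + route(ClientTier.MISSION_CRITICAL, a, b).map(c -> c.name));
    }
}
```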
In any disaster recovery overview, an understanding of “time between failures” concepts is crucial. Mean Time Between Failures (MTBF) refers to the amount of time that elapses between one failure and the next; mathematically, MTBF = MTTF + MTTR, the sum of the Mean Time To Failure (MTTF) and the Mean Time To Repair (MTTR). In other words, it is the total time for a device to fail and for the failure to be repaired.
Figure 14: Differentiating Between Failure Metrics
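The figures below are purely illustrative (the MTTF and MTTR values are assumed), but they show how the metrics relate and why shrinking MTTR matters so much for availability.

```java
// A short numeric illustration of the failure metrics above (figures are illustrative only):
// MTBF = MTTF + MTTR, and steady-state availability = MTTF / MTBF.
public class FailureMetricsSketch {
    public static void main(String[] args) {
        double mttfHours = 2000.0;   // mean time to failure (assumed for the example)
        double mttrHours = 1.0;      // mean time to repair
        double mtbfHours = mttfHours + mttrHours;
        double availability = mttfHours / mtbfHours;
        System.out.printf("MTBF = %.1f h, availability = %.4f%%%n", mtbfHours, availability * 100);

        // Shrinking MTTR from 1 h to 30 s raises availability from ~99.95% to ~99.9996%.
        double fastRepairHours = 30.0 / 3600.0;
        System.out.printf("with 30 s MTTR: %.5f%%%n", 100 * mttfHours / (mttfHours + fastRepairHours));
    }
}
```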
Therefore, mission-critical systems must be sufficiently stable to extend the time between failures as much as possible and recover from such failures as fast as possible.
Suggested architecture for disaster recovery
A highly available deployment is important, but being highly available within a single cluster is not enough. A suggested architecture includes a multi-cluster layout with a load balancer on top for redirecting traffic, and a direct synchronous persistence layer (SSD) under the hood to ensure production-grade availability.
Figure 15: Suggested Disaster Recovery Architecture
Failure Mode and Effects Analysis (FMEA) Scenarios
Looking at the bigger picture, the following FMEA scenarios provide an understanding of the failures that can happen and of how they can be avoided, or recovered from, as quickly as possible.
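For readers unfamiliar with FMEA scoring, the sketch below shows one common way to prioritize such scenarios, using a Risk Priority Number (RPN = severity x occurrence x detection, each rated 1-10); the failure modes and ratings here are illustrative assumptions, not GigaSpaces data.

```java
import java.util.List;

// Illustrative FMEA prioritization: each failure mode gets a Risk Priority Number,
// RPN = severity * occurrence * detection, and the highest RPN is addressed first.
public class FmeaPrioritizationSketch {

    record FailureMode(String name, int severity, int occurrence, int detection) {
        int rpn() { return severity * occurrence * detection; }
    }

    public static void main(String[] args) {
        List<FailureMode> modes = List.of(
                new FailureMode("complete site failure", 9, 2, 3),
                new FailureMode("redo log overflow during recovery", 7, 4, 5),
                new FailureMode("heap exhaustion while scaling", 6, 5, 4));

        modes.stream()
                .sorted((a, b) -> Integer.compare(b.rpn(), a.rpn()))
                .forEach(m -> System.out.println(m.name() + " -> RPN " + m.rpn()));
    }
}
```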
FMEA Scenario 1: MTTR
MTTR is a basic measure of the maintainability of repairable items, representing the average time required to repair a failed component or device.
Challenge: When a system is NOT enterprise-level HA, a failure of one cluster leads to an MTTR window of over an hour, forcing the system to operate with only one cluster for the duration of the bootstrap.
Figure 16: FMEA – Complete Failure and Recovery of a Site
Solution: With the InsightEdge Platform, recovery takes seconds, because it is not necessary to perform an initial load of all the data from a database. By leveraging SSD persistence within InsightEdge, only the index metadata is loaded into memory.
FMEA Scenario 2: Failed cluster recovery
Challenge: When a failed cluster recovers, the initial load from another site can take from minutes to hours, during which time the replication redo log (i.e., the buffer) between the clusters can fill up. This can cause replication to slow down and even fail.
Figure 17: FMEA – Slow Recovery Due to Limited Bandwidth Between Sites
Solution: Leveraging both synchronous and asynchronous replication mechanisms, the initial boot is executed from the SSD, persistent memory or database, and only the delta is replicated from the second site.
FMEA Scenario 3: Scaling and grid topology
Grid topology is bound to the size and number of nodes deployed.
Challenge: The main challenges when scaling a distributed grid are:
- The configured memory heap reaches the maximum recommended size for standard operations, and scaling up can introduce garbage collection issues.
- Scaling out by increasing the number of partitions is difficult. Doubling the number of partitions is an option (see the routing sketch below), but this presents the following additional challenges:
- High costs due to over-provisioning
- Unknown sizing
- Increasing the number of partitions may introduce latency or transactional issues
Figure 18: FMEA – Challenges when Scaling a Distributed Grid
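The routing sketch referenced above illustrates why repartitioning is painful, assuming simple hash-modulo routing of a key to a partition (a common scheme, used here for illustration rather than as a description of GigaSpaces internals): once the partition count changes, a large fraction of keys map to a different partition and must be physically relocated.

```java
// Minimal illustration of why changing the partition count is hard with hash-modulo
// routing: many routing keys map to a different partition after the count doubles,
// so data has to be physically moved between nodes.
public class PartitionRoutingSketch {

    static int partitionOf(String routingKey, int partitionCount) {
        return Math.floorMod(routingKey.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        String[] keys = {"account-1", "account-2", "account-3", "account-4", "account-5"};
        int before = 4, after = 8;          // doubling the number of partitions
        int moved = 0;
        for (String key : keys) {
            int p1 = partitionOf(key, before);
            int p2 = partitionOf(key, after);
            if (p1 != p2) moved++;
            System.out.println(key + ": partition " + p1 + " -> " + p2);
        }
        System.out.println(moved + " of " + keys.length + " keys would have to be relocated");
    }
}
```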
Solution: Using synchronous replication, data is pushed down to the SSD or persistent memory layer. This frees up memory for metadata and increases the amount of data that can be stored in the platform's space by 10X.
FMEA Scenario 4: Recovery Time Objective (RTO)
RTO is the targeted time and service level within which a business process must be restored after a disaster (or disruption), in order to avoid unacceptable consequences associated with a break in business continuity.
Challenge: Consider a worst-case scenario, the loss of both clusters, when mirroring to a database is used. A full initial load of 1 TB from the database takes about 1 hour, i.e., mission-critical clients will have to wait at least 1 hour for even one cluster to become available.
Figure 19: FMEA – RTO and Business Continuity
Solution: With GigaSpaces MemoryXtend, the data is persisted locally, so the “initial load” completes in a few minutes or less. There is no full initial load from an external store (such as an RDBMS or NoSQL database), since the data is immediately available on the SSD, unless specific business rules have been deployed that require data to be loaded back into RAM on restart.
Final Thoughts and Further Reading
Failing fast and recovering quickly are basic requirements, and they translate into the need for full consistency and fast recovery modes.
This article has presented best practices and various options for maximizing availability and preventing data loss, including multiple architecture recommendations and the benefits of leveraging new, fast data storage devices such as SSD and Optane DC Persistent Memory.
To understand more about the internal FMEA of a distributed platform and how to ensure availability and consistency in the presence of network partitions, read here>>
To understand more about intelligent multi-tiered storage, read here>>
To see how consistency can be preserved, with availability lost only on the minority side of a partition and restored as soon as a majority is regained, read how ZooKeeper is used as a centralized service for maintaining discovered instances and enabling distributed synchronization for quorum-based decisions.
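As a generic illustration of the quorum rule (not ZooKeeper's actual implementation), the small sketch below shows why the majority side of a partition can keep serving while the minority side must stop.

```java
import java.util.List;

// Generic majority-quorum rule: the side of a network partition that can still see a
// majority of the nodes stays available, while the minority side stops serving until
// the partition heals.
public class QuorumSketch {

    static boolean hasQuorum(List<Boolean> reachableNodes) {
        long visible = reachableNodes.stream().filter(r -> r).count();
        return visible > reachableNodes.size() / 2;       // strict majority required
    }

    public static void main(String[] args) {
        // 5-node ensemble split 3/2 by a partition: the 3-node side keeps a quorum.
        System.out.println("majority side: " + hasQuorum(List.of(true, true, true, false, false)));
        System.out.println("minority side: " + hasQuorum(List.of(false, false, false, true, true)));
    }
}
```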