Production Readiness for Failure
“Failing to plan is planning to fail.” Many of us are acquainted with this quote, commonly attributed to Benjamin Franklin. While planning is a basic building block behind many production system architectures and workflow designs, we need to remember that system and application failures are unavoidable. According to the CAP theorem, when a network partition occurs – for example, as a result of software or hardware failures – a distributed system must choose between availability and consistency.
The key questions that have to be asked are:
- Is it possible to beat the CAP theorem, or is it a Sisyphean task of forever pushing a boulder up a never-ending hill?
- Can production systems meet the challenges for Failure Mode and Effects Analysis (FMEA)?
- Is it possible to reduce the potential loss of data due to partial or complete application failure?
At GigaSpaces, we believe that production readiness can be addressed with FMEA, which evaluates the severity and likelihood of failures and prioritizes them by urgency. It can be achieved with InsightEdge, a distributed multi-model computation and analytics platform powered by GigaSpaces Core, a widely used in-memory computing core for real-time, low-latency scenarios.
This post addresses the above questions in order to better understand the potential drawbacks of deploying a self-managed distributed platform instead of a production-ready platform.
CAP Theorem and Redundancy
In general, the CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability and Partition Tolerance. Before discussing the tradeoffs, it should be noted that in today’s distributed systems, Partition Tolerance is a must, which leaves the discussion to focus on Consistency versus Availability:
- Consistency + Partition Tolerance (CP): Every client sees the same data at the same time, mostly used with storage-based, big cache scenarios favoring consistency over availability.
- Availability + Partition Tolerance (AP): Every client receives a response for all read and write requests at all times, regardless of the state of any individual node in the system. This combination can be found in distributed transactional systems favoring Availability over Consistency.
Figure 1: Simple Representation of the CAP Theorem and Placement of Solutions in the Market
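The CP/AP distinction above can be made concrete with a small sketch. The following is an illustrative Python simulation (not GigaSpaces code): during a partition, a CP replica refuses requests rather than risk serving stale data, while an AP replica keeps answering with whatever it last replicated.

```python
# Illustrative sketch of CP vs. AP behavior during a network partition.
# The Replica class and its policies are assumptions for demonstration only.

class Replica:
    def __init__(self, policy):
        self.policy = policy          # "CP" or "AP"
        self.data = {"price": 100}    # last value replicated before the partition
        self.partitioned = False      # True when cut off from peer replicas

    def read(self, key):
        if self.partitioned and self.policy == "CP":
            # CP: refuse to answer rather than risk returning stale data.
            raise RuntimeError("unavailable during partition")
        # AP: always answer, possibly with stale data.
        return self.data.get(key)

cp, ap = Replica("CP"), Replica("AP")
cp.partitioned = ap.partitioned = True

print(ap.read("price"))   # 100 -- available, but possibly stale
try:
    cp.read("price")
except RuntimeError as e:
    print(e)              # unavailable during partition
```

Neither behavior is "wrong"; the theorem only says that, under a partition, one of the two guarantees must be given up.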
Addressing the CAP Theorem Limitations
The CAP theorem limitations can be addressed by planning an appropriate distributed architecture for production-grade platforms. The GigaSpaces in-memory computing platform has been designed and developed to ensure that applications – running on top of the platform or deployed to the platform – are powered by a robust infrastructure that copes with distributed architecture challenges.
Four main High Availability and Disaster Recovery aspects will be discussed:
- Optimized Data Replication
- Automatic Self-Healing
- Multi-Region Deployment
- Disaster Recovery
Optimized Data Replication
GigaSpaces InsightEdge includes several levels of field-proven, reliable and high-performance replication mechanisms:
Level 1: Synchronous replication between peer nodes
A reliable and high-performance in-memory replication mechanism replicates data between peer nodes within a cluster, supporting a primary-backup topology by default. Consequently, each object has at least one backup in memory.
Figure 2: Optimized Data Replication
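The primary-backup flow can be sketched in a few lines. This is a minimal Python simulation of the assumed semantics, not the GigaSpaces API: a write is acknowledged to the client only after every backup has applied it, so each object always has at least one in-memory copy besides the primary.

```python
# Hedged sketch of synchronous primary-backup replication.
# Node and PrimaryWithBackups are illustrative, not product classes.

class Node:
    def __init__(self):
        self.store = {}
    def apply(self, key, value):
        self.store[key] = value

class PrimaryWithBackups:
    def __init__(self, backups):
        self.primary = Node()
        self.backups = backups
    def write(self, key, value):
        self.primary.apply(key, value)
        for b in self.backups:      # synchronous: block until every backup applies
            b.apply(key, value)
        return "ack"                # the client sees the ack only after replication

backup = Node()
cluster = PrimaryWithBackups([backup])
cluster.write("order-17", {"qty": 3})
print(backup.store["order-17"])     # the backup already holds the object
```

The key property is ordering: by the time the client receives the acknowledgment, a failure of the primary can no longer lose the write.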
Level 2: Synchronous SSD/PMEM persistence
Synchronous persistence writes data to a persistent store that can hold large amounts of data on SSD and other persistent storage layers. Data is persisted to disk immediately, so when the system restarts it can bootstrap from the device, reloading the data and its indexes back into memory for instant recovery.
Figure 3: Synchronous Persistence to Any Type of DB/NoSQL
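The write-through idea behind synchronous persistence can be sketched as follows. This is an illustrative Python example under assumed semantics (the append-only file format and recovery logic are inventions for the sketch): each write is flushed to disk before it is acknowledged, and a restart rebuilds the in-memory state from the file.

```python
# Hedged sketch of synchronous (write-through) persistence with fsync,
# plus bootstrap-on-startup recovery. Not the product's storage format.

import json, os, tempfile

class DurableStore:
    def __init__(self, path):
        self.path = path
        self.mem = {}
        if os.path.exists(path):          # bootstrap: reload state on startup
            with open(path) as f:
                for line in f:
                    k, v = json.loads(line)
                    self.mem[k] = v

    def write(self, key, value):
        self.mem[key] = value
        with open(self.path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())          # durable on disk before we acknowledge

path = os.path.join(tempfile.mkdtemp(), "store.log")
DurableStore(path).write("k", 1)
recovered = DurableStore(path)            # simulated restart
print(recovered.mem["k"])                 # 1
```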
Level 3: Asynchronous persistence replication
Asynchronous replication is used to replicate data to persistent storage. This enables the storage of massive amounts of data for future usage in a range of solutions, such as databases, data warehouses, Hadoop and cloud object storage.
Figure 4: Asynchronous Persistence to Any Type of DB/NoSQL
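Asynchronous replication is essentially a write-behind pattern: the write is acknowledged from memory and drained to the backing store later. The queue-and-flush mechanics below are an illustrative assumption, not the product implementation.

```python
# Hedged sketch of asynchronous (write-behind) persistence.

from collections import deque

class WriteBehindCache:
    def __init__(self, backing_store):
        self.mem = {}
        self.queue = deque()
        self.backing = backing_store
    def write(self, key, value):
        self.mem[key] = value
        self.queue.append((key, value))   # ack returns before persistence
        return "ack"
    def flush(self):                      # in practice runs on a background thread
        while self.queue:
            k, v = self.queue.popleft()
            self.backing[k] = v

db = {}
cache = WriteBehindCache(db)
cache.write("event-1", "ok")
print("event-1" in db)                    # False -- not yet persisted
cache.flush()
print(db["event-1"])                      # ok
```

The tradeoff versus the synchronous levels is latency against durability lag: the client is never blocked on the slow store, but a crash between write and flush can lose the queued entries unless they are also covered by in-memory replication.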
Synchronous and asynchronous replication – the full picture
GigaSpaces InsightEdge provides multiple replication levels for robust availability according to organizational requirements, ensuring that the platform and associated applications are always on for:
- Real-time in-memory primary/backup
- Instant, fully synchronous replication to SSD and persistent memory
- Reliable asynchronous replication to any data store type
Figure 5: Synchronous In-memory Replication, Synchronous Replication to SSD and Asynchronous Persistence to Any Type of DB/NoSQL
Automatic Self-Healing
Behind the scenes, GigaSpaces InsightEdge is a distributed, SLA-driven platform that leverages containers. Consequently, it has self-healing characteristics: when a container crashes, the failed instances are automatically relocated to available containers, providing the application with continuous high availability.
This enterprise-level approach can be modeled on three available topologies – Partitioned, Partitioned with Backup and Replicated – as illustrated in the following example:
Figure 6: Three Partition Strategies Before “Container 4” Failure
If Container 4 fails due to either a software or hardware failure, a self-healing workflow is activated behind the scenes to ensure enterprise-grade availability and zero downtime, as illustrated below:
Figure 7: Automatic Failover (Backup Becomes Primary) and Automatic Self-Healing (Reallocation of Partitions to Assure Service Stability)
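The failover-plus-self-healing workflow can be sketched in miniature. The topology model below is an illustrative assumption (not the platform's actual placement algorithm): when a container dies, backups of its partitions are promoted to primaries, and fresh backups are provisioned on the surviving containers.

```python
# Hedged sketch of automatic failover and partition reallocation.

def fail_over(placement, dead, alive):
    """placement: partition -> {"primary": container, "backup": container}"""
    for part, roles in placement.items():
        if roles["primary"] == dead:
            roles["primary"] = roles["backup"]   # backup becomes primary
            roles["backup"] = None
        elif roles["backup"] == dead:
            roles["backup"] = None
        if roles["backup"] is None:              # self-heal: allocate a new backup
            roles["backup"] = next(c for c in alive if c != roles["primary"])
    return placement

placement = {
    "P1": {"primary": "C1", "backup": "C4"},
    "P2": {"primary": "C4", "backup": "C2"},
}
healed = fail_over(placement, dead="C4", alive=["C1", "C2", "C3"])
print(healed["P2"])   # P2's backup was promoted, and a new backup was allocated
```

A real scheduler would also balance the new backups across containers according to the SLA; the sketch only shows the promote-then-reprovision ordering.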
Multi-Region Deployment
Global enterprises leverage multi-site deployments for:
- High availability
- Local performance boosts
- Guaranteed data synchronization
In multi-site deployments, replicating and sharing data between multiple, geographically distributed active clusters to support global business operations is a challenge.
The architecture presented below supports complete high availability for a multi-site deployment, without requiring third-party replication software or brokers which can be expensive and introduce additional latency.
Intelligent data replication mechanisms – including filtering, encryption, and compression – ensure optimized bandwidth utilization and data consistency at ultra-low latency.
Figure 8: Multi-region, Hybrid and Multi-cloud Reliable Asynchronous Replication
Multi-region, hybrid cloud and multi-cloud examples
GigaSpaces InsightEdge addresses the replication flow issues required for global deployments, as shown in the following use cases:
1. Multi-region tier-based applications
GigaSpaces InsightEdge shares data between tier-based applications running across data centers/geographies as follows:
- Integrates data access across applications
- Provides data replication for handling disaster recovery requirements (Active-Active and Active-Passive)
- Provides data replication to development and test/staging environments
Figure 9: Data Replication to Development and Test/Staging Environments
2. Multi-region smart client-side caching over WAN
GigaSpaces InsightEdge delivers client-side caching for global reference data. This data is used by applications for user profiles and by multi-data center applications.
Figure 10: Conceptual Relationship Between the Local View/Cache and the Master Space (purple arrows)
3. Hybrid cloud and cloud bursting
GigaSpaces InsightEdge ensures that on-premises applications and data are replicated on demand to the cloud.
Figure 11: On-demand Replication of Data to the Cloud
4. Multi-cloud data replication
GigaSpaces InsightEdge ensures the replication of data across different cloud vendors for maximum availability and other operational considerations.
Figure 12: Multi-Cloud Data Replication Across Vendors
Disaster Recovery
Disaster recovery is another important challenge for production-grade platforms. The only way to future-proof a platform is to plan for failure. This means understanding the significance of any failure and planning the recovery path, under the assumption that loss of data is not an option.
As discussed above, for mission-critical systems, the platform architecture must be highly available, i.e., it must support N+1 primary/backup for real-time continuity of the system. In addition, multi-cluster availability for multi-region support and failover strengthens the overall high availability of the system.
Multi-site disaster recovery for a two-cluster topology
In this case, one cluster is for standard clients and the other for mission-critical clients.
In case of failure, standard clients are “cut off” by the load balancer, while mission-critical clients retain access to the active cluster.
Figure 13: Two Cluster Topology
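The load-balancer policy described above can be sketched as a routing rule. The tiers and cluster names below are illustrative assumptions, not a specific product feature: when only one cluster survives, its capacity is reserved for mission-critical clients.

```python
# Hedged sketch of tier-aware routing during a cluster failure.

def route(client_tier, active_clusters):
    if len(active_clusters) >= 2:
        # Both clusters up: standard traffic on one, mission-critical on the other.
        return active_clusters[0] if client_tier == "standard" else active_clusters[1]
    if client_tier == "mission-critical" and active_clusters:
        return active_clusters[0]          # survivor serves critical clients only
    return None                            # standard clients are cut off

print(route("standard", ["east", "west"]))   # east
print(route("standard", ["east"]))           # None -- cut off during failure
print(route("mission-critical", ["east"]))   # east
```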
In any disaster recovery overview, an understanding of “time between failures” concepts is crucial. Mean Time Between Failures (MTBF) refers to the amount of time that elapses between one failure and the next. Mathematically, it is the sum of the Mean Time To Failure (MTTF) and the Mean Time To Repair (MTTR) – the total time required for a device to fail and for the failure to be repaired.
Figure 14: Differentiating Between Failure Metrics
Therefore, mission-critical systems must be sufficiently stable to extend the time between failures as much as possible and recover from such failures as fast as possible.
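The relationship between these metrics is simple arithmetic; a quick worked example (with illustrative sample numbers) also shows how they determine steady-state availability.

```python
# Worked example of the failure metrics above:
# MTBF = MTTF + MTTR, and steady-state availability = MTTF / MTBF.
# The hour values are illustrative, not measured data.

mttf_hours = 990.0   # mean time to failure (system running)
mttr_hours = 10.0    # mean time to repair (system down)

mtbf_hours = mttf_hours + mttr_hours
availability = mttf_hours / mtbf_hours

print(mtbf_hours)      # 1000.0
print(availability)    # 0.99
```

This makes the preceding point quantitative: availability improves both by extending MTTF (more stable systems) and by shrinking MTTR (faster recovery).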
Suggested architecture for disaster recovery
Highly available deployment is important, but being highly available within the cluster is not enough. A suggested architecture includes a multi-cluster architecture with a load balancer on top for redirecting traffic, and a direct synchronous persistence layer (SSD) underneath the hood to ensure production-grade availability.
Figure 15: Suggested Disaster Recovery Architecture
Failure Mode and Effects Analysis (FMEA)