In the modern digital enterprise, business insights must happen in real time.
Many of the applications we want to learn from and optimize generate a deluge of data at very high throughput and at micro intervals: business transactions, geospatial coordinates, device sensor readings, click-stream attributes. The types of data, just like the records themselves, are endless.
It’s imperative, then, for a modern organization that is digital and insight-driven to seek competitive advantage by asking high-value data questions in the business moment, as it passes. In technical terms, this means being able to apply analytics workloads, machine learning models, and automated decision making right when the data is born.
There are many potential advantages to implementing closed-loop analytics in real time, but until recently such an architecture has been quite difficult to fully materialize.
Blending real time, batch and triggering workflows has been hindered by three fundamental challenges:
1. Fast Data’s Accidental Complexity
In the past couple of years, with the emergence of architectures like Lambda and Kappa, we’ve noticed a lot of GigaSpaces customers struggling to follow the integration-oriented approach of combining disparate systems together (Spark, Storm, HBase, Hadoop, Kafka, and MPP Databases) to come up with a fast data pipeline. The suggested practice of integrating too many heterogeneous systems (while federating access to their datasets at an application layer) has been the biggest challenge from a TCO and development complexity perspective.
2. Performance Limitations
While Apache Spark, along with a handful of in-memory computing data processing frameworks, has solved many Disk I/O bottleneck problems in processing data, there’s still much to be desired in terms of achieving true low-latency streaming and data analytics for mission critical solution verticals: telecommunications, IoT, financial services, to name a few. Simply caching underlying data across RDDs/DataFrames isn’t the silver bullet to enable real-time closed-loop analytics applications.
3. Slow Insight-to-Action Feedback Loop
Most enterprises today have been focused on building systems that accumulate data rather than process it in real time. This is a natural undertaking, as the systems that run the business (OLTP) have often been separated from those that manage it (OLAP). The separation isn’t only at the architecture level, but also at the software delivery level: the world of OLTP applications (web apps, microservices) moves at a much faster delivery speed (with continuous integration and rapid deployments) than OLAP and analytics applications. The latter spend their effort on heavy preparation and cleansing of the data generated by their OLTP counterparts before providing analytics results to the enterprise, and so they eventually fall behind.
The InsightEdge Approach: High Performance Spark with OLTP Capabilities
So, how are we solving this at GigaSpaces? Today it’s my pleasure to announce the first GA release of a new GigaSpaces product that aims to make the life of the insight-driven business much easier: InsightEdge 1.0 GA. The premise we set out to deliver on earlier this year, through our early access release, was a high-performance Spark distribution with enterprise-grade OLTP capabilities. The key driver behind this approach is the core principle that the GigaSpaces In-Memory Computing product portfolio has long promoted as the fundamental way to scale and optimize massive data processing systems:
Move analytics workloads (computation) as close as possible to the data source and to data in motion, thereby enabling a hybrid transactional/analytical processing data store (XAP In-Memory Data Grid) that can turn Spark into an analytics system of record. This solves many of the performance problems (by tiering Spark job execution between the data grid and Spark workers), as well as the slow feedback loop problem (through bidirectional integration between application-generated data and Spark data structures).
This was made possible by leveraging the GigaSpaces XAP open source in-memory data grid as the core storage and processing engine underneath InsightEdge Spark. In-memory data grids give us a way to process both transactional and analytical workloads at ultra-low latency and high throughput, all while providing high availability and distributed processing across nodes, data centers, and clouds. Our goal is to combine the sophisticated analytics and ease of use of Apache Spark with a high-performance, ultra-low-latency in-memory data grid that has been battle tested over the last decade across leading financial services, retail, telecommunication, and transportation institutions.
Going Beyond the Current Apache Spark Capabilities
We also recognize that the need for a truly hybrid system of transactions and analytical processing goes beyond the combination of a scale-out in-memory data grid and Apache Spark. Thus, we’ve enhanced Spark with a handful of capabilities that enable advanced enterprise-ready and data pipeline workloads:
Leveraging the native GeoSpatial API and query engine of the underlying XAP in-memory data grid, developers can represent GeoSpatial shapes (Point, Circle, Polygon) as native attributes in DataFrames. Moreover, they can run GeoSpatial queries (nearest, within, intersects) using the filtering semantics of DataFrames or RDD queries.
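To make the "within" and "nearest" query types concrete, here is a minimal sketch in plain Python. It is not the InsightEdge or XAP API: the Point class, the haversine_km, within and nearest helpers, and the sample rows are all hypothetical, and a brute-force scan stands in for the data grid's query engine. The "intersects" predicate is left out to keep the sketch short.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


@dataclass(frozen=True)
class Point:
    """Hypothetical shape type; not the XAP GeoSpatial class."""
    lat: float  # degrees
    lon: float  # degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lat2 = math.radians(a.lat), math.radians(b.lat)
    dlat = lat2 - lat1
    dlon = math.radians(b.lon - a.lon)
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within(rows, center: Point, radius_km: float):
    """Keep rows whose 'location' lies inside a circle (a 'within' filter)."""
    return [r for r in rows if haversine_km(r["location"], center) <= radius_km]


def nearest(rows, target: Point, k: int = 1):
    """Return the k rows closest to the target (a 'nearest' query)."""
    return sorted(rows, key=lambda r: haversine_km(r["location"], target))[:k]


# Hypothetical rows: the shape is just another attribute of each record.
drivers = [
    {"id": "d1", "location": Point(40.7580, -73.9855)},  # Times Square
    {"id": "d2", "location": Point(40.6892, -74.0445)},  # Statue of Liberty
    {"id": "d3", "location": Point(51.5074, -0.1278)},   # London
]
rider = Point(40.7484, -73.9857)  # Empire State Building

nearby = within(drivers, rider, radius_km=10.0)  # d1 and d2
closest = nearest(drivers, rider, k=1)           # d1
```

In a real deployment the predicate would be evaluated by the grid's spatial index rather than by scanning every row, which is the point of pushing the query down to the data.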
Real-Time Transactional App Integration
Simply put, you can deploy your REST services or Web application side by side within an in-memory data grid that hosts persisted Spark RDDs/DataFrames. This provides bidirectional integration where any data transfer object (POJO, JSON, Document) can be surfaced as a Spark entity (DataFrame, RDD) and vice versa.
Multi-Data Center Replication and Disaster Recovery
To make Spark more enterprise ready, we have built in the multi-data center technologies of the data grid, enabling master-master and master-slave Spark cluster replication for disaster recovery or edge IoT analytics scenarios.
What are the use case patterns for InsightEdge?
Since our early access release in March 2016, we’ve had the privilege and opportunity to pilot our solution with a couple of existing GigaSpaces customers. Below are a few reference architectures representing some fast data analytics use cases:
Hybrid Transactions / Analytics Processing: The architecture below is concerned with enabling a fast data closed-loop analytics pipeline. This allows the co-location of transactional applications side by side with Apache Spark data structures. The underlying data grid provides an eventing model that lets both tiers of the architecture react to data changes (e.g. a newly deployed Spark ML model triggering a new score for some transactional data).
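The eventing idea can be illustrated with a toy observer pattern in plain Python. This is only a sketch under stated assumptions: EventedStore, on_change, the key naming scheme, and the one-weight "model" are all hypothetical and are not the data grid's actual notification API.

```python
from typing import Callable, Dict, List

Listener = Callable[[str, dict], None]


class EventedStore:
    """Toy key-value store that notifies listeners on every write."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def write(self, key: str, value: dict) -> None:
        self._data[key] = value
        for listener in self._listeners:
            listener(key, value)

    def read(self, key: str) -> dict:
        return self._data[key]

    def keys(self, prefix: str) -> List[str]:
        return [k for k in self._data if k.startswith(prefix)]

    def __contains__(self, key: str) -> bool:
        return key in self._data


store = EventedStore()
scores: Dict[str, float] = {}  # kept outside the store to avoid re-triggering events


def score(txn: dict, model: dict) -> float:
    # Hypothetical "model": a single weight applied to the amount.
    return model["weight"] * txn["amount"]


def on_change(key: str, value: dict) -> None:
    if key == "model":
        # Analytics tier deployed a new model: rescore all stored transactions.
        for txn_key in store.keys("txn:"):
            scores[txn_key] = score(store.read(txn_key), value)
    elif key.startswith("txn:") and "model" in store:
        # Transactional tier wrote new data: score it with the current model.
        scores[key] = score(value, store.read("model"))


store.subscribe(on_change)
store.write("model", {"weight": 0.5})
store.write("txn:1", {"amount": 100.0})  # scored on arrival
store.write("txn:2", {"amount": 40.0})
store.write("model", {"weight": 2.0})    # triggers a rescore of both transactions
```

Both directions go through the same write path: a transaction arriving is scored immediately, and a model change rescores what is already stored, without either tier polling the other.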
Streaming Analytics Live Data Mart for Spark: To simplify the lambda architecture approach of blending real-time and historical data, the underlying data grid acts as a hybrid store that on one hand handles Spark stream ingestion and queries, and on the other blends the streamed data with existing historical data from relational and NoSQL data sources. The “Serving Layer” now becomes a unified query interface.
Geo-Analytics and Spatial Data Processing with Spark: Location is often the missing piece that needs to be blended with customer profiles to improve their experience. The architecture below (coming soon in a separate blog post) showcases a simple Uber “price surge” data pipeline that integrates mobile web services, GPS data, and Spark Streaming deployed under one cluster.
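As a rough illustration of a unified serving layer, the sketch below keeps historical rows and streamed micro-batches in one toy store and answers queries through a single code path. The LiveDataMart class, its methods, and the sample sales rows are hypothetical stand-ins, not InsightEdge or Spark Streaming APIs.

```python
from typing import Dict, Iterable, List


class LiveDataMart:
    """Toy hybrid store: historical rows loaded once, live rows ingested continuously."""

    def __init__(self, historical: Iterable[dict]) -> None:
        self._rows: List[dict] = list(historical)

    def ingest(self, batch: Iterable[dict]) -> None:
        """Append one micro-batch arriving from the stream."""
        self._rows.extend(batch)

    def total_by(self, key: str, value: str) -> Dict[str, float]:
        """Single query path: aggregate over historical and live rows alike."""
        totals: Dict[str, float] = {}
        for row in self._rows:
            totals[row[key]] = totals.get(row[key], 0.0) + row[value]
        return totals


# Hypothetical historical data, e.g. loaded from a relational source.
mart = LiveDataMart([
    {"region": "east", "sales": 100.0},
    {"region": "west", "sales": 50.0},
])

# Hypothetical micro-batches arriving from a stream.
mart.ingest([{"region": "east", "sales": 25.0}])
mart.ingest([{"region": "north", "sales": 10.0}])

totals = mart.total_by("region", "sales")
# totals == {"east": 125.0, "west": 50.0, "north": 10.0}
```

The caller never has to merge a batch view with a speed view, which is the simplification this pattern offers over a classic lambda setup.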
How Can I Start Using InsightEdge?
InsightEdge 1.0 GA has been released as open source under the Apache 2 license. There are a variety of channels available to the community to try InsightEdge, follow the latest updates, engage with fellow users, seek guidance from product experts, and get more information.
Learn: Website ◼ Documentation ◼ Blog ◼ Demo Videos
Try: Download ◼ Zeppelin Interactive Demo ◼ Github Repo
Engage: Slack ◼ Stack Overflow ◼ Email
Follow: Twitter ◼ Facebook