Data is born fast, but its insight value is often short-lived. That is a challenge many enterprises seeking to seize business moments are trying to solve, whether it’s a financial services firm building a fraud detection system, a telecommunications provider alerting its users to extra charges based on location, or a retailer giving shoppers better offers in real time as they browse its catalog.
Over the last decade, we’ve often looked at data from a storage and historical perspective. But we live in a world of converged infrastructures and heterogeneous, interconnected touch points that are creating a vast data footprint, one that demands extracting insight as soon as data is born. Enterprises are now more interested in turning transient data (fast transactions, clickstreams, geolocations, and sensor readings) into actionable insights to create transformational, even disruptive, business opportunities. Already, we are seeing a significant shift from accumulating data lakes at scale to analyzing data at speed. The latter demands new architectural approaches to fast data analytics, which is the problem we aim to solve.
The Sub-second Data to Action Lifecycle Imperative
Based on our experience with GigaSpaces customers building extreme transaction processing systems, we are finding a growing number of use cases that combine popular, sophisticated analytics frameworks (mostly Apache Spark) with transactional data sources in one unified solution. This eliminates both the cost of the ETL-to-Hadoop bottleneck and the operational complexity of integrating a real-time streaming analytics data pipeline with transactional applications. Here are some important trends that are driving this imperative:
Hyper-personalization and Omni-channel
Customer experience is a top priority for 75% of data executives (Forrester) and a fundamental driving force behind real-time analytics adoption. To provide a seamless and contextual user experience across all customer touch points (Web, Mobile, In-Store, Call Center), a typical retailer or financial services firm needs to converge the customer’s historical data with real-time transactions across online and offline data domains within a few seconds.
Analytics over Transient Data
In today’s world of real-time, high-throughput, data-generating applications, most of the data that we deal with has a short-lived value. Consider the three classes of fast-moving data below and their insight value a few seconds after creation versus a few minutes or hours later. In all the instances below, it’s not feasible to move all the data generated by a particular source to a centralized data center or cloud for processing. It makes sense, from an insight-to-action latency perspective, to capture insight from the data before transmitting it to the center.
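To make the idea of capturing insight before transmission concrete, here is a minimal sketch in plain Python. It is not InsightEdge code: the Summary record, the summarize_at_edge function, the sensor name, and the alert threshold are all hypothetical names chosen for illustration. It assumes an edge node that sees a window of raw sensor readings and sends only a compact summary upstream.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Summary:
    """Compact record an edge node would send upstream (hypothetical shape)."""
    sensor_id: str
    count: int
    mean_value: float
    max_value: float
    alerts: int


def summarize_at_edge(sensor_id, readings, alert_threshold):
    """Reduce one window of raw readings to a single summary, locally."""
    if not readings:
        raise ValueError("window is empty")
    return Summary(
        sensor_id=sensor_id,
        count=len(readings),
        mean_value=mean(readings),
        max_value=max(readings),
        # Count readings that exceed the (assumed) alert threshold.
        alerts=sum(1 for r in readings if r > alert_threshold),
    )


# Five raw readings stay at the edge; one summary record travels onward.
window = [70.0, 71.0, 69.0, 95.0, 70.0]
summary = summarize_at_edge("sensor-1", window, alert_threshold=90.0)
print(summary)
```

The point of the sketch is the ratio: however many readings arrive in a window, only one small record has to cross the network, and the alert count is available at the source the moment the window closes.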
InsightEdge: Moving analytics to data, not the other way around
A growing number of big data processing platforms have gained traction in the last couple of years to solve the disk I/O problems of massive data workloads. Apache Spark is most notable for providing an immutable caching layer on top of NoSQL data stores to unify batch, streaming, and other complex analytics under one common API and data structure (RDD/DataFrames). However, we cannot address the challenges of connecting insight with action just by applying the same data processing techniques faster. Once we gain insight, how do we make it actionable in real time?
Our experience with customers has shown that enterprises that rely on data insights as a primary means of competitive differentiation are far more successful when their analytics workloads leverage a decentralized and distributed in-memory computing approach. Instead of collecting data through streams or ETL into a centralized data lake for post-processing, analytic workloads must run at the data source or network edge. To help enable this connection between analytics and business impact, we look at in-memory computing (beyond just caching scenarios) as a key architectural component that rounds out the capabilities required for a modern fast data ecosystem.
In-memory data grids give us a way to process both transactional and analytical workloads at ultra-low latency and high throughput, all while providing high availability and distributed processing across nodes, data centers, and clouds. Our goal is to combine the sophisticated analytics and ease of use of Apache Spark with a high-performance, ultra-low-latency in-memory data grid that has been battle-tested over the last decade across leading financial services, retail, telecommunications, and transportation institutions.
Quick intro to what InsightEdge is about:
We provide an implementation of all Spark APIs (Spark Core, SQL, Streaming, MLlib, and GraphX) on top of a high-performance, extreme transaction processing, in-memory data grid that leverages RAM and, optionally, SSD/Flash storage for low-latency workloads. Our goal is to tier the storage and processing of Spark data and workloads between Spark workers and the underlying data grid containers. This significantly reduces disk, network, compute, and Spark memory management bottlenecks in complex analytics workloads.
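The tiering idea can be illustrated with a toy two-tier store in plain Python. This is only a sketch of the general technique and says nothing about how InsightEdge or the data grid implements it: the TieredStore class, its least-recently-used eviction policy, and its method names are assumptions made for the example, and both tiers are ordinary in-process dictionaries standing in for RAM and SSD/Flash.

```python
from collections import OrderedDict


class TieredStore:
    """Toy key-value store: a bounded 'hot' tier backed by a 'cold' tier.

    Both tiers are in-process dicts here. In a real deployment the hot
    tier would be RAM and the cold tier SSD/Flash (an assumption made
    for illustration only).
    """

    def __init__(self, hot_capacity):
        if hot_capacity < 1:
            raise ValueError("hot_capacity must be at least 1")
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # least recently used entry first
        self.cold = {}

    def put(self, key, value):
        self.cold.pop(key, None)  # drop any stale copy in the cold tier
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._evict()

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as recently used
            return self.hot[key]
        value = self.cold.pop(key)  # raises KeyError if absent everywhere
        self.put(key, value)  # promote to the hot tier on access
        return value

    def _evict(self):
        # Demote least recently used entries until the hot tier fits.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value


store = TieredStore(hot_capacity=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)  # "a" is least recently used and moves to the cold tier
```

Reading "a" again would promote it back to the hot tier and demote "b". Frequently touched data stays in the fast tier while the slower tier absorbs the overflow, which is the behavior that tiering is meant to provide.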
Simplified Real-time Data Pipelines
Our technology provides a single unified cluster that combines Spark, polyglot data APIs (Objects, Geospatial, Documents, JSON, and so on), and seamless connectivity with upstream (e.g. Kafka) and downstream (e.g. HDFS, Cassandra, MongoDB) data sources. We believe this is the fastest and simplest way for enterprises to stand up streaming pipelines at the data source or edge.
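The shape of such a pipeline (an upstream feed, an in-memory processing step, and a downstream store) can be sketched with Python generators. Nothing here uses Spark, Kafka, or the InsightEdge API: source, running_totals, and sink are hypothetical stand-ins, and the sample events are made up for the example.

```python
def source(events):
    """Stand-in for an upstream feed such as a Kafka topic."""
    yield from events


def running_totals(stream):
    """Keep per-user spend in memory and emit the new total per event."""
    totals = {}
    for user, amount in stream:
        totals[user] = totals.get(user, 0) + amount
        yield user, totals[user]


def sink(stream, store):
    """Stand-in for a downstream store such as HDFS, Cassandra, or MongoDB."""
    for record in stream:
        store.append(record)
    return store


# Each event flows through all three stages before the next one is read.
events = [("alice", 20), ("bob", 5), ("alice", 30)]
archived = sink(running_totals(source(events)), [])
print(archived)
```

Because the stages are chained generators, every event is enriched with in-memory state as it arrives rather than being landed first and processed later, which is the property a single unified cluster is meant to preserve at scale.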
Where to next?
Our first early access release of InsightEdge will be available to download from our website when the Strata+Hadoop San Jose 2016 conference commences. While we work towards a GA release in June 2016, you can learn more about InsightEdge and browse the documentation. Don’t forget to follow us on Twitter and LinkedIn, or stop by to chat with us on Slack.