According to a recent estimate by IDC, unstructured data accounts for more than 80% of an enterprise’s data by volume. This massive corpus includes call center transcripts, product reviews, feedback forms, support case descriptions, social media posts, and blog articles. And in the age of IoT, customers aren’t the only sources of unstructured data: sensors and network equipment generate log files that carry valuable information. Leaving most of this data untapped prevents enterprises from gaining visibility and insight into customer-facing business operations.
Such data growth places heavy performance and scalability demands on data lake and analytics infrastructures. With in-memory computing architectures (such as GigaSpaces XAP) combined with distributed analytics and machine learning (GigaSpaces InsightEdge), in-memory data grids can query and analyze live text data feeds, operationalizing unstructured data lakes in real time.
Applications of Real-time Text Analysis
Mining an endless stream of textual data can unlock deeply hidden insights across many industries. Organizations can apply in-memory data processing along with text mining algorithms to improve customer experience, reduce churn, and predict future customer demands. Let’s consider some of the use cases where real-time text analytics can move the needle:
Fraud detection is a billion-dollar problem in finance, affecting consumers and banks alike. Financial firms can analyze call center records and voice transcripts, and combine them with geospatial data, to detect and prevent fraud through predictive analytics and machine learning techniques.
Monitoring product reviews and public activity on social media is now a business necessity for the omnichannel retailer. Retailers want to track topics about their brand that are trending on Twitter for real-time channel retargeting. They also want to be informed instantly when customers post something with negative sentiment about their brand.
As healthcare becomes more digital, the accuracy of ambulatory and hospital patient records is critical, and structuring health records is a key requirement for improving the quality of care. Healthcare organizations can analyze unstructured physician notes in real time to predict epidemic outbreaks and to power accurate medical decision support algorithms.
Why Data Lake + Search Engine is Not Enough
The current data processing approach analyzes unstructured data by building a data lake architecture. A data lake is a large-scale data warehouse that holds vast amounts of unstructured data, to be transformed and analyzed when needed. In data warehousing terms, a data lake implements an Extract-Load-Transform (ELT) data pipeline: raw data is loaded first and transformed only when it is read. Consequently, the data lake will host hundreds (if not thousands) of terabytes of unstructured data (JSON files, text files, logs), which makes HDFS a common choice for the data store. In such an architecture, search is ultimately a necessary component for both information indexing and retrieval and for data catalog discovery.
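To make the Extract-Load-Transform ordering concrete, here is a minimal Python sketch. It is illustrative only: the in-memory list is a stand-in for the data lake store (HDFS in the architecture above), and the names `load_raw` and `transform_on_read` are hypothetical, not part of any product API.

```python
import json

# Stand-in for the data lake: raw records are stored exactly as they arrive.
data_lake = []

def load_raw(record: str) -> None:
    """Load step: persist the raw payload without parsing or validating it."""
    data_lake.append(record)

def transform_on_read(keyword: str) -> list:
    """Transform step, deferred until query time: parse each raw record
    and keep the text of those that mention the keyword."""
    matches = []
    for raw in data_lake:
        try:
            doc = json.loads(raw)
            # JSON objects carry their text in a "text" field (an assumption
            # of this sketch); any other JSON value is used as-is.
            text = doc.get("text", "") if isinstance(doc, dict) else str(doc)
        except json.JSONDecodeError:
            # Not JSON (for example a log line): treat the whole record as text.
            text = raw
        if keyword.lower() in str(text).lower():
            matches.append(text)
    return matches

load_raw('{"source": "review", "text": "Battery died after two days"}')
load_raw("2017-03-01 12:00:01 WARN sensor-7 battery low")
load_raw('{"source": "tweet", "text": "Love the new camera"}')

battery_mentions = transform_on_read("battery")
```

Because parsing happens on every read, each query pays the full scan and transform cost again, which is why a lake of this shape depends on a separate search or indexing layer.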
While popular search engines (Solr, ElasticSearch) are great at managing and indexing data lake contents, they are not built for low-latency, event-driven, real-time text search against flowing streams of data, which is what the above use cases require. What we need is the ability to ingest, consume, and analyze billions of unstructured data points and seamlessly execute continuous real-time queries against them to generate contextual insights that are immediately accessible to customer-facing applications.
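The idea of a continuous query can be sketched in a few lines of Python: instead of indexing first and searching later, standing queries are registered up front and every incoming document is checked against them as it arrives. This is a simplified illustration, not the XAP API; the class `ContinuousQueryRegistry` and its methods are hypothetical names, and matching is reduced to exact keyword lookup.

```python
from collections import defaultdict
from typing import Callable, Dict, List

class ContinuousQueryRegistry:
    """Holds standing keyword queries and evaluates every incoming
    document against them at ingestion time."""

    def __init__(self) -> None:
        # keyword -> callbacks to invoke when a document contains that keyword
        self._listeners: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def register(self, keyword: str, callback: Callable[[str], None]) -> None:
        self._listeners[keyword.lower()].append(callback)

    def ingest(self, document: str) -> None:
        # Tokenize once, then fire the callbacks of each registered keyword present.
        tokens = {t.strip(".,!?").lower() for t in document.split()}
        for token in tokens:
            if token in self._listeners:
                for callback in self._listeners[token]:
                    callback(document)

alerts = []
registry = ContinuousQueryRegistry()
registry.register("refund", alerts.append)
registry.register("outage", alerts.append)

stream = [
    "Great service today!",
    "Still waiting on my refund.",
    "Is there an outage in Boston?",
]
for post in stream:
    registry.ingest(post)
```

After the loop, `alerts` holds only the two posts that matched a standing query, and each was delivered at the moment it was ingested rather than after a later batch search.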
XAP 12.1 Search and Query: Operationalizing Data Lake Intelligence
Because an in-memory grid consolidates the storage of data in RAM and Flash with the processing of business logic in the same runtime space, real-time and event-driven text analysis can be accomplished in milliseconds, as opposed to the minutes or hours it takes using a traditional search engine.
With XAP 12.1, we’ve extended the data modeling capabilities of our in-memory data grid to allow for full-text indexing against in-memory data. This new capability, along with the others listed below, provides the foundation of an operationalized data lake:
In-Memory Full-Text Indexing: XAP 12.1 introduces a Full-Text Search API based on Lucene indexes and analyzers, so users can run search queries (wildcard, fuzzy match) in memory at high throughput and low latency. Combined with XAP’s event-driven container API, applications can trigger events and messages in the moment based on text search criteria.
Hybrid RAM/Flash Data Processing: To expand the in-memory data grid footprint beyond a few terabytes, XAP provides a multi-tiered data storage architecture (also known as MemoryXtend) that can scale low-latency data processing across RAM and Flash arrays to hundreds of terabytes.
Apache Spark Integration via InsightEdge: The core in-memory data grid engine that powers XAP can also be used through any Apache Spark API. This means that any geospatial, full-text, or structured data query result automatically becomes an RDD or a DataFrame. For a data lake architecture, this provides very efficient ad hoc in-memory data ingestion and transformation from HDFS to XAP through Spark jobs.
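To give a feel for what in-memory full-text indexing with wildcard and fuzzy queries involves, here is a small plain-Python stand-in. It is not the XAP Full-Text Search API and does not use Lucene: the class `InMemoryTextIndex` is a hypothetical name, wildcard matching is approximated with the standard library’s `fnmatch`, and fuzzy matching with `difflib`.

```python
import difflib
import fnmatch
import re
from collections import defaultdict

class InMemoryTextIndex:
    """A toy inverted index held entirely in memory."""

    def __init__(self):
        self.docs = {}                     # doc_id -> original text
        self.postings = defaultdict(set)   # term -> ids of documents containing it

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        # Very simple analyzer: lowercase, split on non-alphanumerics.
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def _ids_for(self, terms):
        ids = set()
        for term in terms:
            ids |= self.postings[term]
        return ids

    def wildcard(self, pattern):
        """Return ids of documents with a term matching a shell-style pattern."""
        terms = fnmatch.filter(list(self.postings), pattern.lower())
        return self._ids_for(terms)

    def fuzzy(self, word, cutoff=0.8):
        """Return ids of documents with a term similar to `word`."""
        terms = difflib.get_close_matches(
            word.lower(), list(self.postings), n=10, cutoff=cutoff
        )
        return self._ids_for(terms)

index = InMemoryTextIndex()
index.add(1, "Customer reports a fraudulent charge on her card")
index.add(2, "Card was charged twice, requesting refund")
index.add(3, "Shipping was fast and the packaging was great")

charge_ids = index.wildcard("charg*")    # terms "charge" and "charged"
fraud_ids = index.fuzzy("fraudulant")    # misspelling of "fraudulent"
```

Both lookups only touch the term dictionary and posting sets in memory, which is the property that keeps query latency low; a production index would add proper analyzers, scoring, and partitioning across nodes.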
Leveraging an in-memory data grid (such as GigaSpaces XAP) coupled with high-performance distributed analytics tools (GigaSpaces InsightEdge) scales unstructured data and real-time text processing across many computing nodes. The results can be processed at millisecond latencies to produce meaningful insights at the speed of business. Such performance and scalability are among the primary enablers of democratizing insight from data lakes.