We’re Living in a Data-Driven Era
Today, most organizations are focused on becoming data-driven as they seek to take advantage of data lakes, streaming and associated technologies for processing data in motion. This drive means increasing investment in data processing, analytics and machine learning software, because data and its rapid processing are key to grasping the opportunities presented by digital transformation and to delivering a range of benefits, including better operational efficiency and competitive advantage.
The Evolving Importance of Data
In a study of enterprise data and analytics that we at 451 Research performed at the end of 2018, Voice of the Enterprise Data & Analytics, 2H18, 75% of respondents said that data would become more important to their organization within 12 months.
Figure 1: The Increase in the Importance of Data in the Next 12 Months
This statistic shows that organizations, irrespective of their size or global location, are increasingly focusing on data: not just the data itself, but how it is stored, processed and analyzed. While this trend is not new, it is changing. The model of storing data in data warehouses for IT professionals to create reports and dashboards for decision-makers and data analysts is evolving, due to the continuous exponential growth of data and the increasing variety of data sources, including enterprise and mobile applications, logs, clickstreams, social media, bots, IoT devices, sensors and more. The range of approaches for storing, processing and analyzing that data is growing just as fast.
It’s no longer just the data warehouse; it’s also Hadoop, cloud storage, Spark and the use of AI and machine learning as part of the analytics process. The consumers of data have also grown beyond the decision-makers and data analysts using reports and dashboards to include data scientists and business users interacting directly with the data, as well as operational technology users.
Figure 2: The Evolving World of Data Sources and Data Users
The Growth of Data
As part of this evolution, the volumes of data being stored, processed and analyzed by organizations are increasing. Traditional applications dominated what we would consider the transactional era – which still accounts for a relatively large amount of data today, but nothing compared to what followed. With the introduction of e-commerce and web/server logs came the interaction era. Finally, the engagement era emerged with the increased interactivity introduced by mobile data, social media, recommendation engines, digital assistants, sensors and IoT, all of which accelerate the volume of data available to organizations that potentially needs to be stored, processed and analyzed.
Figure 3: Era Changes with the Growth of Data
Are Data Lakes the Solution for Data Storage?
To deal with these huge volumes of data, many organizations use data lakes for storage. The name reflects their similarity to actual lakes: large bodies of water fed from various source streams and accessed by multiple users for multiple purposes.
While the data lake idea was very attractive, at the outset several issues – such as the requirements, use cases and even how to build one – were unclear. Many organizations ended up placing large amounts of data in a Hadoop environment without any real idea of how to extract, or even identify, its contents, thereby failing to address how multiple users would access the data for multiple purposes.
Today, organizations are investing in their data lake strategies, with increased focus on the data integration pipeline and industrial-scale processes as part of the data lake. In particular, self-service access to the data processing pipeline and the underlying management and governance of data are being addressed.
Figure 4: The Data Processing Pipeline
Data Lakes Have Evolved but Can Fail to Meet the Speed and Performance Needs
Data lakes have rapidly evolved into a commodity service based on open source frameworks for storing and processing large volumes of data at low cost. Furthermore, the storage and processing of data at rest is increasingly being shifted to cloud storage environments being used as the basis for data lake environments.
But data lakes do not help with the processing of data in motion, and they fail to address organizations’ concerns about the speed of data: the rate at which it is produced and how rapidly it can be processed. The issue here is not just about processing data faster; it’s also about the frequency with which the data is queried – increasing the rate of analysis in order to generate more accurate and more timely business intelligence.
More frequent analysis actually requires a change of thinking from organizations. It isn’t just about processing and analyzing data more quickly; it’s about processing and analyzing data continuously. Furthermore, data lakes have to be continuously updated and refreshed in order to deliver fresh insights and ensure decision-making based on the latest data.
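The difference between faster batch runs and genuinely continuous analysis can be made concrete with a small sketch. The example below maintains a rolling average over an event stream, updating incrementally with each arriving value rather than recomputing over stored history; the class and data are purely illustrative, not taken from any specific streaming product.

```python
from collections import deque

class SlidingWindowAverage:
    """Maintains a rolling average over the last `window` events."""

    def __init__(self, window: int):
        self.window = window
        self.events = deque()
        self.total = 0.0

    def update(self, value: float) -> float:
        # Incorporate the new event and evict the oldest once the window is
        # full, so each update is O(1) instead of re-reading stored history.
        self.events.append(value)
        self.total += value
        if len(self.events) > self.window:
            self.total -= self.events.popleft()
        return self.total / len(self.events)

# Each arriving event immediately yields a fresh result -- the "continuous"
# model -- instead of waiting for the next scheduled batch job.
stream = [10.0, 12.0, 11.0, 30.0, 13.0]
avg = SlidingWindowAverage(window=3)
results = [avg.update(v) for v in stream]
```

The point of the sketch is the shape of the computation: state is carried forward and every event produces an up-to-date answer, which is what "processing continuously" means in practice.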
Batch Processing or Stream Processing?
Batch and stream processing sit at opposite ends of the spectrum, but it’s not a choice of one or the other. Batch processing of historical data can absolutely be complemented by real-time processing of live data. Consider, for example, a data science team using Hadoop or a cloud storage environment to process and analyze large volumes of historical data in order to identify the best time to present online offers in a retail application. The batch processing and model creation must be combined with real-time information about the customers (such as their recent browsing and search history) in order to create a targeted offer that could encourage customers to make a purchase or prevent them from churning.
This relies on real-time processing of data in motion while users interact with the application, even if the model is based on historical data processed in batch mode. And this explains the growth in the adoption of stream processing technologies.
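The retail-offer scenario above can be sketched in a few lines. This is a hypothetical simplification: the "model" is just a lookup of offer scores produced offline, and all the names, segments and thresholds are illustrative assumptions rather than any real system's logic.

```python
# Produced by a nightly batch job over historical data (e.g. in Hadoop or
# cloud storage): the propensity of each customer segment to accept an offer.
batch_model = {
    ("frequent_buyer", "evening"): 0.8,
    ("frequent_buyer", "morning"): 0.4,
    ("new_visitor", "evening"): 0.3,
}

def decide_offer(segment: str, time_of_day: str, live_session: dict) -> bool:
    """Combine the offline score with real-time signals from the session."""
    score = batch_model.get((segment, time_of_day), 0.1)
    # Real-time adjustment: a customer actively browsing right now is a
    # stronger candidate than the historical average alone suggests.
    if live_session.get("viewed_items", 0) >= 3:
        score += 0.2
    if live_session.get("cart_abandoned"):
        score += 0.3  # churn-prevention signal
    return score >= 0.7  # present the offer only above a threshold

# Decided per interaction, as the user's session unfolds:
decide_offer("frequent_buyer", "evening", {"viewed_items": 1})
decide_offer("new_visitor", "evening", {"viewed_items": 4, "cart_abandoned": True})
```

The design point is the split of responsibilities: the expensive learning runs in batch over data at rest, while the decision itself is made per interaction against data in motion.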
Adoption of Data Lake Environments
In one of our recent studies (VotE, Data & Analytics, 1H19), 58% of respondents indicated that they are currently using data lakes to some extent. Typical users are business intelligence professionals and personnel responsible for digital transformation initiatives, and 23% of users describe their data lake usage as strategic. Furthermore, just under 55% of respondents are using data lakes to some extent for the processing and analysis of streaming data, and 18% said they are doing so strategically. These trends are only going to grow, particularly as the adoption of stream processing becomes more mainstream.
Data at Rest and Data in Motion: Can Data Lakes Meet the Challenge?
Today and in the future, it’s not a matter of using just data at rest or data in motion. The key is actually using them together. But this presents challenges, since there are some real mismatches between the two architectures.
Data lakes are a great environment for storing and processing large amounts of data in a cost-effective manner. But there are issues – particularly latency, the mutability of the data, and the ability to update data once it’s in the data lake environment (rather than just refreshing with a new batch of data and rewriting). On the other hand, while data streaming enables real-time data processing, it’s a standalone environment, which lacks some of the historical context that can help shape decision-making based on the streaming data.
Today, the two are primarily being adopted separately. Batch processing and analytics are implemented on Hadoop and, increasingly, cloud storage, while data in motion is handled by streaming analytics based on standalone products and services, deployed as streaming data platforms either on-premises or in the cloud.
The real benefit lies in combining data processing on data at rest and data in motion, which is what we see in the new breed of applications that involve hybrid operational and analytic processing and the need to perform analytic processing on real-time operational data in motion.
Some applications – such as risk analysis, fraud detection, personalization recommendations, location-based advertising, predictive maintenance and dynamic pricing – actually rely on this combination of capabilities. These are automated applications which require a real-time decision, action or response based on the processing and analysis of both real-time information and historical data that helps feed the context.
To date, many have tried to provide this combination using the Lambda architecture, which is based on distributed data processing systems. Data processing in the Lambda architecture actually happens in two separate layers – the batch layer and the speed layer – which presents some significant challenges:
- The batch data is immutable and append-only.
- Query speed against the batch layer is slow.
- Because data is separated into two layers, there are potential challenges with duplication and consistency, and possible delays when serving up data from the batch layer.
- Maintaining at least two separate layers introduces architectural complexity.
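These challenges are easiest to see at the serving side, where every query must reconcile the two layers. The sketch below is an illustrative simplification of that merge, not any framework's API: the batch view is periodically recomputed and therefore stale, and the speed layer covers only events that arrived after the last batch run.

```python
# Batch view: page-view counts recomputed periodically over the full history.
# Immutable between runs -- anything newer is invisible here.
batch_view = {"home": 1000, "checkout": 250}

# Speed layer: incremental counts for events since the last batch recompute.
speed_view = {"home": 42, "search": 7}

def query(page: str) -> int:
    # The serving layer reconciles both views on every query. Keeping the two
    # in step (no double counting when a batch run absorbs the speed layer's
    # events) is the source of the duplication and consistency challenges.
    return batch_view.get(page, 0) + speed_view.get(page, 0)

query("home")    # combines batch and speed counts
query("search")  # present only in the speed layer so far
```

Even this toy version shows why the architecture is operationally heavy: the same counting logic effectively exists twice, once per layer, and both copies must agree.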
Figure 5: Lambda Architecture Challenges (source: http://lambda-architecture.net/)
It’s clear that data lakes have emerged as the primary platform for storing and processing large volumes of data for historical analysis. However, they have inherent limitations in terms of processing data in motion, because they were never originally designed for this purpose.
The processing of data in motion requires a change of thinking: it isn’t about processing and analyzing data more quickly, but about processing and analyzing it continuously. Additionally, batch processing of historical data can be complemented by real-time processing of live data; the two can clearly go together. It’s about how you architect and deliver them both in combination.
Finally, the Lambda architecture for processing data in two separate layers – the batch layer and the speed layer – presents users with a number of challenges, including duplication, inconsistency caused by delays in making new data available via the batch layer, and challenges in adding update operations to batch environments.
Vendors are seeking solutions to these challenges.
“GigaSpaces has taken a differentiated approach with its combination of the core data grid/cache functionality and Apache Spark to provide high-performance data ingestion, as well as a unified interface for both batch and real-time analytics.”
All this indicates a bright future for simplifying the data journey and accelerating data lakes to deliver faster, smarter insights.
Watch the VoD webinar to learn more: