Contents
1. Overview
2. Data lake vs data warehouse vs data hub
3. Analytical data vs. operational data
4. Rebuilding with a futuristic vision using digital hub architecture
5. Is data integration complicated
6. Can data lakes and data warehouses handle data synchronization
7. Consider a Digital Data Hub
8. Roles of Embedded Event-Driven Architecture
The normalization of instantly available content and personalized data has sharply intensified competition among organizations, leading to an explosion in digital services. This has severely stretched organizations' ability to deliver the "always fresh, always on" data that modern digital applications need. IT, data and application professionals differ on how best to overcome this challenge, especially for transactional data. Some believe a data warehouse or data lake can offer a solution. A database stores and processes data; it supports Online Transaction Processing (OLTP) and enables users and applications to interact with the data. Databases are normally controlled using a database management system (DBMS).
A data lake is a massive repository of structured and unstructured data whose purpose has not yet been defined. Data lakes are often used by organizations that work with massive amounts of raw data for machine learning. Data lake solutions store data as-is, without the need to transform it first.
A data warehouse is a repository of highly structured historical data that has been processed for a defined purpose. It is commonly used by business analysts who need to run analytics against a structured system.
Some will assess data warehouses vs. data lakes, while others may set both aside and weigh data warehouses and data lakes against data hubs.
Most IT professionals do agree, however, that enterprises, more than ever, need to modernize their backend and middleware architecture to improve performance for the digital age, lower the TCO of their infrastructure, and optimize the data supply and data consumption food chain.
In my recent dialogues with IT and business executives, some of the key challenges they raise derive from a gap between the growing appetite for digital applications and the pace at which data can be made available and served to business applications. These professionals recognize that a new approach is needed but are frequently challenged with finding a solution that meets their needs. Hence, they fall back to familiar solution buckets such as data warehouses, data lakes, data stores and the like.
The natural tendency to seek solutions that fall into familiar categories is understandable, but nevertheless may limit an organization's options when seeking to solve new problems. Going beyond the familiar requires organizations to shift their focus from IT operations to delivering positive customer experiences. As part of this shift, organizations face numerous questions and challenges such as:
- How to create consistency across all channels, brands and devices?
- How to contextualize digital services based on real-time circumstances, location and indirect referential data?
- How to serve data to services in a proper fashion and a timely manner to meet an individual customerโs needs and expectations?
- How to deliver optimum personalized digital experiences?
To understand the technical gap that organizations must overcome in tackling these challenges, we'll break down the components of this ecosystem, and then rebuild it, better.
Data lake vs data warehouse vs data hub: Which best fits your organization's transactional data needs?
When assessing the most appropriate solution to meet the data needs of modern digital applications, IT and application integration teams should first and foremost ask themselves which use case, or business outcome, they seek to address. Digital services that rely on real-time data have specific needs that may not necessarily be served by an organization's existing technology or data stack. Many public sources, such as technical blogs, compare the pros and cons of relational databases, NoSQL databases, data warehouses (DWH) and data lakes. This wide range of data stores and database technologies inadvertently causes confusion in the industry about which should be used for what. Ultimately it's not a question of data warehouse vs data lake, but rather whether these solutions address the use case the organization needs to address.
Analytical data vs. operational data
As a general rule, before jumping into the details of each solution type, it is best to differentiate solutions designed for analytical use cases from those designed to meet the real-time, low-latency needs of transactional use cases. Analytical data refers to historical data that is mined and processed to reveal patterns, trends, and insights that aid strategic decision-making. By understanding past performance and identifying market trends, businesses use insights from analytics to formulate long-term strategies.
Organizations also need to figure out what portion of their data is operational, to avoid turning data warehouse platforms into something they are not. Operational workloads support real-time processes, incorporating transactional information such as inventory control, order processing, and financial transactions. This data reflects the current state of the business and serves the applications that run it. It is constantly changing and is critical for immediate decision-making and task execution. By focusing on the purpose for which each technological solution was designed, we can address every component in the proper context of the enterprise architecture and optimize utilization and costs.
When considering the leading solutions as part of modernizing your enterprise architecture, the following factors should be taken into account:
- Continuous data integration
- Data consumption and exposure
- SQL interfaces
- Data compression
- Multiple native stacks vs. a fully integrated solution
- Supported data formats
- How each data store solution updates data
Rebuilding with a futuristic vision using digital hub architecture
Here's something that won't come as a shock to you: building software architecture is complex. Architects need to sync multiple data sources, multiple data types and pipelines, and the transformations that run between these sources.
One well-established notion is that data lakes and data warehouses fall short with Event-Driven Architecture, as they are unable to serve APIs quickly and with high concurrency.
First, the ingress: moving data into data lakes and warehouses is an offline or batch process, which almost always builds in delay and high latency when data is served from them.
Second, the egress: most solutions expose SQL and REST APIs on top of the data lake, and these are simply not fast enough to meet the latency demands of business applications.
To cope with these shortcomings, application developers started building small databases adjacent to business applications, often referred to as "data marts" or "local caches". This architecture pattern causes excessive data duplication across the different marts, inefficiencies, and high overall latency. Even worse, it often compromises data integrity between channels or applications. A common symptom of this pattern: executing a basic "get my account information" query and receiving different results on the mobile app than on the website, a true story that happened to me with a local credit card company.
A Digital Integration Hub (DIH) offers real-time, low-latency data delivery from backend systems to digital apps and business services. The platform is built on the data hub concept. It eliminates this workaround and its related issues by decoupling business applications from backend Systems of Record (SoRs) using event-based or batch replication patterns. The organization's operational data is reflected in a consolidated fabric that powers real-time access through advanced microservices exposing the relevant APIs, thereby accelerating API serving.
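To make the decoupling pattern concrete, here is a minimal sketch in plain Java, not GigaSpaces code: a replication feed (event-based or batch) keeps an in-memory view of the system of record fresh, and the API layer reads from that view instead of querying the SoR directly. All class and method names here are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the DIH decoupling pattern: a replication feed
// keeps an in-memory view fresh, and APIs read from the view, not the SoR.
public class AccountView {
    // Consolidated, always-fresh operational view (the "fabric").
    private final Map<String, Account> view = new ConcurrentHashMap<>();

    // Called by the replication pipeline (event-based or batch) on every change.
    public void onReplicatedChange(Account account) {
        view.put(account.id(), account);
    }

    // Called by the API layer; never touches the backend system of record.
    public Account getAccount(String id) {
        return view.get(id);
    }

    public record Account(String id, String owner, long balanceCents) {}
}
```

The point of the sketch is the direction of the arrows: applications never reach back into the SoR on the request path, so read latency is bounded by the in-memory view rather than the backend.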
Is data integration complicated and where do CDC and ETL solutions fit into the mix?
All databases can ingest data from ETL (Extract, Transform and Load) solutions or via Change Data Capture (CDC), which can be integrated with common databases and message brokers, so you might ask: what's the big deal here?
Here's the thing: the initial integration is not all that complicated. The truly hard work begins after integration, when architects, DBAs and developers have to do all kinds of wrangling to solve common integration challenges in existing systems, with countless production workflows that often have indirect dependencies due to modern event-driven and API-based patterns. Before diving into the different challenges, let's examine the simple data extraction and ingestion pipeline and what we need to handle (a sketch of two of these items follows the list):
- Data conflicts and reconciliation
- Multiple CDC streams
- Concurrent Initial Load and CDC without any downtime to data access or business services
- Schema evolution or adding new/existing tables dynamically to an ongoing CDC without restarting the service
- Scaling CDC streams to align with higher ingress/egress
- Handling logical data misalignments
- Metadata management and โtaggingโ data to map relationships between data and services
- Data freshness validation
- Data integrity between the DB and the โSystem of Engagementโ (SOE)
- Reflecting transactional data from multiple tables in the SOE when pushing to a restreaming service
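As a concrete illustration of just two items above, conflict reconciliation and duplicate or out-of-order events arriving from multiple CDC streams, here is a minimal hedged sketch in Java. Everything in it, including the shape of the event, is a simplifying assumption rather than any real CDC library's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: applying CDC events idempotently when several streams
// may deliver the same change, out of order or more than once.
public class CdcApplier {
    // A change event: row key, monotonically increasing source sequence, payload.
    public record CdcEvent(String key, long sequence, Map<String, Object> row) {}

    private record Versioned(long sequence, Map<String, Object> row) {}

    private final Map<String, Versioned> table = new ConcurrentHashMap<>();

    // An event wins only if it is newer than what we already hold, so
    // duplicates and stale replays from parallel streams are dropped.
    public void apply(CdcEvent event) {
        table.merge(event.key(),
                new Versioned(event.sequence(), event.row()),
                (current, incoming) ->
                        incoming.sequence() > current.sequence() ? incoming : current);
    }
}
```

Even this toy hints at why the list above is hard: the per-row sequence it relies on has to survive schema evolution, initial loads running concurrently with live CDC, and rescaled stream partitions.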
There might be other post-integration challenges, but most solutions in the market fall into one of the following categories: CDC, ETL, databases/NoSQL, or microservices, and thus lack the holistic capabilities to handle the entire data lifecycle between SoRs and the business services. An off-the-shelf digital hub solution such as Smart DIH, thanks to its unified, holistic architecture and monitoring capabilities, seamlessly unifies and manages that entire lifecycle.
Can data lakes and data warehouses handle data synchronization?
Data lakes and data warehouses are not optimized to meet transactional and operational workloads. The following table gives an overview of how data hubs, data warehouses and data lakes compare in the ways they handle data:
Consider a Digital Data Hub for numerous applications
Organizations face a growing need to scale up their digital services rapidly. This strong digital appetite comes with growing pains in performance, cost, and manageability as the number of applications grows beyond a certain comfort threshold.
Leveraging a converged, distributed real-time data hub solution with an embedded lightweight Java application server provides unprecedented performance and scale that can't be achieved when different solutions are manually stitched together. The benefits include maintaining data integrity across a combination of collections and normalized relational data, together with the ability to perform certain operations, such as "joins", across data in different formats.
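To give a feel for the cross-format "join" being described, here is a toy sketch in plain Java streams that joins schemaless documents to normalized relational-style rows on a shared key. It illustrates the concept only and is not the product's query engine.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy sketch: joining a schemaless document collection with normalized
// relational-style rows on a shared customer id.
public class CrossFormatJoin {
    record OrderRow(String customerId, String orderId, long totalCents) {}

    public static void main(String[] args) {
        // Document-style collection (e.g., customer profiles).
        List<Map<String, Object>> customers = List.of(
                Map.of("id", "c1", "name", "Ada"),
                Map.of("id", "c2", "name", "Lin"));

        // Normalized relational-style rows (e.g., orders).
        List<OrderRow> orders = List.of(
                new OrderRow("c1", "o1", 4200),
                new OrderRow("c1", "o2", 1100));

        Map<String, List<OrderRow>> byCustomer = orders.stream()
                .collect(Collectors.groupingBy(OrderRow::customerId));

        // "Join": enrich each document with its matching rows.
        customers.forEach(c -> System.out.println(
                c.get("name") + " -> " + byCustomer.getOrDefault(c.get("id"), List.of())));
    }
}
```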
Effective data management is the premise for delivering strategic business value from digital services. This requires domain-oriented, decentralized data ownership, combined with a microservices-driven architecture for accessing enterprise shared data. This consolidated architecture provides more flexible and easier scaling for parallel reuse of functionality and data. As in classic microservices architecture that uses a collection per service, data is duplicated between collections.
Multichannel integrity is achieved by reusing the same "data access services" from a single source of truth, as depicted here (and sketched in code after the figure):
The GigaSpaces Data Hub: Unified multi-model data store pattern
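A tiny sketch of the single-source-of-truth point: when both channels reuse the same data-access service, a "get my account information" request cannot diverge between mobile and web, which is exactly the failure described earlier. All names here are illustrative.

```java
// Illustrative only: both channels reuse one data-access service, so the same
// account query cannot return different answers on mobile vs. web.
public class Channels {
    interface AccountService { String accountInfo(String id); }

    record MobileChannel(AccountService accounts) {
        String show(String id) { return accounts.accountInfo(id); }
    }

    record WebChannel(AccountService accounts) {
        String show(String id) { return accounts.accountInfo(id); }
    }

    public static void main(String[] args) {
        AccountService shared = id -> "account " + id + " @ single source of truth";
        System.out.println(new MobileChannel(shared).show("c1"));
        System.out.println(new WebChannel(shared).show("c1")); // identical result
    }
}
```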
What role does Embedded Event-Driven Architecture play?
Many organizations have adopted Event-Driven Architecture (EDA) methodologies and design principles as part of their data management strategy (more on this in Kai Waehner's excellent blog). Companies such as Uber and Netflix are textbook examples of using EDA effectively. But here's one major caveat: these are technology shops that happen to be streaming movies or orchestrating commutes, with their entire budgets built around these specific operations, a luxury most organizations don't have.
To achieve a simpler architecture that also provides a lower-latency real-time response, embedded EDA (eEDA) embeds events, message queues and notifications directly into the extreme low-latency, in-memory workflow. This design, as opposed to traditional SOA with its heavy multi-process communication and data transfer, is a real-time fabric based on the "Spaces" principles.
To enhance the utilization of events, GigaSpaces created an architecture with the following unique characteristics (a simplified sketch follows the list):
- Embedded Event Triggers
- Embedded Event Management Engine
- Embedded Event Priority Based Queues
- Embedded Event Priority Based Clusters (grouping)
- Embedded Outbound Messaging System (pub/sub notification pattern)
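As a rough illustration of what "embedded" means in practice, events queued and dispatched inside the same process as the data rather than through an external broker, here is a minimal sketch of an in-process priority-based event queue in plain Java. It is an assumption-laden toy, not the GigaSpaces implementation.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Minimal sketch of an in-process, priority-based event queue: events are
// produced and consumed inside the same JVM as the data, with no broker hop.
public class EmbeddedEventQueue {
    record Event(int priority, String payload) {}

    private final PriorityBlockingQueue<Event> queue =
            new PriorityBlockingQueue<>(64,
                    (a, b) -> Integer.compare(b.priority(), a.priority())); // high first

    public void trigger(Event e) { queue.put(e); } // "embedded event trigger"

    // A single in-process dispatcher thread drains events by priority.
    public void startDispatcher() {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Event e = queue.take();
                    System.out.println("handling " + e);
                }
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }
}
```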
Event processing is improved immensely with co-location: business logic is injected to run in the same memory space as the data on the data fabric. The technological benefits include the following (a FIFO-group sketch follows the list):
- Durable notifications via fully durable pub/sub messaging for data consistency and reliability
- FIFO Groups ensure in-order and exclusive processing of events
- No need to transfer events from the data tier to the service tier
- Related data can be co-located to the same group while parallelizing across additional groups
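A rough sketch of the FIFO-group idea follows: events that share a group key are processed one at a time and in arrival order, while different groups proceed in parallel. The routing scheme here (hashing a group key onto single-threaded executors) is an illustrative assumption, not the product's internals.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of FIFO groups: in-order, exclusive processing per group key,
// with parallelism across groups.
public class FifoGroups {
    private final ExecutorService[] lanes;

    public FifoGroups(int parallelism) {
        lanes = new ExecutorService[parallelism];
        for (int i = 0; i < parallelism; i++) {
            // One thread per lane => strict FIFO within a lane.
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // All events with the same group key land on the same lane,
    // so they are handled in order and never concurrently.
    public void submit(String groupKey, Runnable handler) {
        int lane = Math.floorMod(groupKey.hashCode(), lanes.length);
        lanes[lane].execute(handler);
    }
}
```

The same trick is what lets related data be co-located with its handler while unrelated groups scale out across additional lanes.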
With reduced latency for business applications, the IT team can easily add contextual information to queries while increasing the overall volume of customer interactions.
Event-Driven Architecture for inbound and outbound
Reduced total cost of ownership
All architects know a simple truth: a design isn't viable if its costs are unacceptable. Let's keep this notion in mind when examining the trend of shifting to cloud computing in order to reduce costs.
The cloud has endless advantages; when used irresponsibly, however, it can backfire without compassion. The following quote from the Firebolt blog captures this irony: "If you look at the Fivetran benchmark, which managed 1TB of data, most of the clusters cost $16 per hour. That was the enterprise pricing for Snowflake ($2 per credit). Running business-critical or Virtual Private Snowflake (VPS) would be $4 or more per credit. Running it full-time with 1TB of data would be roughly $300,000 per year at list price." (For scale: $16 per hour at $2 per credit implies 8 credits per hour, so VPS at $4 per credit is about $32 per hour, and 24/7 operation works out to $32 × 8,760 hours ≈ $280,000 a year, roughly the quoted figure.)
For operational data, we often need tens or even hundreds of TB, resulting in an overpriced architecture for the data tier alone, before accounting for other middleware components such as CDC, ETL, caching and others.
With the GigaSpaces digital integration hub solution, a unified and performance-optimized technology creates efficiency at scale. The platform reduces the need to replicate and mobilize data while simplifying data management. It substitutes costly standalone elements, driving direct and indirect cost savings by optimizing data management, reducing the overall footprint, cutting usage of and dependency on costly existing elements, and lowering operational load and maintenance costs.
GigaSpaces customers report a reduction in operational costs of 40-75%. The reduction in software and maintenance costs varies based on which elements are replaced or optimized when GigaSpaces is introduced into the solution architecture stack. Here's one example: a fully digital bank operating in Sweden made an entire stack of commercial RDBMS licenses redundant after two years of using the GigaSpaces solution, eventually substituting it with a standalone GigaSpaces DIH as the bank's operational data store (ODS).
With the GigaSpaces solution in place, enterprises can also replace standalone data replication solutions that extract data to a single ODS, eliminating additional costly expenditures. They are also no longer required to add caching solutions, such as Redis, on top of the ODS.
Additional benefits include allowing software engineers to focus on developing new business logic instead of spending time on data-related and integration challenges, shortening time-to-service from months to days and reducing costs associated with human error.
Lastly, ongoing maintenance and support costs are reduced, as is the expertise required per workflow. This is achieved by standardizing data pipelines and data microservices through the no-code and low-code options provided with the GigaSpaces solution.
The blue line indicates a lower operational cost over time when using Smart DIH versus a โDIY Solutionโ leveraging multiple products
Putting it all together: the full DIH package
After careful examination of the different technologies required to build a robust and cost-effective solution, GigaSpaces built the solution architecture for the modern operational data store in the form of a Digital Integration Hub (DIH). This solution enables organizations to focus on converging business and technology, reducing stack complexity, and providing fast response times for new and upgraded digital services while reducing overall costs.
By simply upgrading a database or adding a newer middleware component, organizations tend to improve performance in the short term, but the additional costs and overall complexity don't provide the required ROI.
We could keep diving deep into the IT gap and closely examine the specs of different data stores, but ironically enough, the biggest challenges organizations face in digital transformation are not technological in nature. Rather, they revolve around changing the thought paradigm of the managers signing off on these changes.