After having the opportunity to present this topic at the recent Big Data Expo in Utrecht, Netherlands (and a few other expos), I thought it would also be a good time to share my thinking with the online community. Data storage and processing has always been a dominant concern for any enterprise, and with today’s fast-evolving technology it is necessary to design flexible, adaptable data solutions so that new technologies can easily be added, removed or swapped out. We need dynamic solutions that are agile, able to evolve quickly, grow bigger than ever and deliver linearly scalable performance.
In this post I’ll be going through a quick overview of the big data technology space and the current dominant players, with a short, concise round-up of the features and trade-offs of each. This should give the necessary context before moving on to the core of this post, where we will see how we can create a mesh of these technologies and offer our users a common, high-speed access API to each of those services through a high-speed data-store. Finally, we will see some rich query examples that we can perform on our data set.
Overview of the Big Data landscape
Let’s start with one of the more usual suspects, SQL. These are the standard RDBMSs offering the common SQL API, which enables aggregations such as joins and group-bys and comes with very strong (transactional) consistency levels. As with any technology there are always trade-offs, and in this case they are scalability, performance and rigid data modelling. SQL technologies are ideal as master data stores but will need to be fronted with a caching mechanism to improve performance.
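The caching pattern above can be sketched in a few lines. This is a minimal read-through cache: the RDBMS stays the master data store, and a plain dict stands in for the in-memory caching tier. The table and class names are illustrative, not from any specific product.

```python
import sqlite3

# Read-through cache sketch: reads hit the in-memory cache first
# and fall back to the RDBMS (the master data store) on a miss.
class ReadThroughCache:
    def __init__(self, conn):
        self.conn = conn
        self.cache = {}

    def get_user(self, user_id):
        if user_id in self.cache:          # cache hit: skip the database
            return self.cache[user_id]
        row = self.conn.execute(
            "SELECT id, name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        if row is not None:
            self.cache[user_id] = row      # populate the cache on a miss
        return row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice')")

store = ReadThroughCache(conn)
print(store.get_user(1))   # first read goes to SQL
print(store.get_user(1))   # second read is served from memory
```

A real deployment would of course also need invalidation or expiry when the master data changes; the sketch only shows the read path.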
With NoSQL technologies we can generally enjoy better performance and scalability, due to their distributed nature, and also benefit from a highly flexible data model (object storage). But such technologies tend to fall short on strong consistency (they are typically eventually consistent) and on aggregation operations, although some do support them. Map/Reduce operations are an added benefit.
Another hot technology that has been around for a number of years now is In-Memory Data Grid technology, a.k.a. IMDGs. IMDGs are distributed in-memory object data stores that also provide a processing / compute layer near the data. That means each data partition has its own instance of business code, enabling data transformation close to the source and avoiding the overheads of reading, changing and updating remotely. IMDGs provide strong consistency levels, high availability, built-in DR and data-center replication, meaning they can serve as a resilient, robust, high-performance data store and compute layer. Of course, the downside is that everything lives in memory, so attaching asynchronous persistency is fundamental.
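The "compute near the data" idea can be illustrated with a toy grid: entries are hash-routed to partitions, and a task is broadcast so that each partition runs the business logic against its local data only. The partitioning and task API here are a sketch, not any specific IMDG product’s API.

```python
# Toy IMDG sketch: ship the code to the data instead of reading
# objects out, changing them and writing them back.
class Partition:
    def __init__(self):
        self.store = {}

    def execute(self, task):
        # run the task against this partition's local data only
        return task(self.store)

class Grid:
    def __init__(self, partitions=4):
        self.partitions = [Partition() for _ in range(partitions)]

    def put(self, key, value):
        # route each entry to its owning partition by hash
        self.partitions[hash(key) % len(self.partitions)].store[key] = value

    def execute_everywhere(self, task):
        # broadcast the task and collect the partial results
        return [p.execute(task) for p in self.partitions]

grid = Grid()
for i in range(100):
    grid.put(f"order-{i}", {"amount": float(i)})

# in-place transformation: apply a 10% discount without moving the data
def discount(store):
    for entry in store.values():
        entry["amount"] *= 0.9
    return len(store)

counts = grid.execute_everywhere(discount)
print(sum(counts))  # total entries touched across all partitions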
Very similar to NoSQL and IMDGs are KV (key/value) technologies. These are generally tailored for extremely fast reads, hence one of their most common use-cases is data caching. They do not usually offer aggregation, projection or partial updates, nor strong consistency levels.
Streaming solutions are another example of a new and disruptive kind of technology in the big data arena. Designed to be ideally stateless (although some do offer hooks to maintain state in external data stores), these solutions are tailored for extremely fast, distributed stream processing. This is all about enabling true real-time decision-making based on insights.
As evident from the illustrations, SSD or flash storage is another key technology that is quickly shaping the big data landscape. To keep this post in scope, we will cover more on this in the next post.
To sum up, there is a plethora of technologies, each serving a subset of use-cases, with no one-size-fits-all solution. Projects and solutions usually span a number of diverse use-cases, and therefore it is important that we put a platform in place that can aggregate the relevant technologies and provide us with the benefits and capabilities of each.
Designing the high-speed data-hub
The first thing we have to take into consideration is that each of the data-storage / processing technologies we covered previously offers its own query semantics, and therefore a data-hub delivering a mesh of them will need to cater for as many of these as possible; hence our choice of an IMDG as the integration bus.
Another important factor to take into account is the ease of integration with such technologies. To our advantage, many IMDGs already offer OOTB (out-of-the-box) connectors which can help fast-track the integration process.
Why do we need multiple query semantics? Take, for example, the case of an online media store like Netflix or Amazon Prime. Let’s take three basic flows:
- Search movies / TV shows
- Purchase a movie or TV show
- View purchase history
(1) Searching is ideally done through Elasticsearch, (2) order processing is done as a transaction and (3) viewing the purchase history is a combination of both. Movies and TV shows are indexed in Elasticsearch from the grid and also mirrored to SSD or an external persistency store (SQL / NoSQL). Order processing is executed as a transactional operation and, if successful, the order is mirrored to our external persistency store, which is our master data store. Processed orders then also need to be synchronised into Elasticsearch so that they are searchable by the user at any point.
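The purchase flow above can be sketched end to end. Here plain dicts and lists stand in for the grid, the master store and the search index; in a real deployment the mirroring would be asynchronous and the backends would be real SQL/NoSQL and Elasticsearch instances. All names are illustrative.

```python
# Sketch of the purchase flow: the grid takes the transactional write,
# then the order is mirrored to the master store and pushed to the
# search index so purchase history stays searchable.
class DataHub:
    def __init__(self):
        self.grid = {}            # in-memory store for hot data
        self.master_store = {}    # external persistency (SQL / NoSQL)
        self.search_index = []    # documents pushed to the search tier

    def purchase(self, order_id, user, title):
        order = {"id": order_id, "user": user, "title": title}
        self.grid[order_id] = order            # (2) transactional write
        self.master_store[order_id] = order    # mirror to the master store
        self.search_index.append(order)        # (3) make it searchable

    def purchase_history(self, user):
        # served from the search tier, not the transactional path
        return [o for o in self.search_index if o["user"] == user]

hub = DataHub()
hub.purchase(1, "alice", "The Matrix")
hub.purchase(2, "alice", "Dune")
hub.purchase(3, "bob", "Alien")
print(hub.purchase_history("alice"))
```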
For the front-end implementation it is important to provide an abstraction over such operations. You should always avoid point-to-point integration with the relevant data sources for each operation, which is why a solution like a data-hub that aggregates these sources and provides a common access API aligns nicely with the best practices of a decoupled architecture.
Furthermore, the right-hand side of the illustration shows how we can additionally integrate with stream processors and Hadoop to deliver real-time and batch insights. What is important to mention here is that if the stream processor needs to maintain state, an IMDG can offer this capability without impacting performance.
Inside the grid
So let’s have a quick look at what really goes on in the grid and how the integration is actually achieved. The basic idea is to use the relevant mechanism to integrate with each external component depending on the case. For example, pushing new data immediately into Elasticsearch to allow fast searching could be realised using a synchronisation end-point or mirroring capability, where the data in the grid is replicated asynchronously to Elasticsearch. We can use the same mechanism to mirror our data into a traditional RDBMS or NoSQL store, using it perhaps as a DR capability or archive container that enables us to refresh the grid when necessary or retrieve any data not in memory. We can also configure this to be synchronous and transactional if strong consistency is desired.
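The mirroring mechanism can be sketched as a write-behind queue: grid writes are acknowledged immediately and queued, then flushed to the external store in batches. For simplicity the flush here runs inline once a batch fills up; in a real grid it would run asynchronously on a background mirror service. The class name and batch size are illustrative.

```python
from collections import deque

# Write-behind mirroring sketch: the in-memory write is acked right
# away, and pending entries are drained to the external store in batches.
class MirroredGrid:
    def __init__(self, external_store, batch_size=2):
        self.data = {}
        self.pending = deque()
        self.external_store = external_store
        self.batch_size = batch_size

    def write(self, key, value):
        self.data[key] = value            # in-memory write, acked immediately
        self.pending.append((key, value))
        if len(self.pending) >= self.batch_size:
            self.flush()                  # asynchronous in a real grid

    def flush(self):
        # drain the queue to the external store (RDBMS / NoSQL / Elasticsearch)
        while self.pending:
            key, value = self.pending.popleft()
            self.external_store[key] = value

rdbms = {}
grid = MirroredGrid(rdbms)
grid.write("a", 1)
grid.write("b", 2)   # the second write fills the batch and triggers a flush
print(rdbms)
```

Making this synchronous and transactional, as mentioned above, would simply mean flushing on every write inside the same transaction boundary, trading latency for consistency.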
We can also leverage an event-driven approach, whereby we use the grid as a messaging platform to transform and massage incoming data before pushing it out to downstream systems. This way we can ensure our data is enriched before being sent to other services or systems.
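A minimal sketch of that event-driven enrichment: listeners registered on the grid massage each incoming entry before it is forwarded downstream. The listener registration API shown is illustrative.

```python
# Event-driven enrichment sketch: each write flows through registered
# listeners before being pushed to downstream systems.
class EventGrid:
    def __init__(self):
        self.listeners = []
        self.downstream = []

    def on_write(self, fn):
        self.listeners.append(fn)

    def write(self, entry):
        for fn in self.listeners:
            entry = fn(entry)             # transform / enrich in the grid
        self.downstream.append(entry)     # push to downstream systems

grid = EventGrid()
grid.on_write(lambda e: {**e, "currency": "EUR"})  # enrich with a default
grid.write({"order": 1, "amount": 10})
print(grid.downstream[0])
```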
Multiple Query Semantics
Now that we have our data-hub in place, let’s look at a few examples of the interesting, rich queries we can perform on our data set. In the first example we see nested queries and projections, meaning we can query nested fields, objects, or list and map values. We can also use projections to improve our solution’s efficiency by retrieving only the relevant fields.
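A sketch of what nested queries and projections boil down to, using plain dicts; a real grid query language would express the same thing declaratively (e.g. a field path like "meta.genre" and a projection listing the fields to return). The sample data is illustrative.

```python
# Nested query + projection sketch over plain dict documents.
movies = [
    {"title": "Dune", "meta": {"genre": "sci-fi", "year": 2021},
     "ratings": [8, 9]},
    {"title": "Heat", "meta": {"genre": "crime", "year": 1995},
     "ratings": [9, 10]},
]

# nested query: match on the field path meta.genre
matches = [m for m in movies if m["meta"]["genre"] == "sci-fi"]

# projection: retrieve only the fields we actually need
projected = [{"title": m["title"], "year": m["meta"]["year"]} for m in matches]
print(projected)
```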
Another important facet in the big data world is the ability to perform data aggregation at the source. Let’s look at some examples:
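The point of aggregating at the source is that the sum or average runs where the data lives and only the scalar result crosses the wire. A minimal sketch, with illustrative sample data:

```python
# Server-side aggregation sketch: compute sum and average at the data
# source instead of shipping every record to the client.
orders = [
    {"user": "alice", "amount": 12.0},
    {"user": "bob", "amount": 30.0},
    {"user": "alice", "amount": 8.0},
]

total = sum(o["amount"] for o in orders)   # only this scalar leaves the source
average = total / len(orders)
print(total, average)
```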
Some more complex & compound aggregation operations where we can perform grouping and filtering:
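A compound aggregation can be sketched as a filter, then a group-by, then a HAVING-style cut on the groups. The field names and threshold are illustrative.

```python
from collections import defaultdict

# Compound aggregation sketch: filter paid orders, group totals by
# user, then keep only groups above a threshold (a HAVING-style clause).
orders = [
    {"user": "alice", "amount": 12.0, "status": "paid"},
    {"user": "bob", "amount": 30.0, "status": "paid"},
    {"user": "alice", "amount": 8.0, "status": "refunded"},
    {"user": "bob", "amount": 5.0, "status": "paid"},
]

totals = defaultdict(float)
for o in orders:
    if o["status"] == "paid":              # filter before grouping
        totals[o["user"]] += o["amount"]

# HAVING total > 20
big_spenders = {u: t for u, t in totals.items() if t > 20}
print(big_spenders)
```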
And lastly some examples leveraging fast-update and change API:
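The idea behind a change API is to mutate individual fields in place rather than doing a full read-modify-write of the whole object. The change-set function below is a sketch of that pattern, not any specific product’s API.

```python
# Fast-update sketch: apply increments and field sets directly to the
# stored entry instead of rewriting the entire object.
store = {
    "order-1": {"status": "pending", "amount": 10.0},
}

def change(store, key, increments=None, sets=None):
    entry = store[key]
    for field, delta in (increments or {}).items():
        entry[field] += delta              # increment without a full rewrite
    for field, value in (sets or {}).items():
        entry[field] = value               # set a single field

change(store, "order-1", increments={"amount": 5.0},
       sets={"status": "paid"})
print(store["order-1"])
```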
We set off with a brief overview of the Big Data technology landscape and some of the key players and features out there. We then followed up with a proposal and a detailed explanation of how we can create an enterprise-wide, high-speed data-hub that can aggregate these data sources in an almost plug-and-play way.
If there is a single take-away from this post, it is the necessity of building agility and flexibility into our solutions to cater for the fast-paced evolution of technology, in order to benefit from reduced TCO, increased innovation momentum and a competitive market advantage.
In my next post I will cover the significance of SSD in the Big Data and In-Memory space and how we can introduce an intermediate layer to the Lambda architecture to cater for near real-time data processing.