Clarifying the In-Memory Conundrum
In-Memory Computing has been available in the market since the late 1990s. Some of the first commercial solutions were launched by GigaSpaces in 2000 and by Tangosol in 2001 (Tangosol was acquired by Oracle in 2007). The increase in network speed, computing power, and memory capacity made memory-based solutions a viable way to handle high-performance, data-heavy tasks. The first use cases revolved around command-and-control and trading systems. Since then, the industry has grown, and today many vendors offer in-memory technology, mainly in the Java space but also in .NET, C++, and other programming languages.
Although the technology has been available and widely deployed for years, many people are still confused by in-memory computing terminology and the various disciplines it encapsulates, including terms like In-Memory Data Grids (IMDG), In-Memory Databases (IMDB), and In-Memory Computing. Are they one and the same?
The answer is No. So let’s break it down for you.
In-Memory Database (IMDB) is a full-featured, standalone database management system that primarily relies on RAM (Random Access Memory) for data storage. In other words, rather than employing a disk storage mechanism, it uses RAM. The rationale is simple: the Hard Disk Drive (HDD), based on magnetic storage technology first introduced by IBM in 1956, is orders of magnitude slower than RAM. IMDBs are designed first to achieve minimal response time by eliminating the need to access the disk, and second for data scalability. They are, however, limited in terms of application scalability. Some IMDBs require customers to rip and replace their existing databases, and they may also be limited in the types of data models they can store. Depending on how data is stored (in a row-based or columnar layout), an in-memory database can provide fast response times for write- or read-intensive workloads.
In-Memory Data Grid
In-Memory Data Grid (IMDG) is a simple-to-deploy, highly distributed, and cost-effective solution for accelerating and scaling services and applications. It is a high-throughput, low-latency data fabric that minimizes access to high-latency, hard-disk-drive-based or solid-state-drive-based data storage. The application and the data co-locate in the same memory space, reducing data movement over the network and providing both data and application scalability. Some in-memory data grids support any data model (multimodel store); data can be ingested directly from real-time data sources or copied from an RDBMS, NoSQL, or other data storage infrastructure into RAM, where processing is much faster. Some data grids also provide a unified API to access data from external databases and data lakes, in essence expanding the managed data to petabytes while accelerating queries and analytics. This is a unique capability; you can read more about accessing external databases.
Gartner states that “the IMDG is a key technology enabler for in-memory computing (IMC), a computing style in which the primary data store for applications (the database of record) is in the central (or main) memory of the (distributed) computing environment running the applications”. This design is optimized for negligible data access latency, even when large data volumes need to be queried for real-time analytics. See the following use case with a digital bank for more.
So What’s the Difference?
While In-Memory Data Grids share many of the features of In-Memory Databases, there are important differences.
One significant difference is that with an In-Memory Data Grid you can co-locate the business logic (application) with the data. With an In-Memory Database, the engine running the business logic or models resides on an application server (or client) while the data resides on the server side. This is not a semantic distinction: in the latter case, the data must travel over the network, which is significantly slower than running in the same memory space, because of the added network overhead.
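The difference can be sketched in a few lines of Python. This is an illustrative toy, not a real grid API: names like `GridNode` and `execute` are assumptions made for the example. The point is that the task runs where the data lives, so only small results cross the network instead of every record.

```python
class GridNode:
    """One partition of a hypothetical in-memory data grid holding a slice of the data."""
    def __init__(self, records):
        self.records = records  # data lives in this node's memory

    def execute(self, task):
        # Co-located execution: the task runs where the data lives,
        # so only the (small) result crosses the network.
        return task(self.records)

def total_balance(records):
    return sum(r["balance"] for r in records)

nodes = [
    GridNode([{"id": 1, "balance": 100}, {"id": 2, "balance": 250}]),
    GridNode([{"id": 3, "balance": 50}]),
]

# Each node computes its partial result locally; only scalars are merged.
partials = [node.execute(total_balance) for node in nodes]
grand_total = sum(partials)
print(grand_total)  # 400
```

In the client/server alternative, all five fields of every record would be serialized and shipped to the application before the sum could even begin.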
This also affects the scalability factor. While IMDBs can handle data scalability, In-Memory Data Grids’ distributed design allows complete scalability of both data and application load by simply adding a new node to the cluster.
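The scale-out mechanics can be illustrated with a toy partitioned store. The `Cluster` class and its rebalancing logic below are assumptions made for this sketch; production grids typically use consistent hashing so that a new node takes over only a fraction of the keys rather than triggering a full redistribution.

```python
import zlib

class Cluster:
    """Toy model of scaling a data grid by adding nodes (illustrative only)."""
    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _owner(self, key):
        # Deterministic hash routing of a key to the node that owns it.
        return zlib.crc32(key.encode()) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._owner(key)][key] = value

    def get(self, key):
        return self.nodes[self._owner(key)].get(key)

    def add_node(self):
        # Scaling out: add a node, then redistribute entries across N+1 nodes.
        entries = [kv for node in self.nodes for kv in node.items()]
        self.nodes.append(dict())
        for node in self.nodes:
            node.clear()
        for key, value in entries:
            self.put(key, value)

cluster = Cluster(2)
cluster.put("user:1", "Ada")
cluster.put("user:2", "Grace")
cluster.add_node()            # more storage and compute capacity, same data
print(cluster.get("user:1"))  # Ada
```

Because application logic runs on the same nodes, adding a node grows compute capacity along with storage, which is the "complete scalability" described above.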
Other differences have to do with data types. While IMDBs usually handle structured data, some IMDGs also support semi-structured and unstructured data. And finally, some IMDGs seamlessly integrate with machine learning and deep learning frameworks.
Image: In-Memory Data Grid Architecture for unlimited scale
Table: Comparing In-Memory Database vs. In-Memory Data Grid vs. GigaSpaces In-Memory Computing Platform
Not all In-Memory Data Grids Are Created Equal
When considering an in-memory data grid, it’s imperative to pay attention to several important features that may differ from one in-memory data grid to another.
Some in-memory data grids offer strong consistency, which means that a read always returns the most recently written data. Others offer only eventual consistency, meaning that reading data immediately after an update might return the value from before the update; in some cases, data may even be lost after a successful write.
Eventual consistency is acceptable for non-critical use cases. However, for critical applications such as booking, ordering, billing, or money transfers, strong consistency is required.
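The distinction can be simulated with a toy replica pair. The class below is an illustrative model, not a real product API: the primary applies writes immediately, while the replica lags until replication runs, so a read against the replica can return stale data.

```python
class EventuallyConsistentStore:
    """Toy replica pair where the secondary lags behind the primary (illustrative)."""
    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = []   # writes waiting to be replicated

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))   # replication happens "later"

    def read_from_replica(self, key):
        # Under eventual consistency, a read may return stale (or no) data.
        return self.replica.get(key)

    def replicate(self):
        # Asynchronous replication, simulated as an explicit step.
        for key, value in self.pending:
            self.replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read_from_replica("balance")   # None: update not yet applied
store.replicate()
fresh = store.read_from_replica("balance")   # 100: replicas have converged
```

A strongly consistent grid would make `read_from_replica` wait (or route to the primary) so that the stale read in the middle could never happen.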
ACID (atomicity, consistency, isolation, durability) is a set of database transaction properties intended to guarantee validity even in the event of errors or power failures. A sequence of data operations that satisfies the ACID properties is called a transaction. For example, a transfer of funds from one bank account to another, even one involving multiple changes such as debiting one account and crediting another, is a single transaction. Some IMDGs support ACID transactions; some don't.
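The funds-transfer example can be sketched as follows. This is a minimal illustration of the all-or-nothing (atomicity) property using a snapshot-and-rollback approach; real IMDBs and IMDGs provide transactions natively, and the function names here are assumptions for the sketch.

```python
class InsufficientFunds(Exception):
    pass

def transfer(accounts, src, dst, amount):
    # Take a snapshot so the whole transaction can be rolled back.
    snapshot = dict(accounts)
    try:
        if accounts[src] < amount:
            raise InsufficientFunds(src)
        accounts[src] -= amount   # debit one account...
        accounts[dst] += amount   # ...and credit the other, as one unit
    except Exception:
        accounts.clear()
        accounts.update(snapshot)  # rollback: no partial transfer survives
        raise

accounts = {"alice": 100, "bob": 0}
transfer(accounts, "alice", "bob", 60)
print(accounts)  # {'alice': 40, 'bob': 60}
```

If the second transfer below fails partway through, the rollback guarantees neither the debit nor the credit is visible, which is exactly what atomicity requires.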
Balancing cost and performance
Some in-memory data grids are limited to RAM only; in other words, they support only hot storage, not warm storage like SSD (Solid State Drive) or cold storage such as cloud object stores. As your data volumes grow, intelligent data life-cycle management that moves data according to customized business logic will accelerate access to data on SSD and in external databases, data lakes, and warehouses, so that you can balance cost and performance.
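A life-cycle policy of this kind can be sketched in a few lines. The tier names, the idle-time threshold, and the function name below are all assumptions for illustration; the idea is simply that entries not touched recently are demoted out of RAM to a cheaper tier.

```python
def apply_lifecycle_policy(hot, warm, last_access, now, max_idle):
    """Demote entries from the hot tier (RAM) to the warm tier (e.g., SSD)
    when they have been idle longer than max_idle. Illustrative sketch."""
    for key in list(hot):
        if now - last_access[key] > max_idle:
            warm[key] = hot.pop(key)   # move cold data out of RAM

hot, warm = {"a": 1, "b": 2}, {}
last_access = {"a": 100, "b": 5}   # hypothetical access timestamps
apply_lifecycle_policy(hot, warm, last_access, now=110, max_idle=30)
print(hot, warm)  # {'a': 1} {'b': 2}
```

In a real platform the demotion predicate would be driven by your business logic (access frequency, record age, data type) rather than a single idle timer.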
Another way to balance cost and performance is by auto-scaling. With this approach, a cluster can automatically scale up and down based on the workload. This way you only use and pay for the resources that are actually needed, and there is no need for overprovisioning.
Indexing is yet another major factor impacting performance. Some IMDGs offer only primary-key access, which makes them unsuitable for users who need to perform complex queries on their data.
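The cost of primary-key-only access is easy to see in a sketch. Without a secondary index, a query on a non-key field must scan every entry; with one, it becomes a direct lookup. The field names and helper functions here are illustrative assumptions, not a real grid API.

```python
from collections import defaultdict

store = {}                     # primary-key access: id -> record
by_city = defaultdict(set)     # secondary index on the "city" field

def put(record):
    store[record["id"]] = record
    by_city[record["city"]].add(record["id"])   # keep the index in sync

def query_by_city(city):
    # With the index: a direct lookup instead of scanning every record.
    return [store[i] for i in by_city[city]]

put({"id": 1, "city": "Paris"})
put({"id": 2, "city": "London"})
put({"id": 3, "city": "Paris"})
print(len(query_by_city("Paris")))  # 2
```

Grids that support secondary (and compound) indexes do this bookkeeping for you, which is what makes rich query support practical at scale.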
Image: Example of tiered storage and integration with external databases
BI and visualization tools
If you plan to use BI tools such as Tableau, make sure your in-memory data grid offers native connectors or supports standard database APIs such as JDBC or ODBC. This has a double benefit: you'll accelerate report generation, and your BI reports will reflect fresh, operational data for real-time views and analysis.
SQL is the de facto query language of business analysts. If your in-memory data grid has its own proprietary query language, your analysts will need to learn a new language. But the implications may go beyond the learning curve: some proprietary query languages don't support distributed JOIN clauses, which limits the application's ability to run complex queries and calculations.
Many data grids are limited to one type of data model: key-value, object, or document stores. This means that when the need arises to handle other types of data, such as semi-structured data or unstructured data like text and images, you will need to deploy and maintain additional platforms. To future-proof your technology stack, make sure your data grid solution supports multiple types of data: structured, semi-structured, and unstructured.
Machine Learning and Event-Driven Analytics
In-Memory Data Grids' fast performance and scale are a perfect combination for powering complex real-time machine learning queries, but only if they include a complete framework to support them. Some In-Memory Data Grids come with a complete Spark distribution, including ML, SQL, Streaming, and GraphX. With others, you'll need to install all of these components separately and rely on limited integrations. Including a Spark distribution (rather than just a connector) allows for speed optimization and redundancy: predicate pushdown, aggregation pushdown, and column pruning filter and aggregate data within the distributed data store instead of loading all of the data into Spark, increasing resource efficiency and speed.
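Pushdown can be illustrated without Spark itself. In this toy sketch (the partition layout and function names are assumptions), the filter and the aggregation both run inside the data store's partitions, so the analytics engine receives a couple of partial sums instead of every raw row.

```python
partitions = [
    [{"country": "US", "amount": 10}, {"country": "FR", "amount": 7}],
    [{"country": "US", "amount": 5}],
]

def pushdown_sum(partition, predicate):
    # Runs inside the data store: filter (predicate pushdown) and aggregate
    # (aggregation pushdown) before any data moves to the compute engine.
    return sum(r["amount"] for r in partition if predicate(r))

partials = [pushdown_sum(p, lambda r: r["country"] == "US") for p in partitions]
us_total = sum(partials)   # only two small numbers crossed the "network"
print(us_total)  # 15
```

Without pushdown, all three rows (and every column of each) would have been shipped to the engine just to be filtered and summed there.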
Image: Convergence of Multiple Data Models
This enables faster and smarter insights by accelerating applications in which real-time analytics, machine learning, and deep learning are used, such as predictive maintenance, live risk analysis, fraud detection, location-based advertising, dynamic pricing, and more.
Some in-memory data grids also provide high availability, so if a Spark executor fails in production (for example, due to an out-of-memory exception), the process does not have to be restarted from scratch, because its state is stored in memory for instant recovery.
Event-Driven Analytics is yet another powerful feature, allowing you to trigger a method when an event takes place. For example, with an online processing service, you may want to trigger a notification upon receiving an event that a payment has been canceled.
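The payment-cancellation example can be sketched with a minimal publish/subscribe pattern. The event names, `on`/`emit` helpers, and payload shape below are illustrative assumptions; real grids expose this through their own notification or event-container APIs.

```python
handlers = {}        # event type -> list of registered callbacks
notifications = []   # stand-in for an outbound notification channel

def on(event_type, handler):
    # Register a method to be triggered when a matching event arrives.
    handlers.setdefault(event_type, []).append(handler)

def emit(event):
    # Deliver the event to every handler registered for its type.
    for handler in handlers.get(event["type"], []):
        handler(event)

on("payment_canceled",
   lambda e: notifications.append(f"notify user {e['user']}"))

emit({"type": "payment_canceled", "user": 42})   # triggers the handler
emit({"type": "payment_settled", "user": 7})     # no handler registered
print(notifications)  # ['notify user 42']
```

Because the handlers run inside the grid, next to the data, the reaction happens within the same memory space as the write that caused it.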
Additionally, the ability to contextualize streaming and transactional data with historical data, at speed and at scale, feeds the machine learning feature vectors and allows for continuous retraining of models to ensure the required accuracy.
DataOps is about automatically managing the entire life cycle of data: provisioning resources, scaling the database, and tuning data for performance. Some data grids are evolving to cater to DataOps, offering auto-provisioning, auto-recovery, self-monitoring, and self-service orchestration.
Another key capability is replicating data between different regions or clouds, where each space can be physically located in a different geographical location. This is a common deployment topology for maintaining data locality for analytics, as well as for disaster recovery planning and failover. It can also support hybrid and multi-cloud deployments in an optimal manner. While most data grids require a third-party product such as Solace or WANdisco to achieve this, others offer this functionality out of the box.
Image: Example of hybrid deployment
Analytics and In-Memory
In recent years, as more analytics and machine learning-powered models are developed and deployed, In-Memory Computing solutions are increasingly bridging the gap between transactional processing and analytics. Rather than replicating operational data for analytics, which results in data duplication and rear-view-mirror analytics, unified transactional/analytical processing (known as HTAP, augmented transactions, or translytical data platforms) powered by an In-Memory Data Grid offers real-time analytics in a fast feedback loop. Read more about Forrester's Translytical data platforms.
Image: Traditional vs. Unified Transactional/Analytical Processing
So Where Do We Go From Here? In-Memory Data Grids and Computing To the Cloud!
In-Memory Data Grids are no longer enough. The market requires evolving to advanced In-Memory Computing platforms that increasingly reside in the cloud. Last year, Gartner published new research arguing that the future of database management is in the cloud. Their message: on-premise is the new legacy; the cloud is the future. All organizations, big and small, will use the cloud to an increasing extent. Even large organizations that maintain on-premise systems will move to a hybrid configuration supporting both cloud and on-premise. What does this mean for you? Make sure your applications and services can run anywhere: on-premise, in the cloud, in multi-clouds, and in any hybrid configuration.