Hadoop vs. Spark – An Accurate Question?
I just googled Hadoop vs. Spark and got nearly 35 million results. That’s because Hadoop and Spark are two of the most prominent distributed systems for processing data on the market today. It’s a hot subject that organizations are interested in when addressing their big data analytics. Choosing the Right Big Data Software; Which is the best Big Data Framework?; How Do Hadoop and Spark Stack Up? The Death of Hadoop?; and The New Age of Big Data are just some examples of blogs and articles I found.
So, what are these articles addressing? A comparison of Hadoop and Spark in order to determine which is the better solution? Whether Hadoop and Spark compete in the big data space? Is Hadoop required for Spark? Is Spark required for Hadoop? Is Spark faster than Hadoop? Is Hadoop passé? How can existing Hadoop deployments be leveraged? What alternatives are available for Hadoop in the cloud?
I think that these questions can be fine-tuned to reflect today’s reality; it’s time to debunk the myth.
Hadoop and Spark are different platforms, each implementing various technologies that can work separately and together. Consequently, anyone trying to compare one to the other can be missing the larger picture.
Like any technology, both Hadoop and Spark have their benefits and challenges. But the fact is that more and more organizations are implementing both of them, using Hadoop for managing and performing big data analytics (MapReduce on huge amounts of data, not in real time) and Spark for ETL and SQL batch jobs across large datasets, processing of streaming data from sensors, IoT or financial systems, and machine learning tasks.
Is that enough for today’s big data analytics challenges, or is there another missing link?
What is Hadoop?
Hadoop is an open-source distributed big data processing framework that manages data processing and storage for big data applications running in clustered systems. Its storage layer, HDFS, is a distributed file system that holds data from different sources. Its architecture is based on a node-cluster system, with all data split into blocks and distributed across multiple nodes in a single Hadoop cluster. Consequently, Hadoop enables the storage of big data in a distributed environment so that it can be processed in parallel.
The main Hadoop components are:
- HDFS, a unit for storing big data across multiple nodes in a distributed fashion based on a master-slave architecture.
- NameNode, the master daemon that maintains and manages the DataNodes (slave nodes), recording the metadata of all the files stored in the cluster and every change performed on the file system metadata.
- DataNodes, the slave daemons running on each slave machine which store the actual data, serve read and write requests from clients and manage data blocks.
- YARN, which performs all processing activities by allocating resources and scheduling tasks through two major daemons – ResourceManager and NodeManager.
- ResourceManager, the cluster-level component of YARN that manages resources and schedules applications.
- NodeManager, a node-level component running on each slave machine that manages containers and monitors each container's resource utilization, node health and logs.
- MapReduce, which performs all the necessary computations and data processing across the Hadoop cluster.
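To make the MapReduce component more concrete, here is a minimal sketch of its three phases (map, shuffle, reduce) as a word count in plain Python. It runs in a single process and does not use Hadoop's API; the function names are my own, chosen for illustration only.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data on Hadoop", "big data on Spark"])
```

In a real cluster, each map call would run on the node that holds the relevant data block and the shuffle would move data across the network, which is where much of the batch latency comes from.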
What is Spark?
Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers. It can hold working data in memory, within the limits of the available RAM, so that distributed collections can be reused across the steps of an application. It does not include its own storage system and is therefore usually deployed on top of Hadoop or some other storage platform.
Spark’s data structure is based on Resilient Distributed Datasets (RDDs) – immutable distributed collections of objects which can contain any type of Python, Java or Scala objects, including user-defined classes. Each dataset is divided into logical partitions which may be computed on different nodes of the cluster.
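The idea of immutable, partitioned collections can be sketched in plain Python. This is not Spark's API: `partition` and `map_partitions` are hypothetical helpers, tuples stand in for immutability, and a thread pool stands in for the nodes of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, num_partitions):
    """Split records into num_partitions immutable chunks (round-robin)."""
    return tuple(tuple(records[i::num_partitions]) for i in range(num_partitions))

def map_partitions(partitions, func):
    """Apply func to every element, one worker per partition; returns new partitions."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return tuple(pool.map(lambda part: tuple(func(x) for x in part), partitions))

parts = partition(list(range(10)), 3)
squared = map_partitions(parts, lambda x: x * x)
```

Note that `map_partitions` never modifies its input; it produces a new set of partitions, which is the property that makes recomputation after a failure possible.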
The main Spark components are:
- Spark Core, the base engine for large-scale parallel and distributed data processing, responsible for memory management and fault recovery, scheduling, distributing and monitoring jobs on a cluster and interacting with storage systems.
- Spark Streaming for processing real-time streaming data, enabling high-throughput and fault-tolerant stream processing of live data streams.
- Spark SQL for integrating relational processing with the functional programming API.
- GraphX, an API for graphs and graph-parallel computation.
- MLlib, a library of machine learning algorithms and utilities.
Figure 1: Main Spark Components
The Difference Between Hadoop and Spark
Hadoop and Spark can work together and can also be used separately. That’s because while both deal with the handling of large volumes of data, they have differences. The main parameters for comparison between the two are presented in the following table:
|Parameter|Hadoop|Spark|
|---|---|---|
|Performance|Processing speed not a consideration; designed for huge distributed batch operations|Fast, distributed, near real-time analytics|
|Ease of Use|MapReduce has no interactive mode|Has an interactive mode, providing intermediate feedback for queries and actions|
|Cost|Stores data on disk; requires a lot of disk space, faster disks and multiple systems|Stores data in-memory; relies on data shuffling between memory spaces and Hadoop/disk|
|Data Processing|Batch processing|Batch, stream, iterative, interactive, graph|
|Fault Tolerance|Failed operations are rescheduled, which can significantly extend operation completion times|RDDs run in parallel. If an RDD is lost, it will automatically be recomputed by using the original transformations.|
|Security|Supports Kerberos and third-party directory services through LDAP|Supports shared-secret (password) authentication plus integration with HDFS, YARN and Kerberos|
Performance
Since Hadoop and Spark perform processing differently, it’s hard to compare them directly. However, it’s relevant to consider the processing speed of the two.
Speed was never a consideration in the development of Hadoop, which stores all types of data from multiple sources across a distributed environment and uses MapReduce for batch processing. It’s all about parallel processing over distributed datasets, rather than real-time processing.
Spark, on the other hand, processes data in memory wherever possible, which makes it fast and lets it deliver near real-time analytics. This much faster processing in comparison to Hadoop MapReduce makes it suitable for streaming analytics and relatively fast for batch jobs and queries on big data, helped by lazy execution and a DAG (directed acyclic graph) execution plan.
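Lazy execution is easier to see in a small example. The sketch below is plain Python, not Spark: `LazyDataset` is a hypothetical class that only records transformations and runs them in a single pass when an action (`collect`) is called. It models a linear chain of steps rather than a full DAG.

```python
class LazyDataset:
    """Records transformations and only runs them when an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of ("map" | "filter", function) pairs

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        """Action: execute the recorded plan in one pass over the source."""
        result = []
        for item in self._source:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result

calls = []

def traced_double(x):
    calls.append(x)  # lets us observe when the function actually runs
    return x * 2

plan = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: building the plan runs nothing
result = plan.collect()
```

Because the whole plan is known before anything runs, an engine can fuse steps and avoid writing intermediate results to disk, which is one reason this style is faster than a chain of separate MapReduce jobs.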
Ease of Use
Spark includes user-friendly APIs for Scala (its native language), Java, Python and R, as well as Spark SQL, which make it easy to use with a gentle learning curve. It also has an interactive mode that supports both developers and users. Hadoop MapReduce, on the other hand, has add-ons to support users, but no interactive mode.
Cost
Since both Hadoop and Spark are Apache open-source projects, the software is free of charge. Cost is therefore associated only with the infrastructure and enterprise-level management tools. In Hadoop, storage and processing are disk-based, requiring a lot of disk space, faster disks and multiple systems to distribute the disk I/O. Spark’s in-memory processing, on the other hand, needs only standard, relatively inexpensive disk speeds and space, but it requires large amounts of expensive RAM because processing is executed in memory rather than through disk I/O. This does not necessarily mean that Hadoop is more cost-effective, since Spark typically requires far fewer systems, although each one costs more.
Data Processing
Hadoop data processing is based on batch processing: working with high volumes of data collected over a period and processed at a later stage. Consequently, it’s ideal for processing large, static datasets, particularly archived/historical data, in order to determine trends and statistics over time. Spark handles batch workloads too, but it adds stream processing: the fast delivery of real-time information, which allows businesses to react quickly to changing business needs.
Fault Tolerance
Hadoop and Spark approach fault tolerance differently. Hadoop’s MapReduce (in its original, pre-YARN design) uses TaskTrackers that send heartbeats to the JobTracker. If a heartbeat is missed, all pending and in-progress operations on that node are rescheduled to another TaskTracker, which can significantly extend operation completion times. Spark uses Resilient Distributed Dataset (RDD) building blocks for fault tolerance. Operating in parallel, they can refer to any dataset present in external storage systems and shared file systems. Since they can persist data in memory across operations, they can make future actions much faster, often by more than 10 times. If an RDD partition is lost, it is automatically recomputed using the original transformations, but unless an intermediate result was persisted, the recomputation has to restart from the beginning.
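Recomputation from the original transformations can be sketched as follows. This is a simplified, single-machine illustration and not Spark's implementation: `LineageDataset` is a hypothetical class in which every dataset remembers its parent and the function that produced it.

```python
class LineageDataset:
    """Keeps its parent and transformation so lost results can be rebuilt."""

    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent
        self.transform = transform
        self.source = source
        self._cache = None
        self.compute_count = 0  # how many times this dataset was materialized

    def map(self, func):
        return LineageDataset(parent=self, transform=func)

    def compute(self):
        if self._cache is None:
            if self.parent is None:
                data = list(self.source)
            else:
                data = [self.transform(x) for x in self.parent.compute()]
            self._cache = data
            self.compute_count += 1
        return self._cache

    def lose(self):
        """Simulate losing the in-memory copy, for example when a node fails."""
        self._cache = None

base = LineageDataset(source=[1, 2, 3])
plus_one = base.map(lambda x: x + 1)
doubled = plus_one.map(lambda x: x * 2)

first = doubled.compute()
doubled.lose()              # the result is gone, but the lineage is not
second = doubled.compute()  # rebuilt from the parent plus the transformation
```

Here the parent is still cached, so only the last step is repeated. If the cached ancestors were lost as well, `compute` would walk the lineage all the way back to the source.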
Security
For authentication, Hadoop supports Kerberos, which can be difficult to manage, and can integrate with third-party directory services through LDAP (Lightweight Directory Access Protocol). It also offers encryption, traditional file permissions, access control lists and Service Level Authorization, which ensures that clients have the right permissions for job submission. Spark supports authentication through a shared secret. It can integrate with HDFS and use HDFS ACLs and file-level permissions, as well as run on YARN and Kubernetes, thereby leveraging the capability of Kerberos.
Who’s Adopting Which Technology?
Hadoop is geared for organizations where instant data analysis results are not required. Its batch processing is a good and economical solution for analyzing archived data, since it allows parallel and separate processing of huge amounts of data on different data nodes and the gathering of results from each node manager.
Spark is a good solution for organizations seeking near real-time/micro-batch analytics and machine learning. Its strength is in allowing in-memory processing and the support of streaming data with distributed processing, a combination that enables the delivery of near real-time data processing and analytics of millions of events per second. In comparison to Hadoop MapReduce, the Spark project has claimed speeds up to 100 times faster for data in RAM and up to 10 times faster for data in storage, which is why it’s well suited to timely business insights.
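Micro-batching means treating a stream as a sequence of small batches, one per time interval. The sketch below shows the grouping step in plain Python over a list of hypothetical `(timestamp, value)` events; a real streaming engine would do this continuously as events arrive rather than over a finished list.

```python
from collections import defaultdict

def micro_batches(events, interval_seconds):
    """Group (timestamp, value) events into consecutive fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        window_start = int(timestamp // interval_seconds) * interval_seconds
        batches[window_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]

# Hypothetical events: (seconds since start, amount)
events = [(0.2, 5), (0.9, 7), (1.1, 1), (2.5, 4), (2.7, 6)]
totals = [(start, sum(values)) for start, values in micro_batches(events, 1)]
```

The interval is the main tuning knob: a shorter interval lowers latency but adds scheduling overhead per batch.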
Where’s the Competition between the Two?
There’s actually no competition. Neither can replace the other and in actual fact, Hadoop and Spark complement each other. Both have features that the other does not possess. Hadoop brings huge datasets under control using commodity systems. Spark provides near real-time, in-memory processing for datasets.
Many organizations are combining the two – Hadoop’s low-cost operation on commodity hardware for disk-heavy operations with Spark’s more costly in-memory processing architecture for high-processing speed, advanced analytics and multiple integration support – to obtain better results.
But is that the end of the story? Far from it.
Is there a Caveat in the Hadoop-Spark Story?
Organizations typically collect operational and external data to a data store, such as Hadoop, where it is stored separately from the actual transactional data and is used for backward-looking analysis.
A good example is financial information collected and used by banks, such as credibility, account balance, duration of credit in months, history of previous credits, purpose of credit, savings accounts/bonds, personal status and gender.
With a standard architecture, an accurate process for providing loan approvals or other timely services to customers, such as real-time fraud prevention and next-best offers, takes a long time. As a result, the window of opportunity to stop a fraud or make an upsell is missed, and customers left waiting for a response to a loan request may look for other options in the meantime.
Spark on Streaming is Missing the Historical Data
When running analytics on real-time streaming data with Spark, relevant information from historical data on Hadoop or other data stores such as Amazon S3 or Azure Blob Storage is missing.
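A small example shows why the historical context matters. In the plain-Python sketch below, the customer names, amounts and the `factor` threshold are all made up for illustration; the point is that the same streaming event leads to different decisions depending on what history is available.

```python
# Hypothetical per-customer average spend, as it might be derived from historical data
historical_avg = {"alice": 40.0, "bob": 900.0}

def flag_suspicious(event, history, factor=5.0):
    """Flag a transaction that exceeds `factor` times the customer's historical average."""
    baseline = history.get(event["customer"])
    if baseline is None:
        return None  # no historical context: the event cannot be judged
    return event["amount"] > factor * baseline

stream = [
    {"customer": "alice", "amount": 350.0},
    {"customer": "bob", "amount": 350.0},
    {"customer": "carol", "amount": 350.0},
]
decisions = [flag_suspicious(event, historical_avg) for event in stream]
```

All three events carry the same amount, yet only one is flagged, one is cleared, and one cannot be decided at all because its history is missing.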
Spark on Hadoop is Still not Fast Enough
If you’re running Spark on immutable HDFS, you will face the challenge of analyzing time-sensitive data and will not be able to act at the moment of decision or improve operational efficiency. This is because:
- Data isn’t available for analytics in HDFS for at least an hour after it is created, due to long ETL processes
- Loading data to Spark memory from a data store takes a long time, even though it’s done in a lazy manner
In today’s fast-paced, competitive world, all this translates to performance that cannot meet the demands of services and applications. Relevant decisions cannot be made, since they are based on stale data and insights which may no longer be applicable.
While customers want to accelerate batch analytics, they also require event-driven analytics and machine learning in real-time for smarter insights that can be acted upon instantly.
So there are still challenges. Consider the following:
- What if we could unify transactional/operational data processing, advanced analytics and Artificial Intelligence for real-time analytics on streaming, hot and historical data?
- What if we could trigger event-driven, closed-loop analytics for immediate business impact?
- What if we could deploy Hadoop and Spark configurations in on-premises, cloud and even hybrid cloud environments?
- What if we could reduce long ETL processes and eliminate unnecessary data duplication?
This is where solutions such as GigaSpaces’ InsightEdge come into play, offering a range of added-value benefits.
Gone are the days when you had to choose between fast but simple or slow but smart insights. Today you can have the best of both worlds – smart and fast insights.
Spark can be optimized for real-time combined workloads on changing data, which is necessary for BI, analytics and ML real-time operational data. This can be accomplished with a unified transactional and analytical processing platform which can collocate the Spark framework with data and the business logic in the same memory space for real-time analytics processing on streaming, hot and historical data – executing at extreme performance and at scale.
Introducing InsightEdge: Benefits of Hadoop + Spark + InsightEdge Configuration
InsightEdge includes a Spark distribution and delivers a range of additional benefits that I’ll address here.
Figure 2: InsightEdge Platform
The InsightEdge platform addresses performance bottlenecks, high availability issues and data movement. It includes a Spark distribution, can connect to an external Spark cluster and leverages popular open-source frameworks, such as Spark for machine learning and Caffe, Torch and TensorFlow for deep learning. Unlike traditional databases, InsightEdge is powered by In-Memory Computing technology with a data grid that can handle massive workloads and processing of hot data, ultimately pushing the asynchronous replication of big-data stores, such as Hadoop, to the background and placing multiple petabytes in cold storage according to defined business rules.
With SQL-99 support and the full Spark dataframe/dataset API, InsightEdge’s data lifecycle management and analytical query tier is essentially a part of the data grid, leveraging shared RDDs, data frames and datasets on the live transactional data and historical data stored on Hadoop.
InsightEdge ensures that Spark runs even faster. It uses the Spark data source API to reduce the CPU and RAM resources required by Spark, as well as the network bandwidth between the client and the server. “Pushing down” predicates and aggregations to the InsightEdge data grid engine leverages the in-memory grid’s indexes, data modeling and customized aggregation power transparently to the user. In this way, the workload is divided behind the scenes between the data grid and Spark.
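The effect of pushing a predicate down can be illustrated without any real data grid. In the sketch below, `GridStore` is a hypothetical toy store with one indexed column; it is not the InsightEdge or Spark data source API. The comparison is between fetching every row and filtering on the client versus letting the store evaluate the filter.

```python
class GridStore:
    """Toy data store with an index on one column."""

    def __init__(self, rows, indexed_column):
        self.rows = rows
        self.indexed_column = indexed_column
        self.index = {}
        for row in rows:
            self.index.setdefault(row[indexed_column], []).append(row)

    def scan(self):
        """Return every row; the caller has to filter."""
        return list(self.rows)

    def query(self, column, value):
        """Evaluate an equality predicate inside the store, using the index if possible."""
        if column == self.indexed_column:
            return list(self.index.get(value, []))
        return [row for row in self.rows if row[column] == value]

rows = [{"id": i, "country": "US" if i % 4 == 0 else "DE"} for i in range(1000)]
store = GridStore(rows, "country")

# Without pushdown: all rows cross the wire, then the client filters
without_pushdown = store.scan()
client_filtered = [r for r in without_pushdown if r["country"] == "US"]

# With pushdown: the store filters, only matching rows are returned
with_pushdown = store.query("country", "US")
```

Both approaches give the same answer, but the pushed-down query transfers a quarter of the rows in this made-up dataset and never touches the non-matching ones.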
With InsightEdge’s AnalyticsXtreme module, interactive queries and machine learning models run simultaneously on both real-time mutable streaming data and historical data stored on data lakes such as Hadoop, without requiring a separate data load procedure or data duplication. Hadoop performance is accelerated by 100X.
Single Logical View
A unified API provides a single logical view of data that spans across real-time and hot data (ingested, processed and analyzed on InsightEdge) and historical data (stored on Hadoop), including SQL, Spark dataset/dataframe as well as BI tools like Tableau and Looker.
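Conceptually, a single logical view means the caller writes one query and does not need to know which tier holds the data. The following plain-Python sketch assumes two in-memory lists standing in for the hot and historical tiers; `unified_query` and the row fields are hypothetical and not part of any InsightEdge API.

```python
def unified_query(hot_rows, historical_rows, predicate):
    """Run one predicate across both tiers and return a single, time-ordered result."""
    matches = [row for row in historical_rows if predicate(row)]
    matches += [row for row in hot_rows if predicate(row)]
    return sorted(matches, key=lambda row: row["ts"])

historical = [{"ts": 1, "amount": 20}, {"ts": 2, "amount": 500}]  # e.g. stored on Hadoop
hot = [{"ts": 3, "amount": 700}, {"ts": 4, "amount": 15}]         # e.g. held in memory

large = unified_query(hot, historical, lambda row: row["amount"] >= 500)
```

The result mixes rows from both tiers, and nothing in the calling code reveals where each row came from.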
High Availability
InsightEdge provides Spark with high availability, ensuring that if a Spark executor fails (which is common in production because of out-of-memory exceptions), the whole process does not have to be restarted, because its state is stored in-memory for immediate recovery.
Continuous Machine Learning Model Training
In production, ML models must be continuously retrained and redeployed to adjust to the constantly changing conditions and environment in order to retain accuracy. In InsightEdge, such continuous machine learning is supported. Transactional data is automatically surfaced as RDDs or data frames, making the training data for the machine learning algorithm readily and effortlessly available as it is ingested from the organization’s web applications or other systems. This enables a continuous learning approach for calibrating statistical, analytical and predictive models for the required accuracy.
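The core idea of continuous learning is that the model is updated as each record arrives instead of being retrained from scratch. As a deliberately tiny stand-in for a real ML model, the sketch below keeps a running mean and variance using Welford's algorithm; the class name and input values are hypothetical.

```python
class RunningBaseline:
    """Toy 'model' that is updated one record at a time (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one newly ingested value into the model without a full retrain."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far (0.0 until two values exist)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

model = RunningBaseline()
for amount in [10, 12, 14]:  # hypothetical values arriving over time
    model.update(amount)
```

After every `update` the model is immediately usable, for example to score the next incoming value against the current mean and variance.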
Convergence of Multiple Data Types – No Data Movement
On its own, Spark is limited to loading data from the data store, performing transformations, and persisting the transformed data back to the data store. With the InsightEdge platform, the data, analytics and business logic are co-located, enabling Spark to make changes to the data directly without the need to move the data, thereby reducing the need for multiple data transformations and eliminating excessive data shuffling.
The integration of InsightEdge with Spark delivers a convergence of multiple data types, enabling access to data across all partitions and stateful data sharing across multiple Spark jobs such as analytics, streaming and machine learning.
Figure 3: Convergence of Multiple Data Types
So I hope that I have debunked the myth. It’s no longer about Hadoop or Spark, but about the integration of Hadoop and Spark with solutions like InsightEdge. Customers can now address speed, performance and scale challenges in their big data stacks. InsightEdge combines processing and analytics on streaming, transactional and historical data, enabling real-time decision-making based on historical context with sub-second latency, irrespective of whether the deployment is on-premises, in the cloud or in a hybrid cloud environment.