Compare Hadoop vs. Spark vs. Kafka For Your Big Data Strategy
Big data became popular about a decade ago. The falling cost of storage led many
enterprises to retain much of the data they ingested or generated so they could mine it for
key business insights.
Analyzing all that data has driven the development of a variety of big data frameworks
capable of sifting through masses of data, starting with Hadoop. Big data frameworks were
initially used for data at rest in a data warehouse or data lake, but a more recent trend is to
process data in real time as it streams in from multiple sources.
A big data framework is a collection of software components that can be used to build a
distributed system for the processing of large data sets, comprising structured,
semistructured or unstructured data. These data sets can be from multiple sources and
range in size from terabytes to petabytes to exabytes.
Such frameworks often play a part in high-performance computing (HPC), a technology that
can address difficult problems in fields as diverse as materials science, engineering or
financial modeling. Finding answers to these problems often lies in sifting through as much
relevant data as possible.
The most well-known big data framework is Apache Hadoop. Other big data frameworks
include Spark, Kafka, Storm and Flink, which are all -- along with Hadoop -- open source
projects developed by the Apache Software Foundation. Apache Hive, originally developed
by Facebook, is also a big data framework.
The chief components of Apache Hadoop are the Hadoop Distributed File System (HDFS) and
a data processing engine that implements the MapReduce programming model, in which a map
step filters and sorts data and a reduce step aggregates the results. Also included is YARN,
a resource manager for the Hadoop cluster.
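The MapReduce model can be illustrated with a minimal single-process sketch in plain Python. This is not Hadoop's API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration. In a real cluster, each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle/sort: group all emitted values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))


# Hypothetical sample input standing in for files stored on HDFS.
sample_lines = ["big data big insights", "data at rest"]
counts = word_count(sample_lines)
```

Here `counts` maps each distinct word to the number of times it appears across all input lines.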
Apache Spark can also run on HDFS or an alternative distributed file system. It was
developed to perform faster than MapReduce by processing and retaining data in memory
for subsequent steps, rather than writing results straight back to storage. This can make
Spark up to 100 times faster than Hadoop for smaller workloads.
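The difference between the two approaches can be sketched with a toy pipeline in plain Python. It uses neither Spark nor Hadoop; the step functions and the names `run_via_disk` and `run_in_memory` are hypothetical. The sketch only counts storage round trips and makes no claim about actual speedups.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def add_one(values):
    return [v + 1 for v in values]


# Hypothetical multi-step job: each step consumes the previous step's output.
STEPS = [double, add_one, double]


def run_via_disk(values, steps, workdir):
    """Write every step's output to a file and read it back for the next step."""
    disk_writes = 0
    path = os.path.join(workdir, "step_0.json")
    with open(path, "w") as f:
        json.dump(values, f)
    disk_writes += 1
    for i, step in enumerate(steps, start=1):
        with open(path) as f:
            current = json.load(f)
        path = os.path.join(workdir, f"step_{i}.json")
        with open(path, "w") as f:
            json.dump(step(current), f)
        disk_writes += 1
    with open(path) as f:
        return json.load(f), disk_writes


def run_in_memory(values, steps):
    """Keep intermediate results in memory between steps."""
    current = values
    for step in steps:
        current = step(current)
    return current, 0


data = [1, 2, 3]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_via_disk(data, STEPS, workdir)
memory_result, memory_writes = run_in_memory(data, STEPS)
```

Both runs produce the same result, but the disk-based run pays for a write and a read at every step, which is the overhead that in-memory processing avoids.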
However, Hadoop MapReduce can work with much larger data sets than Spark, especially
those where the size of the entire data set exceeds available memory. If an organization has
a very large volume of data and processing is not time-sensitive, Hadoop may be the better
choice.
Spark is better for applications where an organization needs answers quickly, such as those
involving iterative or graph processing. Graph processing, also known as network analysis,
analyzes relations among entities such as customers and products.
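As a rough illustration of iterative graph processing, the sketch below groups customers and products into connected clusters by repeatedly passing over the edges until nothing changes. It is plain Python rather than Spark code, and the function name `connected_components` and the purchase data are assumptions. Repeated passes over the same data are the kind of workload that benefits from keeping data in memory.

```python
def connected_components(edges):
    """Iteratively propagate the smallest label across edges until stable."""
    labels = {}
    for a, b in edges:
        labels.setdefault(a, a)  # every node starts in its own component
        labels.setdefault(b, b)
    iterations = 0
    changed = True
    while changed:
        changed = False
        iterations += 1
        for a, b in edges:
            smallest = min(labels[a], labels[b])
            if labels[a] != smallest or labels[b] != smallest:
                labels[a] = labels[b] = smallest
                changed = True
    return labels, iterations


# Hypothetical relations: each edge links a customer to a purchased product.
purchases = [
    ("customer:ana", "product:laptop"),
    ("customer:ben", "product:laptop"),
    ("customer:ben", "product:mouse"),
    ("customer:cy", "product:desk"),
]
labels, iterations = connected_components(purchases)
```

Nodes that end up with the same label belong to the same cluster, so customers in one cluster are linked through shared products.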
Apache Kafka is a distributed event streaming platform designed to process real-time data
feeds. This means data is processed as it passes through the system.
Like Hadoop, Kafka runs on a cluster of server nodes, making it scalable. Some server nodes,
called brokers, form the storage layer, while others run Kafka Connect to handle the
continuous import and export of data streams.
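The storage role of a broker can be pictured as an append-only log per topic, from which each consumer reads at its own offset. The sketch below is a toy in-memory model in plain Python, not the Kafka client API. The class `ToyBroker`, its methods and the topic names are hypothetical, and it leaves out partitions, replication and persistence.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker: one append-only log per topic."""

    def __init__(self):
        self._logs = defaultdict(list)

    def produce(self, topic, event):
        """Append an event to the topic's log and return its offset."""
        log = self._logs[topic]
        log.append(event)
        return len(log) - 1

    def consume(self, topic, offset, max_events=10):
        """Return events starting at offset, plus the next offset to read."""
        batch = self._logs[topic][offset:offset + max_events]
        return batch, offset + len(batch)


broker = ToyBroker()
broker.produce("page_views", {"user": "ana", "url": "/home"})
broker.produce("sensor_readings", {"device": "t-17", "celsius": 21.5})
broker.produce("page_views", {"user": "ben", "url": "/pricing"})

# A consumer tracks its own position in the topic and resumes from there.
offset = 0
first_batch, offset = broker.consume("page_views", offset, max_events=1)
second_batch, offset = broker.consume("page_views", offset)
empty_batch, offset = broker.consume("page_views", offset)
```

Because events are appended rather than removed on read, several consumers could read the same topic independently, each with its own offset.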
Strictly speaking, Kafka is not a rival platform to Hadoop. Organizations can use it alongside
Hadoop as part of an overall application architecture where it handles and feeds incoming
data streams into a data lake for a framework, such as Hadoop, to process.
Because of its ability to handle thousands of messages per second, Kafka is useful for
applications such as website activity tracking or telemetry data collection in large-scale IoT
deployments.
WHAT IS THE DIFFERENCE BETWEEN KAFKA AND SPARK?
Apache Spark is a general processing engine developed to perform both batch processing --
similar to MapReduce -- and workloads such as streaming, interactive queries and machine
learning (ML).
Kafka was originally developed at social network LinkedIn to handle the high volumes of
activity and log data generated by its millions of users. It is perhaps best viewed as a
framework capable of capturing data in real time from numerous sources and sorting it into
topics to be analyzed for insights into the data.
That analysis is likely to be performed using a tool such as Spark, which is a cluster
computing framework that can execute code developed in languages such as Java, Python or
Scala. Spark also includes Spark SQL, which provides support for querying structured and
semistructured data; and Spark MLlib, a machine learning library for building and operating
ML pipelines.
OTHER BIG DATA FRAMEWORKS
Here are some other big data frameworks that might be of interest.
Apache Hive enables SQL developers to query and analyze data using Hive Query Language
(HQL) statements, which are similar to standard SQL. Hive can run on HDFS and is best
suited for data warehousing tasks, such as extract, transform and load (ETL), reporting
and data analysis.
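The flavor of a SQL-style ETL and reporting task can be sketched with Python's built-in `sqlite3` module standing in for Hive. This is a swapped-in tool chosen so the example runs anywhere: it does not use Hive or HDFS, and the table name, columns and sample rows are assumptions. The query is ordinary SQL, and it has not been run against Hive here.

```python
import sqlite3

# Hypothetical raw extract: (day, region, amount as text).
raw_orders = [
    ("2024-01-05", "emea", "19.99"),
    ("2024-01-05", "apac", "5.00"),
    ("2024-01-06", "emea", "10.01"),
    ("2024-01-06", "apac", "not-a-number"),  # malformed row, dropped below
]


def transform(rows):
    """Transform: normalize the region and skip rows whose amount fails to parse."""
    for day, region, amount in rows:
        try:
            yield day, region.upper(), float(amount)
        except ValueError:
            continue


# Load: write the cleaned rows into an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (day TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", transform(raw_orders))

# Report: order count and revenue per region.
report = conn.execute(
    "SELECT region, COUNT(*), ROUND(SUM(amount), 2) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

Each row of `report` holds a region, its order count and its total revenue, which is the kind of aggregate a data warehousing job typically produces.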
Apache Flink combines stateful stream processing with the ability to handle ETL and batch
processing jobs. This makes it a good fit for event-driven workloads, such as user interactions
on websites or online purchase orders. Like Hive, Flink can run on HDFS or other data storage
layers.
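Stateful stream processing means the processor remembers something between events, such as a running total per customer. The sketch below shows that idea in plain Python. It is not Flink's API; the function name `running_totals` and the order events are hypothetical, and a real system would also checkpoint the state so it survives failures.

```python
from collections import defaultdict


def running_totals(events):
    """Process events one at a time, keeping per-customer state between events."""
    state = defaultdict(float)  # keyed state: customer -> total spent so far
    for event in events:
        state[event["customer"]] += event["amount"]
        yield event["customer"], state[event["customer"]]


# Hypothetical stream of online purchase orders, in arrival order.
orders = [
    {"customer": "ana", "amount": 20.0},
    {"customer": "ben", "amount": 5.0},
    {"customer": "ana", "amount": 10.0},
]
updates = list(running_totals(orders))
```

Each incoming order immediately yields an updated total for that customer, rather than waiting for a batch job to recompute totals over all stored orders.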