MapReduce API Framework
Definition
MapReduce is a programming model and processing framework for handling large datasets in
a distributed computing environment, built around the concepts of Map and Reduce functions.
Explanation
The MapReduce API framework is a core component of Hadoop that lets developers
write applications that process massive amounts of data in parallel across clusters.
It works by splitting the input data into independent chunks, processing them in parallel
(Map phase), and then aggregating results (Reduce phase).
The API provides interfaces such as:
o Mapper – Processes input key/value pairs and produces intermediate key/value
pairs.
o Reducer – Aggregates intermediate data to produce the final output.
o Driver – Manages job configuration and execution (a minimal driver sketch is shown below).
The framework handles scheduling, fault tolerance, data distribution, and result
collection.
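To make the Driver interface above concrete, here is a minimal, hedged sketch of a word-count driver using the org.apache.hadoop.mapreduce API. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative placeholders (the Mapper and Reducer placeholders are sketched later in these notes), not classes from any particular codebase.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // job configuration
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);                // placeholder Mapper class
        job.setReducerClass(WordCountReducer.class);              // placeholder Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit the job and wait
    }
}

Such a driver would typically be packaged into a JAR and submitted with the hadoop jar command, passing the input and output HDFS paths as arguments.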
Diagram
Input Data → InputFormat → Mapper (Map function applied) → Shuffle & Sort → Reducer (Reduce function applied) → OutputFormat → Final Output
1. Mapper Class
Definition
The Mapper class in MapReduce processes input key/value pairs and produces intermediate
key/value pairs.
Explanation
It is part of the org.apache.hadoop.mapreduce package.
Signature:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Key Methods:
o map(KEYIN key, VALUEIN value, Context context) – Contains the user-defined map logic.
o setup() – Runs before the map task starts (optional).
o cleanup() – Runs after the map task ends (optional).
Working:
o Reads input records one at a time (for text input, typically line by line).
o Processes and emits intermediate key/value pairs to the framework.
Example: Counting words in a text file.
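Continuing that word-count example, here is a hedged sketch of a Mapper, assuming TextInputFormat (so each input value is one line of text and the key is its byte offset); WordCountMapper is the illustrative name used in the driver sketch earlier.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word found in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key/value pair
        }
    }
}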
2. Reducer Class
Definition
The Reducer class in MapReduce processes intermediate key/value pairs from the Mapper and
produces the final output.
Explanation
It is part of the org.apache.hadoop.mapreduce package.
Signature:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
1. Big Data – Definition & Evolution
Definition
Big Data refers to extremely large and complex datasets that cannot be processed, stored, or
analyzed using traditional data processing tools due to their Volume, Variety, and Velocity.
Evolution of Big Data
1. Before 2000 – Traditional Data
o Data stored in relational databases (RDBMS).
o Small in size (MBs to GBs).
o Handled by tools like SQL.
2. 2000–2010 – Growth of Internet & Social Media
o Rapid increase in data from emails, websites, e-commerce.
o Introduction of distributed computing concepts like Google File System and
MapReduce.
3. 2010–Present – Big Data Era
o Explosion of data from social media, IoT, sensors, smartphones.
o Use of tools like Hadoop, Spark, NoSQL databases.
o AI and ML used for analyzing big datasets in real time.
2. Characteristics of Big Data (5 Vs)
1. Volume – Size of data generated (terabytes, petabytes, exabytes).
Example: Facebook stores petabytes of user data.
2. Velocity – Speed at which data is generated and processed.
Example: Stock market data streams in milliseconds.
3. Variety – Different types of data formats (structured, unstructured, semi-structured).
Example: Text, images, audio, videos.
4. Veracity – Accuracy and trustworthiness of the data.
Example: Duplicate or inconsistent records in user-entered data.
5. Value – Useful insights that can be extracted from the data.
Example: Purchase history analyzed to generate product recommendations.
1. Data Ingest with Flume
Definition
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (especially log data) into HDFS.
Explanation
Used to ingest data from multiple sources like web servers, social media feeds, application logs, and IoT devices.
Data flow in Flume is built around the Source → Channel → Sink architecture:
o Source – Captures data from an external system (e.g., log files).
o Channel – Temporary storage buffer (e.g., memory or file-based).
o Sink – Delivers data to the final destination like HDFS.
Example: Collecting Twitter streaming data and storing it in HDFS for analysis.
Diagram
Data Source → Flume Source → Flume Channel → Flume Sink → HDFS
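As a rough, illustrative sketch of this Source → Channel → Sink pipeline, a Flume agent could be configured along the following lines to tail a log file into HDFS. The agent name agent1, the log path, and the NameNode address are made-up placeholders; exact properties should be checked against the Flume documentation.

# Hypothetical agent "agent1": tail a log file and deliver events to HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: run a command and capture its output as events (placeholder path)
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write events to HDFS (placeholder NameNode address)
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
agent1.sinks.sink1.channel = ch1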
Key Methods:
o reduce(KEYIN key, Iterable<VALUEIN> values, Context context) – Contains the user-defined reduce logic.
o setup() – Runs before the reduce task starts (optional).
o cleanup() – Runs after the reduce task ends (optional).
Working:
o Receives all values for a specific key.
o Aggregates them (sum, average, max, etc.) and emits the final key/value pair.
Example: Summing the count of each word from Mapper output.
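Continuing the word-count illustration, here is a hedged sketch of the corresponding Reducer; WordCountReducer is the illustrative name used in the driver sketch earlier, not a class from any particular codebase.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted by the Mapper for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);   // final key/value pair
    }
}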
Simple Flow
Input Data → Mapper → Shuffle & Sort → Reducer → Output Data
Advantages
1. Scalability – Can process terabytes or petabytes of data.
2. Fault Tolerance – Automatically reprocesses failed tasks.
3. Simplicity – Developers only focus on Map & Reduce logic.
4. Parallel Processing – High performance on large datasets.
Disadvantages
1. High Latency – Not suitable for real-time processing.
2. Overhead – Significant I/O between Map and Reduce stages.
3. Complex Debugging – Distributed nature makes debugging harder.
4. Less Efficient for Small Data – Setup cost is high for small tasks.
Real-World Use
Used by Google, Yahoo, Facebook for web indexing, log analysis, recommendation
systems, and data mining.
2. Data Ingest with Sqoop
Definition
Apache Sqoop is a tool designed for transferring bulk data between Hadoop and relational databases (like MySQL, Oracle, PostgreSQL).
Explanation
Mainly used for batch data ingestion from RDBMS to Hadoop and vice versa.
Sqoop works by:
o Importing data from relational databases into HDFS, Hive, or HBase.
o Exporting data from Hadoop to relational databases.
Uses MapReduce jobs internally to perform data transfer in parallel.
Example: Importing sales records from MySQL into HDFS for big data analysis.
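As an illustration, an import of such a table might be run with a command along these lines; the JDBC URL, credentials, table name, and target directory below are placeholders.

# Import the "sales" table from MySQL into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username dbuser \
  --password dbpass \
  --table sales \
  --target-dir /user/hadoop/sales \
  -m 4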
Hadoop Framework
Definition
Hadoop is an open-source framework by the Apache Software Foundation used for storing and
processing large datasets in a distributed computing environment, using commodity hardware.
Explanation
It follows the Master–Slave architecture.
Stores data using HDFS (Hadoop Distributed File System).
Processes data using the MapReduce programming model.
Designed for scalability, fault tolerance, and high throughput.
Can handle both structured and unstructured data.
Components of Hadoop
1. HDFS – Stores large files across multiple machines with replication (see the client sketch after this list).
2. MapReduce – Processes data in parallel using Map and Reduce tasks.
3. YARN (Yet Another Resource Negotiator) – Manages cluster resources and job
scheduling.
4. Common Utilities – Java libraries and OS-level abstraction.
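To ground the HDFS component above, here is a hedged sketch that uses the HDFS Java client API to copy a local file into the cluster and list the target directory; the NameNode address and file paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (placeholder paths)
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // List the target directory to confirm the upload
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}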
Features of Hadoop
1. Open Source – Free to use and customizable.
2. Scalability – Can scale from a single node to thousands of nodes.
3. Fault Tolerance – Data is replicated across nodes to prevent loss.
4. High Throughput – Processes large volumes of data efficiently.
Advantages of Hadoop
1. Cost-Effective – Uses low-cost commodity hardware.
2. Handles Big Data Easily – Stores and processes petabytes of data.
3. Flexibility – Can store and process structured & unstructured data.
4. Fast Processing – Parallel processing reduces execution time.