Bigdata All Mid-1

MapReduce API Framework

Definition

MapReduce is a programming model and processing framework for handling large datasets in
a distributed computing environment, built around the concepts of the Map and Reduce functions.

Explanation

•	The MapReduce API framework is a core component of Hadoop that allows developers to write applications for processing massive amounts of data in parallel across clusters.
•	It works by splitting the input data into independent chunks, processing them in parallel (Map phase), and then aggregating the results (Reduce phase).
•	The API provides interfaces such as:
o	Mapper – Processes input key/value pairs and produces intermediate key/value pairs.
o	Reducer – Aggregates intermediate data to produce the final output.
o	Driver – Manages job configuration and execution (a minimal driver sketch follows this list).
•	The framework handles scheduling, fault tolerance, data distribution, and result collection.
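A minimal driver sketch for the classic word-count job is shown below. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative (the Mapper and Reducer are sketched in their own sections later); the input and output paths are taken from the command line.

// Minimal word-count driver sketch. Class names are illustrative; it assumes the
// WordCountMapper and WordCountReducer classes sketched later are on the classpath.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");       // job name is arbitrary

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);            // Map phase logic
        job.setReducerClass(WordCountReducer.class);          // Reduce phase logic

        job.setOutputKeyClass(Text.class);                    // final key type
        job.setOutputValueClass(IntWritable.class);           // final value type

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory (e.g., in HDFS)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)

        System.exit(job.waitForCompletion(true) ? 0 : 1);     // submit and wait for completion
    }
}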

Diagram

Input Data
   ↓
InputFormat
   ↓
Mapper (Map function applied)
   ↓
Shuffle & Sort Phase
   ↓
Reducer (Reduce function applied)
   ↓
OutputFormat
   ↓
Final Output

Advantages

1. Scalability – Can process terabytes or petabytes of data.

2. Fault Tolerance – Automatically reprocesses failed tasks.

3. Simplicity – Developers only focus on the Map & Reduce logic.

4. Parallel Processing – High performance on large datasets.

Disadvantages

1. High Latency – Not suitable for real-time processing.

2. Overhead – Significant I/O between the Map and Reduce stages.

3. Complex Debugging – The distributed nature makes debugging harder.

4. Less Efficient for Small Data – Setup cost is high for small tasks.

Real-World Use

•	Used by Google, Yahoo, and Facebook for web indexing, log analysis, recommendation systems, and data mining.


1. Mapper Class
Definition

The Mapper class in MapReduce processes input key/value pairs and produces intermediate
key/value pairs.

Explanation

•	It is part of the org.apache.hadoop.mapreduce package.
•	Signature (Java):

public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

•	Key Methods:
o	map(KEYIN key, VALUEIN value, Context context) – Contains the user-defined map logic.
o	setup() – Runs before the map task starts (optional).
o	cleanup() – Runs after the map task ends (optional).
•	Working:
o	Reads input data line by line (or in blocks).
o	Processes and emits intermediate key/value pairs to the framework.

Example: Counting words in a text file.
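A minimal sketch of such a word-count Mapper is shown below. The class name WordCountMapper is illustrative; with TextInputFormat, the input key is the byte offset of the line and the value is the line itself.

// Word-count Mapper sketch (illustrative class name).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line, value = the line text (TextInputFormat).
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);    // emit (word, 1) as an intermediate pair
        }
    }
}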

2. Reducer Class
Definition

The Reducer class in MapReduce processes intermediate key/value pairs from the Mapper and
produces the final output.

Explanation

•	It is part of the org.apache.hadoop.mapreduce package.
•	Signature (Java):

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>

•	Key Methods:
o	reduce(KEYIN key, Iterable<VALUEIN> values, Context context) – Contains the user-defined reduce logic.
o	setup() – Runs before the reduce task starts (optional).
o	cleanup() – Runs after the reduce task ends (optional).
•	Working:
o	Receives all values for a specific key.
o	Aggregates them (sum, average, max, etc.) and emits the final key/value pair.

Example: Summing the count of each word from the Mapper output.

Simple Flow

Input Data → Mapper → Shuffle & Sort → Reducer → Output Data

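Continuing the word-count example, a minimal Reducer sketch is shown below. The class name WordCountReducer is illustrative and matches the Mapper sketch above.

// Word-count Reducer sketch (illustrative class name).
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // After the shuffle & sort phase, all counts for this word arrive together.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        total.set(sum);
        context.write(key, total);       // emit (word, total count)
    }
}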
1. Big Data – Definition & Evolution

Definition

Big Data refers to extremely large and complex datasets that cannot be processed, stored, or
analyzed using traditional data processing tools due to their Volume, Variety, and Velocity.

Evolution of Big Data

1. Before 2000 – Traditional Data

o Data stored in relational databases (RDBMS).

o Small in size (MBs to GBs).

o Handled by tools like SQL.

2. 2000–2010 – Growth of Internet & Social Media

o Rapid increase in data from emails, websites, e-commerce.

o	Introduction of distributed computing concepts like the Google File System and MapReduce.

3. 2010–Present – Big Data Era

o Explosion of data from social media, IoT, sensors, smartphones.

o Use of tools like Hadoop, Spark, NoSQL databases.

o AI and ML used for analyzing big datasets in real time.

2. Characteristics of Big Data (5 Vs)

1. Volume – Size of data generated (terabytes, petabytes, exabytes).

Example: Facebook stores petabytes of user data.

2. Velocity – Speed at which data is generated and processed.

Example: Stock market data streams in milliseconds.

3. Variety – Different types of data formats (structured, unstructured, semi-structured).

Example: Text, images, audio, videos.

4. Veracity – Accuracy and trustworthiness of the data.

Example: Social media posts may contain incomplete or misleading information.

5. Value – Usefulness of the insights that can be extracted from the data.

Example: Analyzing purchase history to generate product recommendations.
1. Data Ingest with Flume
Definition

Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (especially log data) into HDFS.

Explanation

•	Used to ingest data from multiple sources like web servers, social media feeds, application logs, and IoT devices.
•	Data flow in Flume is built around the Source → Channel → Sink architecture:
o	Source – Captures data from an external system (e.g., log files).
o	Channel – Temporary storage buffer (e.g., memory- or file-based).
o	Sink – Delivers data to the final destination, such as HDFS.

Example: Collecting Twitter streaming data and storing it in HDFS for analysis.

Diagram

Data Source → Flume Source → Flume Channel → Flume Sink → HDFS
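The Flume agent itself is normally configured through a properties file rather than code. For applications that need to push events into a running agent programmatically, Flume also provides a Java client SDK; a minimal sketch is shown below, assuming an Avro source is listening on localhost:41414 (host, port, and the event body are illustrative).

// Sketch of sending an event to a Flume agent's Avro source via the client SDK.
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to the agent's Avro source; the Source → Channel → Sink pipeline
        // then carries the event to HDFS as configured on the agent.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
        try {
            Event event = EventBuilder.withBody("sample log line", StandardCharsets.UTF_8);
            client.append(event);   // hand the event to the Flume Source
        } finally {
            client.close();
        }
    }
}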



2. Data Ingest with Sqoop

Definition

Apache Sqoop is a tool designed for transferring bulk data between Hadoop and relational databases (like MySQL, Oracle, PostgreSQL).

Explanation

•	Mainly used for batch data ingestion from an RDBMS into Hadoop and vice versa.
•	Sqoop works by:
o	Importing data from relational databases into HDFS, Hive, or HBase.
o	Exporting data from Hadoop to relational databases.
•	Uses MapReduce jobs internally to perform the data transfer in parallel.

Example: Importing sales records from MySQL into HDFS for big data analysis.
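For illustration, a typical import can be launched from the command line as shown below; the connection URL, credentials, table name, and target directory are placeholder values.

sqoop import --connect jdbc:mysql://dbhost/salesdb --table sales --username dbuser -P --target-dir /user/hadoop/sales -m 4

Sqoop translates such a command into a map-only MapReduce job, with -m controlling how many parallel mappers read from the table.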
Hadoop Framework

Definition

Hadoop is an open-source framework by the Apache Software Foundation used for storing and
processing large datasets in a distributed computing environment, using commodity hardware.

Explanation

•	It follows the Master–Slave architecture.

•	Stores data using HDFS (Hadoop Distributed File System).

•	Processes data using the MapReduce programming model.

•	Designed for scalability, fault tolerance, and high throughput.

•	Can handle both structured and unstructured data.

Components of Hadoop

1. HDFS – Stores large files across multiple machines with replication (see the sketch after this list).

2. MapReduce – Processes data in parallel using Map and Reduce tasks.

3. YARN (Yet Another Resource Negotiator) – Manages cluster resources and job
scheduling.

4. Common Utilities – Java libraries and OS-level abstraction.
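
The HDFS component can be used directly from Java through the FileSystem API. A minimal sketch is shown below; the file paths are illustrative, and the cluster location is assumed to come from the standard configuration files (core-site.xml / hdfs-site.xml) on the classpath.

// Minimal HDFS client sketch: copy a local file into HDFS and read it back.
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                  // handle to the configured file system

        Path local = new Path("/tmp/sample.txt");              // illustrative local file
        Path remote = new Path("/user/data/sample.txt");       // illustrative HDFS path

        fs.copyFromLocalFile(local, remote);                   // upload; HDFS replicates the blocks

        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(remote)))) {
            System.out.println(reader.readLine());             // read the first line back
        }
        fs.close();
    }
}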

Features of Hadoop

1. Open Source – Free to use and customizable.

2. Scalability – Can scale from a single node to thousands of nodes.

3. Fault Tolerance – Data is replicated across nodes to prevent loss.

4. High Throughput – Processes large volumes of data efficiently.

Advantages of Hadoop

1. Cost-Effective – Uses low-cost commodity hardware.


2. Handles Big Data Easily – Stores and processes petabytes of data.

3. Flexibility – Can store and process structured & unstructured data.



4. Fast Processing – Parallel processing reduces execution time.
