MapReduce API Framework
Definition
MapReduce is a programming model and processing framework for handling large datasets in
a distributed computing environment, built around the concepts of Map and Reduce functions.
Explanation
The MapReduce API framework is a core component of Hadoop that lets developers
write applications that process massive amounts of data in parallel across clusters.
It works by splitting the input data into independent chunks, processing them in parallel
(Map phase), and then aggregating results (Reduce phase).
The API provides interfaces such as:
o Mapper – Processes input key/value pairs and produces intermediate key/value
pairs.
o Reducer – Aggregates intermediate data to produce the final output.
o Driver – Manages job configuration and execution (a minimal driver sketch is shown below).
The framework handles scheduling, fault tolerance, data distribution, and result
collection.
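To make the Driver interface above concrete, here is a minimal, hedged sketch of a word-count driver using the org.apache.hadoop.mapreduce API. The class names WordCountDriver, WordCountMapper, and WordCountReducer are illustrative placeholders (the Mapper and Reducer placeholders are sketched later in these notes), not classes from any particular codebase.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");           // job configuration
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);                // placeholder Mapper class
        job.setReducerClass(WordCountReducer.class);              // placeholder Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output path in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);         // submit the job and wait
    }
}

Such a driver would typically be packaged into a JAR and submitted with the hadoop jar command, passing the input and output HDFS paths as arguments.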
Diagram
Input Data → InputFormat → Mapper (Map function applied) → Shuffle & Sort → Reducer (Reduce function applied) → OutputFormat → Final Output
1. Mapper Class
Definition
The Mapper class in MapReduce processes input key/value pairs and produces intermediate
key/value pairs.
Explanation
It is part of the org.apache.hadoop.mapreduce package.
Signature:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Key Methods:
o map(KEYIN key, VALUEIN value, Context context) – Contains the user-defined map logic.
o setup() – Runs before the map task starts (optional).
o cleanup() – Runs after the map task ends (optional).
Working:
o Reads input records one at a time (for text input, typically line by line).
o Processes and emits intermediate key/value pairs to the framework.
Example: Counting words in a text file.
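Continuing that word-count example, here is a hedged sketch of a Mapper, assuming TextInputFormat (so each input value is one line of text and the key is its byte offset); WordCountMapper is the illustrative name used in the driver sketch earlier.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word found in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key/value pair
        }
    }
}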
2. Reducer Class
Definition
The Reducer class in MapReduce processes intermediate key/value pairs from the Mapper and
produces the final output.
Explanation
It is part of the org.apache.hadoop.mapreduce package.
Signature:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
1. Big Data – Definition & Evolution
Definition
Big Data refers to extremely large and complex datasets that cannot be processed, stored, or
analyzed using traditional data processing tools due to their Volume, Variety, and Velocity.
Evolution of Big Data
1. Before 2000 – Traditional Data
o Data stored in relational databases (RDBMS).
o Small in size (MBs to GBs).
o Handled by tools like SQL.
2. 2000–2010 – Growth of Internet & Social Media
o Rapid increase in data from emails, websites, e-commerce.
o Introduction of distributed computing concepts like Google File System and
MapReduce.
3. 2010–Present – Big Data Era
o Explosion of data from social media, IoT, sensors, smartphones.
o Use of tools like Hadoop, Spark, NoSQL databases.
o AI and ML used for analyzing big datasets in real time.
2. Characteristics of Big Data (5 Vs)
1. Volume – Size of data generated (terabytes, petabytes, exabytes).
Example: Facebook stores petabytes of user data.
2. Velocity – Speed at which data is generated and processed.
Example: Stock market data streams in milliseconds.
3. Variety – Different types of data formats (structured, unstructured, semi-structured).
Example: Text, images, audio, videos.
4. Veracity – Accuracy and trustworthiness of the data.
Example: Duplicate or inconsistent records in user-entered data.
5. Value – Useful insights that can be extracted from the data.
Example: Purchase history analyzed to generate product recommendations.
1. Data Ingest with Flume
Definition
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data (especially log data) into HDFS.
Explanation
Used to ingest data from multiple sources like web servers, social media feeds, application logs, and IoT devices.
Data flow in Flume is built around the Source → Channel → Sink architecture:
o Source – Captures data from an external system (e.g., log files).
o Channel – Temporary storage buffer (e.g., memory or file-based).
o Sink – Delivers data to the final destination like HDFS.
Example: Collecting Twitter streaming data and storing it in HDFS for analysis.
Diagram
Data Source → Flume Source → Flume Channel → Flume Sink → HDFS
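As a rough, illustrative sketch of this Source → Channel → Sink pipeline, a Flume agent could be configured along the following lines to tail a log file into HDFS. The agent name agent1, the log path, and the NameNode address are made-up placeholders; exact properties should be checked against the Flume documentation.

# Hypothetical agent "agent1": tail a log file and deliver events to HDFS
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Source: run a command and capture its output as events (placeholder path)
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app.log
agent1.sources.src1.channels = ch1

# Channel: in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: write events to HDFS (placeholder NameNode address)
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
agent1.sinks.sink1.channel = ch1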
Key Methods:
o reduce(KEYIN key, Iterable<VALUEIN> values, Context context) – Contains the user-defined reduce logic.
o setup() – Runs before the reduce task starts (optional).
o cleanup() – Runs after the reduce task ends (optional).
Working:
o Receives all values for a specific key.
o Aggregates them (sum, average, max, etc.) and emits the final key/value pair.
Example: Summing the count of each word from Mapper output.
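Continuing the word-count illustration, here is a hedged sketch of the corresponding Reducer; WordCountReducer is the illustrative name used in the driver sketch earlier, not a class from any particular codebase.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts emitted by the Mapper for each word.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);   // final key/value pair
    }
}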
Simple Flow
Input Data → Mapper → Shuffle & Sort → Reducer → Output Data
Advantages
1. Scalability – Can process terabytes or petabytes of data.
2. Fault Tolerance – Automatically reprocesses failed tasks.
3. Simplicity – Developers only focus on Map & Reduce logic.
4. Parallel Processing – High performance on large datasets.
Disadvantages
1. High Latency – Not suitable for real-time processing.
2. Overhead – Significant I/O between Map and Reduce stages.
3. Complex Debugging – Distributed nature makes debugging harder.
4. Less Efficient for Small Data – Setup cost is high for small tasks.
Real-World Use
Used by Google, Yahoo, Facebook for web indexing, log analysis, recommendation
systems, and data mining.
2. Data Ingest with Sqoop
Definition
Apache Sqoop is a tool designed for transferring bulk data between Hadoop and relational databases (like MySQL, Oracle, PostgreSQL).
Explanation
Mainly used for batch data ingestion from RDBMS to Hadoop and vice versa.
Sqoop works by:
o Importing data from relational databases into HDFS, Hive, or HBase.
o Exporting data from Hadoop to relational databases.
Uses MapReduce jobs internally to perform data transfer in parallel.
Example: Importing sales records from MySQL into HDFS for big data analysis.
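As an illustration, an import of such a table might be run with a command along these lines; the JDBC URL, credentials, table name, and target directory below are placeholders.

# Import the "sales" table from MySQL into HDFS using 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username dbuser \
  --password dbpass \
  --table sales \
  --target-dir /user/hadoop/sales \
  -m 4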
Hadoop Framework
Definition
Hadoop is an open-source framework by the Apache Software Foundation used for storing and
processing large datasets in a distributed computing environment, using commodity hardware.
Explanation
It follows the Master–Slave architecture.
Stores data using HDFS (Hadoop Distributed File System).
Processes data using the MapReduce programming model.
Designed for scalability, fault tolerance, and high throughput.
Can handle both structured and unstructured data.
Components of Hadoop
1. HDFS – Stores large files across multiple machines with replication (see the client sketch after this list).
2. MapReduce – Processes data in parallel using Map and Reduce tasks.
3. YARN (Yet Another Resource Negotiator) – Manages cluster resources and job
scheduling.
4. Common Utilities – Java libraries and OS-level abstraction.
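To ground the HDFS component above, here is a hedged sketch that uses the HDFS Java client API to copy a local file into the cluster and list the target directory; the NameNode address and file paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // placeholder NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (placeholder paths)
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"), new Path("/data/sales.csv"));

        // List the target directory to confirm the upload
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}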
Features of Hadoop
1. Open Source – Free to use and customizable.
2. Scalability – Can scale from a single node to thousands of nodes.
3. Fault Tolerance – Data is replicated across nodes to prevent loss.
4. High Throughput – Processes large volumes of data efficiently.
Advantages of Hadoop
1. Cost-Effective – Uses low-cost commodity hardware.
2. Handles Big Data Easily – Stores and processes petabytes of data.
3. Flexibility – Can store and process structured & unstructured data.
4. Fast Processing – Parallel processing reduces execution time.