Apache Hadoop and Spark: Introduction
and Use Cases for Data Analysis
Afzal Godil
Information Access Division, ITL, NIST
Outline
• Growth of big datasets
• Introduction to Apache Hadoop and Spark for developing
applications
• Components of Hadoop, HDFS, MapReduce and HBase
• Capabilities of Spark and the differences from a typical
MapReduce solution
• Some Spark use cases for data analysis
Data
• The Large Hadron Collider produces about 30 petabytes of
data per year
• Facebook’s data is growing at 8 petabytes per month
• The New York Stock Exchange generates about 4 terabytes of data per day
• YouTube had around 80 petabytes of storage in 2012
• Internet Archive stores around 19 petabytes of data
Cloud and Distributed Computing
• The second trend is the pervasiveness of cloud-based storage and computational resources
– For processing these big datasets
• Cloud characteristics
– Provide a scalable standard environment
– On-demand computing
– Pay as you go
– Dynamically scalable
– Cheaper
Data Processing and Machine Learning Methods
• Data processing (third trend)
– Traditional ETL (extract, transform, load)
– Data stores (HBase, …)
– Tools for processing of streaming, multimedia & batch data
• Machine Learning (fourth trend)
– Classification
– Regression
– Clustering
– Collaborative filtering
Working at the intersection of these four trends (Big Datasets, Distributed Computing, Data Processing, and Machine Learning) is very exciting and challenging and requires new ways to store and process Big Data.
Hadoop Ecosystem
• Enable Scalability
– on commodity hardware
• Handle Fault Tolerance
• Can Handle a Variety of Data Types
– Text, Graph, Streaming Data, Images,…
• Shared Environment
• Provides Value
– Cost
How does Hadoop differ from HDFS?
Hadoop Ecosystem
A Layer Diagram
[Layer diagram]
Apache Hadoop Basic Modules
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Other modules: Zookeeper, Impala, Oozie, Spark, Storm, Tez, etc.
[Layer diagram, top to bottom]
• Pig (scripting), Hive (SQL-like query), HBase (non-relational database)
• MapReduce (distributed processing), Others (distributed processing)
• YARN (resource manager)
• HDFS (distributed file system / storage)
Hadoop HDFS
• Hadoop Distributed File System (based on the Google File System (GFS) paper, 2004)
– Serves as the distributed file system for most tools in the Hadoop ecosystem
– Scalability for large data sets
– Reliability to cope with hardware failures
• HDFS is good for:
– Large files
– Streaming data
• Not good for:
– Lots of small files
– Random access to files
– Low latency access
A single Hadoop cluster can have 5,000 servers and 250 petabytes of data.
Design of Hadoop Distributed File System (HDFS)
• Master-Slave design
• Master Node
– Single NameNode for managing metadata
• Slave Nodes
– Multiple DataNodes for storing data
• Other
– Secondary NameNode, which performs periodic checkpoints of the NameNode metadata (not a hot standby)
HDFS Architecture
• NameNode keeps the metadata: file names, block locations, and the directory structure
• DataNodes provide storage for blocks of data
[Architecture diagram: a Client communicates with the NameNode and DataNodes; a Secondary NameNode supports the NameNode; the DataNodes exchange heartbeats, commands, and data with the NameNode]
HDFS
What happens if node(s) fail?
Replication of blocks provides fault tolerance
[Diagram: a file is split into blocks B1–B4; each block is replicated on several different nodes, so no single node failure loses data]
HDFS
• HDFS files are divided into blocks
– A block is the basic unit of read/write
– Default size is 64 MB (128 MB in newer versions)
– This makes HDFS well suited to storing large files
• HDFS blocks are replicated multiple times
– Each block is stored at multiple locations, including on different racks (usually 3 replicas)
– This makes HDFS storage fault tolerant and faster to read
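As a quick worked example of what block size and replication imply for storage, the short Python sketch below counts blocks and stored replicas; the 1 GB file size, 128 MB block size, and replication factor of 3 are illustrative assumptions, not values taken from the slides.

  # Quick arithmetic sketch: blocks and replicas for one HDFS file.
  # File size, block size, and replication factor are assumed values.
  file_size_mb = 1024      # a 1 GB file
  block_size_mb = 128      # HDFS block size
  replication = 3          # typical default replication factor

  blocks = -(-file_size_mb // block_size_mb)   # ceiling division -> 8 blocks
  replicas = blocks * replication              # -> 24 block copies on the cluster
  print(f"{blocks} blocks, {replicas} stored replicas")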
Few HDFS Shell commands
Create a directory in HDFS
• hadoop fs -mkdir /user/godil/dir1
List the content of a directory
• hadoop fs -ls /user/godil
Upload and download a file in HDFS
• hadoop fs -put /home/godil/file.txt /user/godil/datadir/
• hadoop fs -get /user/godil/datadir/file.txt /home/
Look at the content of a file
• hadoop fs -cat /user/godil/datadir/book.txt
Many more commands, similar to Unix
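The same commands can also be driven from a script. Below is a minimal, illustrative Python sketch that shells out to hadoop fs with the example paths from this slide; it assumes the hadoop client is on the PATH and that the local and HDFS paths exist.

  # Minimal sketch: running the HDFS shell commands above from Python.
  import subprocess

  subprocess.run(["hadoop", "fs", "-mkdir", "/user/godil/dir1"], check=True)
  subprocess.run(["hadoop", "fs", "-put", "/home/godil/file.txt",
                  "/user/godil/datadir/"], check=True)
  listing = subprocess.run(["hadoop", "fs", "-ls", "/user/godil"],
                           capture_output=True, text=True, check=True)
  print(listing.stdout)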
MapReduce: Simple Programming for Big Data
Based on Google’s MapReduce paper (2004)
• MapReduce is a simple programming paradigm for the Hadoop ecosystem
• Traditional parallel programming requires expertise in several computing/systems concepts
– Examples: multithreading and synchronization mechanisms (locks, semaphores, and monitors)
– Incorrect use can crash your program, produce incorrect results, or severely impact performance
– Usually not fault tolerant to hardware failure
• The MapReduce programming model greatly simplifies running code in parallel
– You don't have to deal with any of the above issues
– You only need to write map and reduce functions
MapReduce Paradigm
• Map and Reduce are based on functional programming

Map: apply a function to all the elements of a list
list1 = [1,2,3,4,5]
square x = x * x
list2 = Map square(list1)
print list2
-> [1,4,9,16,25]

Reduce: combine all the elements of a list into a summary value
list1 = [1,2,3,4,5]
A = reduce (+) list1
print A
-> 15

[Dataflow: Input -> Map -> Reduce -> Output]
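The pseudocode above can be run almost verbatim in plain Python with the built-in map and functools.reduce; this sketch only illustrates the functional idea, not Hadoop itself.

  # Functional map/reduce in plain Python, mirroring the pseudocode above.
  from functools import reduce

  list1 = [1, 2, 3, 4, 5]
  list2 = list(map(lambda x: x * x, list1))   # Map: square every element
  total = reduce(lambda a, b: a + b, list1)   # Reduce: combine into one sum
  print(list2)   # [1, 4, 9, 16, 25]
  print(total)   # 15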
MapReduce Word Count Example
[Diagram: word count with MapReduce]
• An input file containing lines such as "I am Sam" and "Sam I am" is split across several Map nodes
• Each Map task emits a (word, 1) pair for every word it sees, e.g. (I,1), (am,1), (Sam,1)
• The shuffle & sort phase groups the pairs by word and routes each group to one Reduce node
• Each Reduce task sums the counts for its words, producing e.g. (I,2), (am,2), (Sam,2)
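For readers who want to try this, here is a minimal word-count sketch written for Hadoop Streaming in Python rather than the native Java API; the file names mapper.py and reducer.py are illustrative, and the job would be launched with the hadoop-streaming jar shipped with your distribution.

  #!/usr/bin/env python3
  # mapper.py: emit a tab-separated (word, 1) pair for every word on stdin.
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print(f"{word}\t1")

  #!/usr/bin/env python3
  # reducer.py: sum the counts per word; Hadoop delivers keys already sorted,
  # so all pairs for one word arrive together.
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print(f"{current_word}\t{current_count}")
          current_word, current_count = word, int(count)
  if current_word is not None:
      print(f"{current_word}\t{current_count}")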
Shortcomings of MapReduce
• Forces your data processing into Map and Reduce
– Other workflows missing include join, filter, flatMap,
groupByKey, union, intersection, …
• Based on “Acyclic Data Flow” from Disk to Disk (HDFS)
• Read and write to Disk before and after Map and Reduce
(stateless machine)
– Not efficient for iterative tasks, e.g. machine learning
• Only Java natively supported
– Support for other languages needed
• Only for batch processing
– No support for interactivity or streaming data
One Solution is Apache Spark
• A new general framework, which solves many of the shortcomings of MapReduce
• It is capable of leveraging the Hadoop ecosystem, e.g. HDFS, YARN, HBase, S3, …
• Has many other workflows, e.g. join, filter, flatMap, distinct, groupByKey, reduceByKey, sortByKey, collect, count, first, … (see the PySpark sketch below)
– (around 30 efficient distributed operations)
• In-memory caching of data (for iterative, graph, and machine learning algorithms, etc.)
• Native Scala, Java, Python, and R support
• Supports interactive shells for exploratory data analysis
• Spark API is extremely simple to use
• Developed at the AMPLab at UC Berkeley, now by Databricks.com
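As a concrete illustration of that API, here is a minimal PySpark word-count sketch; the input path reuses the hypothetical book.txt file from the HDFS slide, and the cluster setup (master, deployment mode) is assumed.

  # Minimal PySpark word-count sketch (input path and cluster setup are assumptions).
  from pyspark import SparkContext

  sc = SparkContext(appName="WordCount")
  lines = sc.textFile("hdfs:///user/godil/datadir/book.txt")   # hypothetical path
  counts = (lines.flatMap(lambda line: line.split())           # words
                 .map(lambda word: (word, 1))                  # (word, 1) pairs
                 .reduceByKey(lambda a, b: a + b))             # sum counts per word
  print(counts.take(5))
  sc.stop()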
Spark Uses Memory instead of Disk
• Hadoop: uses disk for data sharing. Each iteration reads its input from HDFS and writes its output back to HDFS before the next iteration can start.
• Spark: in-memory data sharing. Data is read from HDFS once, then kept in memory between iterations.
[Diagram: Iteration 1 -> Iteration 2, with an HDFS read and write around every Hadoop iteration versus a single HDFS read for Spark]
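A minimal sketch of why this matters for iterative algorithms is shown below; the input file, its numeric format, and the number of iterations are assumptions. The RDD is parsed once, cached in memory, and then reused by every iteration instead of being re-read from HDFS.

  # Illustrative sketch of in-memory data sharing across iterations in PySpark.
  from pyspark import SparkContext

  sc = SparkContext(appName="IterativeSketch")
  points = (sc.textFile("hdfs:///user/godil/datadir/points.txt")   # hypothetical path
              .map(lambda line: [float(x) for x in line.split()])
              .cache())    # keep the parsed data in memory across iterations

  for i in range(10):
      # Each pass reuses the cached RDD instead of re-reading and re-parsing from HDFS.
      total = points.map(lambda p: sum(p)).reduce(lambda a, b: a + b)
      print(f"iteration {i}: total = {total}")
  sc.stop()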