Big Data Hadoop and Spark Developer
Lesson 1—Introduction to Big Data and Hadoop
© Simplilearn. All rights reserved.
Learning Objectives
Discuss the basics of big data with a case study
Explain the basics of Hadoop
Describe the components of the Hadoop Ecosystem
Introduction to Big Data and Hadoop
Topic 1—Introduction to Big Data
Data Is Exploding
IBM reported that 2.5 billion gigabytes of data was generated every day in 2012. It is predicted that by 2020:
• About 1.7 megabytes of new information will be generated for every human, every second
• 40,000 search queries will be performed on Google every second
• 300 hours of video will be uploaded to YouTube every minute
• 31.25 million messages will be sent and 2.77 million videos viewed by Facebook users every minute
• 80% of photos will be taken on smartphones
• At least a third of all data will pass through the cloud
Data Is Exploding (Contd.)
By 2020, data will show an exponential rise!
[Chart: global data volume in zettabytes (ZB), rising exponentially toward 2020]
What Is Big Data?
Big data refers to large volumes of structured and unstructured data. Analyzing big data leads to better business insights.
Big Data: Case Study
NETFLIX
Netflix is one of the largest providers of commercial streaming video in the US with a customer base of
over 29 million.
It receives a huge volume of behavioral data.
• When do users watch a show?
• Where do they watch it?
• On which device do they watch the show?
• How often do they pause a program?
• How often do they re-watch a program?
• Do they skip the credits?
• What are the keywords searched?
Big Data: Case Study
NETFLIX
Traditionally, such data was analyzed using a computer algorithm designed to produce a correct solution for any given instance.
As the data started to grow, groups of networked computers were employed to do the analysis. These groups are known as distributed systems.
Distributed Systems
A distributed system is a model in which components located on networked
computers communicate and coordinate their actions by passing messages.
https://en.wikipedia.org/wiki/Distributed_computing
How Does a Distributed System Work?
[Diagram: a 1-terabyte dataset divided across several machines, each storing and processing its own portion of the data]
In recent times, custom-built distributed systems have largely been replaced by Hadoop.
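The underlying idea — split the data, process the pieces in parallel, and combine the results — can be sketched on a single machine with local processes. This is a toy illustration only, not how production distributed systems are built; the data and worker count are made up:

```python
# Toy illustration of split -> process in parallel -> combine,
# using local processes in place of networked machines.
from multiprocessing import Pool

def count_words(chunk):
    # Each "worker machine" processes its own slice of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data is big"] * 1_000_000   # stand-in for a large file
    n = 4                                      # pretend we have 4 machines
    chunks = [lines[i::n] for i in range(n)]   # split the data 4 ways
    with Pool(n) as pool:
        partial = pool.map(count_words, chunks)  # process chunks in parallel
    print(sum(partial))                        # combine the partial results
```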
Challenges of Distributed Systems
1. High chances of system failure
2. Limited bandwidth
3. High programming complexity
Hadoop is used to overcome these challenges!
Introduction to Big Data and Hadoop
Topic 2—Introduction to Hadoop
What Is Hadoop?
Hadoop is a framework that allows distributed processing of large datasets across
clusters of computers using simple programming models.
Doug Cutting created Hadoop and named it after his son’s yellow toy
elephant. It was inspired by technical papers published by Google.
https://twitter.com/cutting
Characteristics of Hadoop
• Scalable: supports both horizontal and vertical scaling
• Reliable: stores copies of the data on different machines and is resistant to hardware failure
• Flexible: stores a lot of data and enables you to use it later
• Economical: ordinary computers can be used for data processing
Traditional Database Systems vs. Hadoop
Traditional Database Systems: Data is stored in a central location and sent to the processor at run time.
Hadoop: The program goes to the data. Data is first distributed to multiple systems, and the computation then runs wherever the data is located.

Traditional Database Systems: Cannot be used to process and store large amounts of data (big data).
Hadoop: Works better when the data size is big; it can process and store large amounts of data easily and effectively.

Traditional Database Systems: A traditional RDBMS manages only structured and semi-structured data; it cannot manage unstructured data.
Hadoop: Can process and store a variety of data, whether structured or unstructured.
Hadoop Core Components
[Diagram: the three layers of the Hadoop core]
• Storage: HDFS (Hadoop Distributed File System)
• Resource management: YARN
• Data processing: MapReduce or Spark
Introduction to Big Data and Hadoop
Topic 3—Components of Hadoop Ecosystem
Components of Hadoop Ecosystem
[Diagram: the Hadoop ecosystem layered on the Hadoop core]
• Data ingestion: Sqoop, Flume
• Storage: distributed file system (HDFS), NoSQL (HBase)
• Cluster resource management: YARN
• Data processing: Spark, MapReduce
• Data analysis: Pig, Hive, Impala
• Data exploration: Hue, Cloudera Search
• Workflow system: Oozie
Components of Hadoop Ecosystem
HDFS (HADOOP DISTRIBUTED FILE SYSTEM)
• HDFS is a storage layer of Hadoop suitable for distributed storage and processing.
• It provides file permissions, authentication, and streaming access to file system data.
HDFS can be accessed through the Hadoop command line interface, as sketched below.
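A minimal sketch of driving that command line from Python; it assumes a configured Hadoop client with `hdfs` on the PATH, and the paths are illustrative:

```python
# Drive the HDFS command line (`hdfs dfs ...`) from Python.
import subprocess

def hdfs(*args):
    # Runs `hdfs dfs <args>` and returns the command's stdout.
    result = subprocess.run(["hdfs", "dfs", *args],
                            capture_output=True, text=True, check=True)
    return result.stdout

hdfs("-mkdir", "-p", "/user/demo")           # create a directory in HDFS
hdfs("-put", "local.txt", "/user/demo/")     # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))             # list the directory
print(hdfs("-cat", "/user/demo/local.txt"))  # stream the file's contents
```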
Components of Hadoop Ecosystem
HBase
• HBase is a NoSQL database or non-relational database that stores data in HDFS.
• It supports high volumes of data and high throughput.
• It is used when you need random, real-time read/write access to your big data.
HBase tables can have thousands of columns.
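A minimal sketch of that random, real-time read/write access using the third-party happybase Python client; it assumes HBase's Thrift server is running, and the table and column family names are illustrative:

```python
# Random read/write access to HBase via the third-party `happybase` client.
# Assumes a table 'users' with column family 'info' already exists.
import happybase

connection = happybase.Connection("localhost")  # connect to the Thrift server
table = connection.table("users")

# Random write: store a few columns under one row key.
table.put(b"user-42", {b"info:name": b"Ada", b"info:plan": b"premium"})

# Random read: fetch that single row back by key, in real time.
row = table.row(b"user-42")
print(row[b"info:name"])  # b'Ada'

connection.close()
```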
Components of Hadoop Ecosystem
SQOOP
• Sqoop is a tool designed to transfer data between Hadoop and relational database
servers.
• It is used to import data from relational databases such as Oracle and MySQL to HDFS
and export data from HDFS to relational databases.
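A sketch of a typical Sqoop import, launched from Python; it assumes the `sqoop` CLI is installed, and the MySQL database, table, and credentials are illustrative:

```python
# Import a relational table into HDFS with Sqoop.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",  # source relational database
    "--username", "etl",
    "--password", "secret",
    "--table", "orders",                      # table to import
    "--target-dir", "/user/demo/orders",      # destination in HDFS
    "--num-mappers", "4",                     # parallel import tasks
], check=True)
```

The reverse direction works the same way with `sqoop export` and an `--export-dir` flag pointing at the HDFS data.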
Components of Hadoop Ecosystem
FLUME
• Flume is a distributed service for ingesting streaming data, suited for event data from multiple systems.
• It has a simple and flexible architecture based on streaming data flows.
• It is robust and fault tolerant and has tunable reliability mechanisms.
• It uses a simple extensible data model that allows for online analytic application.
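A minimal sketch of Flume's source → channel → sink data flow, written out as an agent configuration. The agent name, netcat source, and HDFS path are illustrative; Python is used here only to materialize the properties file that Flume itself reads:

```python
# A minimal Flume agent config: TCP source -> memory channel -> HDFS sink.
flume_conf = """
# source: events arriving on a TCP port
a1.sources = r1
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# channel: in-memory buffer connecting source to sink
a1.channels = c1
a1.channels.c1.type = memory

# sink: deliver buffered events into HDFS
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/demo/events
a1.sinks.k1.channel = c1
"""

with open("agent.conf", "w") as f:
    f.write(flume_conf)
# Start the agent with: flume-ng agent --name a1 --conf-file agent.conf
```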
Components of Hadoop Ecosystem
SPARK
Spark is an open-source cluster computing framework that supports machine learning, business intelligence, streaming, and batch processing.
Spark solves similar problems as Hadoop MapReduce but uses a fast in-memory approach and a clean, functional-style API.
Spark and MapReduce will be discussed in the upcoming lessons.
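As a quick preview, a minimal PySpark sketch of that in-memory, functional style: the dataset is loaded once, cached in memory, and reused by two computations (the data and numbers are illustrative):

```python
# Cache a dataset in memory and run two computations over it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 1_000_001)).cache()  # keep the dataset in memory

evens = rdd.filter(lambda x: x % 2 == 0).count()   # first pass over cached data
total = rdd.map(lambda x: x * 2).sum()             # second pass reuses the cache

print(evens, total)
spark.stop()
```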
Components of Hadoop Ecosystem
SPARK: COMPONENTS
Apache Spark consists of:
• Spark Core and Resilient Distributed Datasets (RDDs)
• Spark SQL
• Spark Streaming
• Machine Learning Library (MLlib)
• GraphX
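A small sketch of two of these components working together: a DataFrame built on Spark Core, queried through Spark SQL (the column names and rows are illustrative):

```python
# Build a DataFrame on Spark Core and query it with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("components").getOrCreate()

df = spark.createDataFrame(
    [("Ada", 35), ("Linus", 28), ("Grace", 41)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")  # expose the DataFrame to Spark SQL

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```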
Components of Hadoop Ecosystem
HADOOP MAPREDUCE
• Hadoop MapReduce is a framework that processes data. It is the original Hadoop
processing engine, which is primarily Java-based.
• It is based on the map and reduce programming model.
• It has an extensive and mature fault tolerance.
• Hive and Pig are built on the map-reduce model.
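MapReduce itself is primarily Java-based, but the map/reduce model is easy to sketch in Python (which Hadoop Streaming also accepts). The snippet below simulates word count locally; the sort stands in for Hadoop's shuffle-and-sort step between map and reduce, and in a real job the mapper and reducer would run as separate scripts:

```python
# Word count in the map/reduce model: the mapper emits (word, 1) pairs,
# and the reducer sums the counts for each word.
import sys
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop delivers mapper output to the reducer sorted by key.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation of the shuffle-and-sort step.
    pairs = sorted(mapper(sys.stdin), key=lambda kv: kv[0])
    for word, count in reducer(pairs):
        print(f"{word}\t{count}")
```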
Components of Hadoop Ecosystem
PIG
• Once the data is processed, it is analyzed using Pig, an open-source, high-level dataflow system.
• Pig converts its scripts to Map and Reduce code, reducing the effort of writing complex map-reduce programs.
• Ad-hoc queries like Filter and Join, which are difficult to perform in MapReduce, can be
easily done using Pig.
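A sketch of what such a Pig Latin script looks like; the file, field, and alias names are illustrative, and Python is used here only to materialize the script file. Pig compiles these few lines into map and reduce jobs:

```python
# A small Pig Latin script: load, filter, group, and count.
pig_script = """
logs    = LOAD 'access_log.txt' AS (user:chararray, status:int);
errors  = FILTER logs BY status >= 500;  -- ad-hoc filter, hard in raw MapReduce
by_user = GROUP errors BY user;
counts  = FOREACH by_user GENERATE group AS user, COUNT(errors);
DUMP counts;
"""

with open("errors.pig", "w") as f:
    f.write(pig_script)
# Run locally with: pig -x local errors.pig
```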
Components of Hadoop Ecosystem
IMPALA
• It is an open-source, high-performance SQL engine that runs on the Hadoop cluster.
• It is ideal for interactive analysis and has very low latency, measured in milliseconds.
• Impala supports a dialect of SQL, so data in HDFS is modeled as a database table.
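A sketch of an interactive Impala query from Python using the third-party impyla client; the host, table, and query are illustrative, and 21050 is Impala's usual client port:

```python
# Query Impala interactively via the third-party `impyla` client.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cursor = conn.cursor()

# Data in HDFS is modeled as a table, so plain SQL works.
cursor.execute("SELECT device, COUNT(*) FROM views GROUP BY device")
for row in cursor.fetchall():
    print(row)

conn.close()
```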
Components of Hadoop Ecosystem
HIVE
• Hive is an abstraction layer on top of Hadoop that executes queries using MapReduce.
• It is preferred for data processing, ETL (Extract, Transform, Load), and ad hoc queries.
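A sketch of running a Hive query from Python with the third-party PyHive package; the host and table names are illustrative, and 10000 is HiveServer2's usual port. Behind the scenes, Hive turns the SQL into MapReduce jobs:

```python
# Run a Hive query via the third-party PyHive package.
from pyhive import hive

conn = hive.Connection(host="hive-host", port=10000)
cursor = conn.cursor()

# A typical ETL-style aggregation expressed as SQL.
cursor.execute("SELECT country, COUNT(*) FROM users GROUP BY country")
for row in cursor.fetchall():
    print(row)

conn.close()
```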
Components of Hadoop Ecosystem
CLOUDERA SEARCH
• It is Cloudera's near-real-time access product that enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase.
• Cloudera Search is a fully integrated data processing platform. It uses the flexible, scalable, and robust storage system included with CDH (Cloudera's Distribution including Apache Hadoop).
Components of Hadoop Ecosystem
OOZIE
• Oozie is a workflow or coordination system used to manage Hadoop tasks.
• Oozie coordinator can trigger jobs by time (frequency) and data availability.
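A sketch of the shape of an Oozie workflow definition; all names are illustrative, and a real action node would declare an actual Hadoop task (a Pig, Hive, or shell job). Oozie reads XML like this from HDFS, and a coordinator can trigger it on a schedule or when input data arrives:

```python
# The skeleton of an Oozie workflow: start -> action -> ok/error -> end.
workflow_xml = """
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="step1"/>
    <action name="step1">
        <!-- a Hadoop task (e.g., a Pig or Hive job) would be declared here -->
        <ok to="end"/>       <!-- on success, continue the workflow -->
        <error to="fail"/>   <!-- on failure, branch to the kill node -->
    </action>
    <kill name="fail">
        <message>step1 failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
"""

with open("workflow.xml", "w") as f:
    f.write(workflow_xml)
```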
Components of Hadoop Ecosystem
OOZIE APPLICATION LIFECYCLE
[Diagram: the Oozie coordinator engine triggers the Oozie workflow engine, which runs a workflow of actions from Start through Action1, Action2, and Action3 to End]
Components of Hadoop Ecosystem
HUE (HADOOP USER EXPERIENCE)
• Hue is an acronym for Hadoop User Experience. It is an open-source web interface for analyzing data with Hadoop.
• It provides SQL editors for Hive, Impala, MySQL, Oracle, PostgreSQL, Spark SQL, and Solr SQL.
Big Data Processing
Components of the Hadoop ecosystem work together to process big data. There are four stages of big data processing:
1. Ingest: data is brought into Hadoop (Flume, Sqoop)
2. Processing: data is stored and processed (HDFS, HBase; Spark, MapReduce)
3. Analyze: the data is analyzed (Pig, Hive, Impala)
4. Access: users search and explore the data (Hue, Cloudera Search)
Key Takeaways
Hadoop is a framework for distributed storage and processing.
Core components of Hadoop include HDFS for storage, YARN for cluster-resource
management, and MapReduce or Spark for processing.
The Hadoop ecosystem includes multiple components that support each stage of
big data processing:
• Flume and Sqoop ingest data
• HDFS and HBase store data
• Spark and MapReduce process data
• Pig, Hive, and Impala analyze data
• Hue and Search help to explore data
• Oozie manages the workflow of Hadoop tasks
Quiz
QUIZ
1. What is a distributed system?
a. One machine processing a file
b. Multiple machines processing a file
c. A traditional system
d. In-memory computation
The correct answer is b.
In distributed systems, you use multiple machines to process one file.
QUIZ
2. What is Hadoop?
a. It is an in-memory tool used in Mahout algorithm computing.
b. It is a computing framework used for resource management.
c. It is a framework that allows for distributed processing of large datasets across clusters of commodity computers using a simple programming model.
d. It is a search and analytics tool that provides access to analyze data.
The correct answer is c.
Hadoop is a framework that allows for distributed processing of large datasets across clusters of
commodity computers using a simple programming model.
QUIZ
3. Which of the following is NOT a key characteristic of Hadoop?
a. Economical
b. Adaptable
c. Flexible
d. Reliable
The correct answer is b.
The four key characteristics of Hadoop are that it is economical, reliable, scalable, and flexible.
QUIZ
4. Which of the following is used in the data storage stage of big data processing?
a. Impala
b. Spark
c. Hive
d. HDFS/HBase
The correct answer is d.
HDFS and HBase are used in the data storage stage.
QUIZ
5. Sqoop is used to _______.
a. import data from relational databases to Hadoop HDFS and export from the Hadoop file system to relational databases
b. execute queries using MapReduce
c. enable non-technical users to search and explore data stored in or ingested into Hadoop and HBase
d. stream event data from multiple systems
The correct answer is a.
Sqoop is used to import data from relational databases to Hadoop HDFS and export from the Hadoop file system to relational databases.
This concludes “Introduction to Big Data and
Hadoop.”
The next lesson is “HDFS and YARN.”
© Simplilearn. All rights reserved.