
Big Data & Hadoop - Course Curriculum

This 10 module course provides in-depth knowledge of concepts related to Big Data and Hadoop. The course covers topics such as Hadoop architecture, HDFS, MapReduce, Pig, Hive, HBase, Zookeeper, Oozie and how these components work together in a Hadoop implementation. Students will learn through hands-on demos and projects to solve Big Data problems using the Hadoop ecosystem.

Uploaded by

manish
Copyright
© All Rights Reserved

Course Curriculum: Your 10 Module Learning Plan

Big Data and Hadoop

About Edureka

Edureka is a leading e-learning platform providing live instructor-led interactive online
training. We cater to professionals and students across the globe in categories like Big Data
& Hadoop, Business Analytics, NoSQL Databases, Java & Mobile Technologies, System
Engineering, Project Management and Programming.

We have an easy and affordable learning solution that is accessible to millions of learners.
With our students spread across countries and regions like the US, India, UK, Canada,
Singapore, Australia, the Middle East, Brazil and many others, we have built a community of
over 1 million learners across the globe.

About The Course


The Big Data and Hadoop training course is designed to provide the knowledge and skills
needed to become a successful Hadoop Developer. The course covers, in depth, concepts such
as the Hadoop Distributed File System, Hadoop clusters (single and multi-node), Hadoop 2.x,
Flume, Sqoop, MapReduce, Pig, Hive, HBase, ZooKeeper and Oozie.

Module 1
Understanding Big Data and Hadoop
Learning Objectives - In this module, you will understand Big Data, the limitations of the existing solutions for Big Data problem, how Hadoop solves the Big Data problem, the common Hadoop ecosystem components, Hadoop Architecture, HDFS, Anatomy of File Write and Read, Rack Awareness.

Topics
Big Data
Limitations and Solutions of existing Data Analytics Architecture
Hadoop
Hadoop Features
Hadoop Ecosystem
Hadoop 2.x core components
Hadoop Storage: HDFS, Hadoop Processing: MapReduce Framework
Anatomy of File Write and Read, Rack Awareness

Module 2
Hadoop Architecture and HDFS
Learning Objectives - In this module, you will learn the Hadoop Cluster Architecture, Important Configuration files in a Hadoop Cluster, Data Loading Techniques.

Topics
Hadoop 2.x Cluster Architecture - Federation and High Availability
A Typical Production Hadoop Cluster
Hadoop Cluster Modes
Common Hadoop Shell Commands
Hadoop 2.x Configuration Files
Password-Less SSH
MapReduce Job Execution
Data Loading Techniques: Hadoop Copy Commands, FLUME, SQOOP

Module 3
Hadoop MapReduce Framework - I
Learning Objectives - In this module, you will understand Hadoop MapReduce framework and the working of MapReduce on data stored in HDFS. You will learn about YARN concepts in MapReduce.

Topics
MapReduce Use Cases
Traditional way Vs MapReduce way
Why MapReduce
Hadoop 2.x MapReduce Architecture
Hadoop 2.x MapReduce Components
YARN MR Application Execution Flow
YARN Workflow
Anatomy of MapReduce Program
Demo on MapReduce

Module 4
Hadoop MapReduce Framework - II
Learning Objectives - In this module, you will understand concepts like Input Splits in MapReduce, Combiner & Partitioner and Demos on MapReduce using different data sets.

Topics
Input Splits
Relation between Input Splits and HDFS Blocks
MapReduce Job Submission Flow
Demo of Input Splits
MapReduce: Combiner & Partitioner
Demo on de-identifying Health Care Data set
Demo on Weather Dataset

Module 5
Advance MapReduce
Learning Objectives - In this module, you will learn Advance MapReduce concepts such as Counters, Distributed Cache, MRUnit, Reduce Join, Custom Input Format, Sequence Input Format and how to deal with complex MapReduce programs.

Topics
Counters
Distributed Cache
MRUnit
Reduce Join
Custom Input Format
Sequence Input Format

Module 6
Pig
Learning Objectives - In this module, you will learn Pig, the types of use cases where Pig can be used, the tight coupling between Pig and MapReduce, and Pig Latin scripting.

Topics
About Pig
MapReduce Vs Pig
Pig Use Cases
Programming Structure in Pig
Pig Running Modes
Pig components
Pig Execution
Pig Latin Program
Data Models in Pig
Pig Data Types
Pig Latin: Relational Operators, File Loaders, Group Operator, COGROUP Operator, Joins and COGROUP, Union, Diagnostic Operators
Pig UDF
Pig Demo on Healthcare Data set

Module 7
Hive
Learning Objectives - This module will help you in understanding Hive concepts, Loading and Querying Data in Hive and Hive UDF.

Topics
Hive Background
Hive Use Case
About Hive
Hive Vs Pig
Hive Architecture and Components
Metastore in Hive
Limitations of Hive
Comparison with Traditional Database
Hive Data Types and Data Models
Partitions and Buckets
Hive Tables (Managed Tables and External Tables)
Importing Data
Querying Data
Managing Outputs
Hive Script
Hive UDF
Hive Demo on Healthcare Data set

Module 8
Advance Hive and HBase
Learning Objectives - In this module, you will understand Advance Hive concepts such as UDF, dynamic Partitioning. You will also acquire in-depth knowledge of HBase, HBase Architecture and its components.

Topics
Hive QL: Joining Tables, Dynamic Partitioning, Custom Map/Reduce Scripts
Hive: Thrift Server, User Defined Functions
HBase: Introduction to NoSQL Databases and HBase, HBase v/s RDBMS, HBase Components, HBase Architecture, HBase Cluster Deployment

Module 9
Advance HBase
Learning Objectives - This module will cover Advance HBase concepts. We will see demos on Bulk Loading and Filters. You will also
learn what ZooKeeper is, how it helps in monitoring a cluster, and why HBase uses ZooKeeper.

Topics
HBase Data Model
HBase Shell
HBase Client API
Data Loading Techniques
ZooKeeper Data Model
ZooKeeper Service
ZooKeeper
Demos on Bulk Loading
Getting and Inserting Data
Filters in HBase

Module 10
Oozie and Hadoop Project
Learning Objectives - In this module, you will understand how multiple Hadoop ecosystem components work together in a
Hadoop implementation to solve Big Data problems. We will discuss multiple data sets and specifications of the project. This
module will also cover a Flume & Sqoop demo and the Apache Oozie Workflow Scheduler for Hadoop jobs.

Topics
Flume and Sqoop Demo
Oozie
Oozie Components
Oozie Workflow
Scheduling with Oozie
Demo on Oozie Workflow
Oozie Co-ordinator
Oozie Commands
Oozie Web Console
Hadoop Project Demo

Project Work

Towards the end of the course, you will work on a live project in which you use Pig, Hive, HBase and MapReduce to
perform Big Data analytics.
Here are a few industry-wise Big Data case studies (e.g. Finance, Retail, Media, Aviation) that you can take up as your
project work:

Project #1: Analyze social bookmarking sites to find insights


Industry: Social Media
Data: Information gathered from bookmarking sites such as reddit.com and stumbleupon.com. A bookmarking site
allows you to bookmark, review, rate and search various links on any topic. The data is in XML format and contains the URLs
of various links/posts, the categories defining them and the ratings linked with them.
Problem Statement: Analyze the data in the Hadoop ecosystem to:
1. Fetch the data into the Hadoop Distributed File System and analyze it with the help of MapReduce, Pig and Hive to find the top
rated links based on the user comments, likes etc.
2. Using MapReduce, convert the semi-structured format (XML data) into a structured format and categorize the user rating as
positive or negative for each of the thousand links.
3. Push the output to HDFS and then feed it into Pig, which splits the data into two parts: Category data and Ratings data.
4. Write a Hive query to analyze the data further and push the output into a relational database (RDBMS) using Sqoop.
5. Use a web server running on Grails/Java/Ruby/Python that renders the result in real time on a website.
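
The per-record logic of step 2 can be sketched in plain Python, outside Hadoop. This is only an illustration under stated assumptions, not the course's solution: the element names (post, url, category, rating), the sample records and the cut-off of 3 for a positive rating are all hypothetical, because the real XML schema is not given here. In a MapReduce job, the same logic would sit inside the mapper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real dataset's element names are not specified.
SAMPLE_XML = """
<posts>
  <post><url>http://example.com/a</url><category>tech</category><rating>5</rating></post>
  <post><url>http://example.com/b</url><category>news</category><rating>1</rating></post>
  <post><url>http://example.com/c</url><category>tech</category><rating>3</rating></post>
</posts>
"""

POSITIVE_THRESHOLD = 3  # assumed cut-off: ratings >= 3 count as positive


def xml_to_rows(xml_text):
    """Flatten each <post> element into a (url, category, rating, label) tuple."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for post in root.iter("post"):
        url = post.findtext("url")
        category = post.findtext("category")
        rating = int(post.findtext("rating"))
        label = "positive" if rating >= POSITIVE_THRESHOLD else "negative"
        rows.append((url, category, rating, label))
    return rows


rows = xml_to_rows(SAMPLE_XML)
for row in rows:
    # Tab-separated output is easy to load into Pig or Hive later.
    print("\t".join(str(field) for field in row))
```

Emitting tab-separated rows is one possible choice here; it keeps the structured output directly loadable by the Pig and Hive steps that follow.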

Project #2: Customer Complaints Analysis


Industry: Retail
Data: Publicly available dataset containing a few hundred thousand observations with attributes like: CustomerId, Payment Mode, Product
Details, Complaint, Location, Status of the complaint, etc.
Problem Statement: Analyze the data in Hadoop Eco-system to:
1. Get the number of complaints filed for each product
2. Get the total number of complaints filed from a particular location
3. Get the list of complaints that received no timely response, grouped by location
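
The three questions above are all group-and-count operations. A minimal Python sketch of that aggregation follows; the column names (customer_id, product, location, timely_response) and the sample rows are assumptions made for illustration, since the actual file layout is not described here. On a real cluster the same grouping would typically be written as a Hive GROUP BY or a Pig GROUP.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows; the column names are assumptions.
SAMPLE_CSV = """customer_id,product,location,timely_response
c1,Loan,Pune,yes
c2,Card,Delhi,no
c3,Loan,Delhi,no
c4,Loan,Pune,yes
"""


def analyze(csv_text):
    """Return complaints per product, per location, and late ones by location."""
    per_product = Counter()
    per_location = Counter()
    late_by_location = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        per_product[row["product"]] += 1
        per_location[row["location"]] += 1
        if row["timely_response"] == "no":
            late_by_location[row["location"]].append(row["customer_id"])
    return per_product, per_location, dict(late_by_location)


per_product, per_location, late_by_location = analyze(SAMPLE_CSV)
print("Complaints per product:", dict(per_product))
print("Complaints per location:", dict(per_location))
print("No timely response, by location:", late_by_location)
```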

Project #3: Tourism Data Analysis


Industry: Tourism
Data: The dataset comprises attributes like: City pair (Combination of from and to), Adults traveling, Seniors traveling, Children
traveling, Air booking price, Car booking price, etc.
Problem Statement: Find the following insights from the data:
1. Top 20 destinations people travel to most: the most popular destinations, ranked by the number of trips booked for each
destination
2. Top 20 locations from where most of the trips start, based on booked trip count
3. Top 20 high air-revenue destinations, i.e. the 20 cities that generate the highest airline revenue, so that discount offers
can be given to attract more bookings for these destinations
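
A top-N ranking like the ones above can be sketched with a counter per question. The example below is an assumption-laden illustration: it supposes that a city pair is encoded as "FROM-TO", uses made-up bookings, and ranks the top 2 rather than the top 20 to keep the sample small.

```python
from collections import Counter

# Hypothetical bookings: (city_pair, air_booking_price).
# The "FROM-TO" encoding of the city pair is an assumption.
BOOKINGS = [
    ("DEL-BOM", 120.0),
    ("BLR-BOM", 90.0),
    ("DEL-GOI", 200.0),
    ("BOM-GOI", 150.0),
    ("BLR-DEL", 80.0),
    ("DEL-BOM", 110.0),
]


def top_n(bookings, n):
    """Rank destinations by trips, origins by trips, destinations by air revenue."""
    trips_to = Counter()
    trips_from = Counter()
    air_revenue_to = Counter()
    for city_pair, air_price in bookings:
        origin, destination = city_pair.split("-")
        trips_from[origin] += 1
        trips_to[destination] += 1
        air_revenue_to[destination] += air_price
    return (
        trips_to.most_common(n),
        trips_from.most_common(n),
        air_revenue_to.most_common(n),
    )


top_destinations, top_origins, top_air_revenue = top_n(BOOKINGS, n=2)
print("Top destinations:", top_destinations)
print("Top origins:", top_origins)
print("Top air-revenue destinations:", top_air_revenue)
```

Note that in this sample the busiest destination and the highest air-revenue destination differ, which is why the third question is tracked separately from the first.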

Project #4: Airline Data Analysis
Industry: Aviation
Data: Publicly available dataset containing the flight details of various airlines, such as: Airport id, Name of the airport, Main city
served by airport, Country or territory where airport is located, Code of Airport, Decimal degrees, Hours offset from UTC,
Timezone, etc.
Problem Statement: Analyze the airlines data to:
1. Find the list of airports operating in the country
2. Find the list of airlines having zero stops
3. Find the list of airlines operating with code share
4. Find which country or territory has the highest number of airports
5. Find the list of active airlines in the United States

Project #5: Analyze Loan Dataset


Industry: Banking and Finance
Data: Publicly available dataset which contains complete details of all the loans issued, including the current loan status (Current,
Late, Fully Paid, etc.) and latest payment information.
Problem Statement: Find the number of cases per location, categorize the count by the reason for taking the loan, and
display the average risk score.

Project #6: Analyze Movie Ratings


Industry: Media
Data: Publicly available data from sites like Rotten Tomatoes, IMDb, etc.
Problem Statement: Analyze the movie ratings by different users to:
1. Get the user who has rated the most movies
2. Get the user who has rated the fewest movies
3. Get the total number of movies rated by users belonging to a specific occupation
4. Get the number of under-age users
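
To connect these questions to the MapReduce model taught in the course, the sketch below simulates the map, shuffle and reduce phases in memory for the ratings-per-user count. Everything specific in it is an assumption: the record layout (user_id, age, occupation, movie_id, rating), the sample data and the age of 18 used to decide who is under age. It is not Hadoop code, only an illustration of the data flow.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (user_id, age, occupation, movie_id, rating).
RATINGS = [
    ("u1", 25, "engineer", "m1", 4),
    ("u1", 25, "engineer", "m2", 5),
    ("u1", 25, "engineer", "m3", 3),
    ("u2", 16, "student", "m1", 2),
    ("u3", 40, "engineer", "m2", 4),
    ("u3", 40, "engineer", "m3", 5),
]

ADULT_AGE = 18  # assumed threshold for "under age"


def map_phase(records):
    """Emit one (user_id, 1) pair per rating record."""
    for user_id, _age, _occupation, _movie_id, _rating in records:
        yield user_id, 1


def reduce_phase(pairs):
    """Sort pairs by key (the shuffle step), then sum the values per key."""
    counts = {}
    shuffled = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(shuffled, key=itemgetter(0)):
        counts[key] = sum(value for _key, value in group)
    return counts


ratings_per_user = reduce_phase(map_phase(RATINGS))
most_active = max(ratings_per_user, key=ratings_per_user.get)
least_active = min(ratings_per_user, key=ratings_per_user.get)
engineer_ratings = sum(1 for record in RATINGS if record[2] == "engineer")
under_age_users = {record[0] for record in RATINGS if record[1] < ADULT_AGE}

print("Ratings per user:", ratings_per_user)
print("Most active:", most_active, "Least active:", least_active)
print("Ratings by engineers:", engineer_ratings)
print("Under-age users:", len(under_age_users))
```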

Project #7: Analyze YouTube data


Industry: Social Media
Data: The dataset describes YouTube videos and contains attributes like: VideoID, Uploader, Age, Category, Length, views, ratings,
comments, etc.
Problem Statement: Find the top 5 categories in which the most videos are uploaded, the top 10 rated videos, and the
top 10 most viewed videos.
Apart from these, there are some twenty more use cases to choose from, such as:
Market Data Analysis
Twitter Data Analysis
Olympics Data Analysis
