Academic Session 2025-26
ODD Semester Jul-Dec 2025
UNIVERSITY INSTITUTE OF ENGINEERING
APEX INSTITUTE OF TECHNOLOGY
B.E CSE/ Big Data
(5th Semester)
Big Data Engineering
(22CSH-342)
Unit No. 1, Chapter No. 3, Lecture No. 13 and 14
Topic : Dr. Monica Luthra (E9836), Associate Professor
Big Data Engineering: COURSE OBJECTIVES
The course aims to:
1. Make students understand Big Data and Data Analytics, the Hortonworks Data
Platform (HDP), and Apache Ambari
2. Make students understand storing and querying data, ZooKeeper, Slider, and
Knox, and loading data with Sqoop
3. Enable students to develop and implement stream computing
4. Introduce students to the fundamentals of Apache Flink and its role in real-time
stream processing
COURSE OUTCOMES
On completion of this course, the students shall be able to:
CO1 Understand the fundamentals of Big Data and Data Analytics, including the components of
Hortonworks Data Platform (HDP) and the significance of Apache Hadoop
CO2 Explain the process of loading, visualizing, and pre-processing data models in the context
of Big Data and Analytics
CO3 Apply simple learning strategies using stream computing to identify and implement
effective data processing techniques
CO4 Evaluate the security and optimization measures for Big SQL environments
CO5 Create the functionalities of Watson Studio, detailing the process of creating and managing
projects, adding collaborators, and efficiently handling data within the platform
Unit-1 Syllabus
Unit-1 Introduction to Big Data Analytics
Chapter-1 Gain comprehensive knowledge of the open-source Hadoop ecosystem, evaluate the major
distributions, and acquire hands-on experience with key components for building big data solutions.
Chapter-2 Explore Apache Ambari's functions, manage Hadoop clusters, understand HDFS, and execute
MapReduce and YARN jobs for efficient cluster operation.
Chapter-3 Learn Apache Spark principles, RDD usage, various data file formats, NoSQL data stores, Pig, Hive,
ZooKeeper, Apache Slider, and data loading techniques using Sqoop and Flume in the Hadoop environment.
SUGGESTIVE READINGS
TEXT BOOKS:
T1: Data Science Handbook, O'Reilly (2016).
T2: Hadoop: The Definitive Guide, Tom White.
T3: Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger
and Kenneth Cukier.
T4: Data Engineering Teams, Mike Barlow.
REFERENCE BOOKS:
R1: Big Data: Principles and Best Practices of Scalable Realtime Data Systems, Nathan Marz and James Warren.
R2: Big Data: A Very Short Introduction, Dawn E. Holmes.
R3: Mastering Apache Spark 2.x: Scalable Machine Learning and Big Data Analytics, Romeo Kienzler.
R4: Big Data SMACK: A Guide to Apache Spark, Mesos, Akka, Cassandra, and Kafka, Raul Estrada and Isaac Ruiz.
Learning Outcome of this lecture
Unit: I
Name: Introduction to Big Data Analytics
Outcome:
• Big Data introduction
• Hadoop ecosystem
• Apache Ambari and cluster management
• Understand the fundamentals of Big Data and Data Analytics, including the
components of Hortonworks Data Platform (HDP) and the significance of
Apache Hadoop. (CO1)
• Explain the process of loading, visualizing, and pre-processing data models
in the context of Big Data and Analytics. (CO2)
APEX INSTITUTE OF TECHNOLOGY COMPUTER SCIENCE AND ENGINEERING
Introduction to YARN
• YARN = Yet Another Resource Negotiator
• Introduced in Hadoop 2.x
• Separates cluster resource management from per-application job scheduling and monitoring (in Hadoop 1.x the JobTracker did both)
Why YARN?
• Overcomes limitations of Hadoop 1.x MapReduce (single JobTracker bottleneck, fixed map and reduce slots)
• Supports multiple processing engines, not only MapReduce
• Improves resource utilization and flexibility
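The utilization point can be illustrated with a toy calculation. This is only a sketch: the slot counts, the task counts, and the assumption that one container is the same size as one slot are illustrative choices, not figures from the lecture.

```python
# Toy comparison: fixed map/reduce slots (Hadoop 1.x) vs. generic containers (YARN).
# All numbers below are illustrative assumptions, not measurements.

def fixed_slot_running_tasks(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Hadoop 1.x style: map tasks may only use map slots, reduce tasks only reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def container_running_tasks(total_containers, pending_maps, pending_reduces):
    """YARN style: any free container can host any kind of task."""
    return min(total_containers, pending_maps + pending_reduces)


map_slots, reduce_slots = 8, 4          # hypothetical node configuration
pending_maps, pending_reduces = 12, 0   # a map-heavy phase of a job
total = map_slots + reduce_slots

fixed = fixed_slot_running_tasks(map_slots, reduce_slots, pending_maps, pending_reduces)
generic = container_running_tasks(total, pending_maps, pending_reduces)

print(f"Fixed slots: {fixed} tasks running, {total - fixed} slots idle")
print(f"Containers:  {generic} tasks running, {total - generic} containers idle")
```

In this map-heavy phase the four reduce slots sit idle under the fixed-slot model, while generic containers let all twelve tasks run at once.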
YARN Architecture Overview
• ResourceManager (RM)
• NodeManager (NM)
• ApplicationMaster (AM)
• Container
YARN Components
1. ResourceManager – Global resource scheduler
2. NodeManager – Manages containers on each node
3. ApplicationMaster – Manages execution of an application
4. Container – Executes a task
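The four components can be pictured with a small data model. This is a teaching sketch in plain Python, not the Hadoop API: the class fields, the `can_fit`, `allocate`, and `request` methods, the first-fit placement, and the node sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Container:
    """A bundle of resources (memory, vcores) on one node in which a task runs."""
    container_id: int
    host: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Per-node agent: tracks the node's capacity and the containers placed on it."""
    host: str
    memory_mb: int
    vcores: int
    containers: List[Container] = field(default_factory=list)

    def can_fit(self, memory_mb: int, vcores: int) -> bool:
        used_mem = sum(c.memory_mb for c in self.containers)
        used_cores = sum(c.vcores for c in self.containers)
        return (used_mem + memory_mb <= self.memory_mb
                and used_cores + vcores <= self.vcores)


@dataclass
class ResourceManager:
    """Cluster-wide scheduler: decides which node hosts each requested container."""
    nodes: List[NodeManager]
    next_id: int = 1

    def allocate(self, memory_mb: int, vcores: int) -> Optional[Container]:
        # First-fit placement, chosen here only for simplicity (an assumption).
        for node in self.nodes:
            if node.can_fit(memory_mb, vcores):
                container = Container(self.next_id, node.host, memory_mb, vcores)
                self.next_id += 1
                node.containers.append(container)
                return container
        return None  # no node has enough free resources


@dataclass
class ApplicationMaster:
    """Per-application coordinator: asks the RM for containers for its tasks."""
    app_name: str
    rm: ResourceManager
    granted: List[Container] = field(default_factory=list)

    def request(self, count: int, memory_mb: int, vcores: int) -> int:
        for _ in range(count):
            container = self.rm.allocate(memory_mb, vcores)
            if container is None:
                break  # cluster is full; stop asking
            self.granted.append(container)
        return len(self.granted)


# Hypothetical two-node cluster, 4 GB and 4 vcores per node.
rm = ResourceManager([NodeManager("node1", 4096, 4), NodeManager("node2", 4096, 4)])
am = ApplicationMaster("wordcount", rm)
granted = am.request(count=5, memory_mb=1024, vcores=1)
print(granted, [c.host for c in am.granted])
```

With these sizes, four containers fit on node1 and the fifth spills over to node2.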
YARN Workflow
1. User submits a job to the ResourceManager
2. RM assigns a container in which the ApplicationMaster starts
3. AM requests resources (containers) from the RM
4. NMs launch the containers to run tasks
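The four steps can be traced with a short simulation. This is a simplified sketch, not real YARN behaviour: the `run_job` function, the node names, and the "one slot equals one container" model are hypothetical.

```python
def run_job(app_name, num_tasks, free_slots_per_node):
    """Return a text trace of the four workflow steps for one toy job."""
    free = dict(free_slots_per_node)  # copy so the caller's dict is untouched
    trace = [f"1. client submits {app_name} to RM"]

    # Step 2: the RM picks the first node with free capacity for the AM.
    am_node = next((node for node, slots in free.items() if slots > 0), None)
    if am_node is None:
        raise RuntimeError("no free capacity to start the ApplicationMaster")
    free[am_node] -= 1
    trace.append(f"2. RM assigns container on {am_node} for AM")

    # Step 3: the AM asks the RM for one container per task.
    trace.append(f"3. AM requests {num_tasks} containers from RM")

    # Step 4: NodeManagers launch the granted containers while capacity lasts.
    launched = 0
    for node in free:
        while free[node] > 0 and launched < num_tasks:
            free[node] -= 1
            launched += 1
            trace.append(f"4. NM on {node} launches container for task {launched}")
    return trace


trace = run_job("wordcount", 3, {"node1": 2, "node2": 2})
for line in trace:
    print(line)
```

The AM takes one slot on node1, so one task runs beside it and the remaining two tasks run on node2.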
Benefits of YARN
• Better scalability and resource utilization
• Supports multiple frameworks
• Fault tolerance and isolation
YARN Use Cases
• Big Data (Spark, Hive, Pig)
• Machine Learning
• Real-time streaming (Storm, Flink)
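As one example of a framework running on YARN, a Spark application is usually submitted with `spark-submit --master yarn`. The sketch below only builds the command line and does not run it. The helper name `spark_submit_command`, the script name `wordcount.py`, and the executor sizes are assumptions for illustration.

```python
import shlex


def spark_submit_command(app_path, num_executors=2, executor_memory="2g", executor_cores=2):
    """Build a spark-submit argument list that targets a YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",              # let YARN allocate the containers
        "--deploy-mode", "cluster",      # the Spark driver also runs inside the cluster
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
        app_path,
    ]


cmd = spark_submit_command("wordcount.py")  # hypothetical application script
print(shlex.join(cmd))
```

Each Spark executor requested this way runs inside a YARN container.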
Conclusion
• Flexible, scalable resource management
• Backbone of the modern Hadoop ecosystem
Summary of the lecture
• Learn about the basics of Big Data
• Learn about the nature of the architecture
• Learn about YARN
Questions of this lecture
• Define the Hadoop ecosystem
• Define the role of YARN
• Define the role of analytics
Big Data Computing
• https://archive.nptel.ac.in/courses/106/104/106104189/
E-Resources
1. https://www.geeksforgeeks.org/hadoop-an-introduction/
2. https://cloud.google.com/learn/what-is-hadoop
3. https://aws.amazon.com/what-is/hadoop/
4. https://www.tableau.com/learn/articles/big-data-hadoop-explained
REFERENCES
1. Kumar, Abhay & Jothimani, Dhanya. (2017). Big Data:
Challenges, Opportunities and Realities.
10.48550/arXiv.1705.04928.
2. Beakta, Rahul. (2015). Big Data And Hadoop: A Review
Paper. international journal of computer science &
information te.
Class Session Review
Thank You
For queries
Email: [email protected]