
QB - Updated 1

The document outlines the syllabus for the IT4651 Big Data Analytics course for the academic year 2024-2025, detailing five units covering topics such as the introduction to big data, data analysis techniques, big data file systems, mining data streams, and big data models. It includes course outcomes, mapping of course outcomes to program outcomes, and references to textbooks and additional reading materials. The course aims to equip students with fundamental concepts, tools, and practices related to big data and analytics.


IT4651 – Big Data Analytics Department of CSE 2024-2025

IT4651 BIG DATA ANALYTICS        L T P C
(Common to IT & CSE)             3 0 0 3
UNIT – I INTRODUCTION TO BIG DATA
Defining Big Data – 5V’s of Big Data – Traditional Vs Big Data Systems – Big Data Applications –
Risks of Big Data – Structure of Big Data – Big Data Use Cases – Understanding Big Data Storage –
Evolution of Big Data – Big Data Technologies – Data Analytics Lifecycle – Data Analytics Lifecycle
Overview – Discovery – Data Preparation.
UNIT – II DATA ANALYSIS
Overview of Clustering - K-means - Use Cases - Overview of the Method - Determining the Number
of Clusters. - Classification: Decision Trees - Overview of a Decision Tree - The General Algorithm -
Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R - Naïve Bayes – Bayes
Theorem - Naïve Bayes Classifier.
UNIT - III BIG DATA FILE SYSTEM
Google File System (GFS) – Distributed File Systems – Large-Scale File System Organization –
Hadoop Ecosystem – Hadoop Distributed File System (HDFS) Concepts – HDFS Architecture – HDFS
Commands – Hadoop MapReduce – MapReduce Programming Model – Hadoop YARN – Case Studies –
Word Count Program.
UNIT - IV MINING DATA STREAMS
Streams Concepts – Stream Data Model and Architecture – Sampling Data in a Stream – Filtering
Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a
Window – Decaying Window – Real Time Analytics Platform (RTAP) Applications – Case Studies –
Real Time Sentiment Analysis, Stock Market Predictions.
UNIT - V BIGDATA MODELS
Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase
Clients – Examples – Pig Data Model – Hive – Data Types and File Formats – HiveQL Data
Definition – HiveQL Data Manipulation – HiveQL Queries
Total Periods:45
TEXT BOOKS:
1. Bill Franks, "Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with
Advanced Analytics", Wiley and SAS Business Series, 2012.
2. David Loshin, "Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools,
Techniques, NoSQL, and Graph", Morgan Kaufmann/Elsevier Publishers, 2013.
REFERENCE BOOKS:
1. Michael Berthold, David J. Hand, "Intelligent Data Analysis", Springer, Second Edition, 2007.
2. Michael Minelli, Michele Chambers, and Ambiga Dhiraj, "Big Data, Big Analytics: Emerging Business
Intelligence and Analytic Trends for Today's Businesses", Wiley, 2013.

COURSE OUTCOMES
On successful completion of this course, the student will be able to:
C310.1 Know the fundamental concepts of big data and analytics.
C310.2 Explore tools of big data and analytics.
C310.3 Know the practices for working with big data.
C310.4 Learn about stream computing.
C310.5 Know about the research that requires the integration of large amounts of data.
St. Joseph’s Institute of Technology

MAPPING BETWEEN CO AND PO, PSO WITH CORRELATION LEVEL 1/2/3


IT4651   PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3 PSO4
C309.1   1 1 1 - - - - - - - - - 1 - - 1 1
C309.2   1 1 - 1 - 2 - - - - - - - 1 2 - 1
C309.3   1 1 - 1 - - - - - - - - 1 1 - 1 1
C309.4   - - 2 1 - - - - - - - - - - - - -
C309.5   - - - - - - - - - - - - - 1 - - -

RELATION BETWEEN COURSE CONTENT WITH COs

UNIT I - INTRODUCTION TO BIG DATA


S.No  Topic                                                         Knowledge level  Books Referred  Course Outcomes
1.    Defining Big Data – 5V’s of Big Data                          BL1              T1              C310.1
2.    Traditional Vs Big Data Systems - Big Data Applications       BL1,BL3          T1              C310.1
3.    Risks of Big Data                                             BL1,BL4          T1              C310.1
4.    Structure of Big Data - Big Data Use Cases                    BL1              T1              C310.1
5.    Understanding Big Data Storage                                BL4,BL5          T1              C310.1
6.    Evolution of Big Data - Big Data Technologies                 BL2,BL4          T1              C310.1
7.    Data Analytics Lifecycle - Data analytics lifecycle overview  BL4              T1              C310.1
8.    Discovery                                                     BL2              T1              C310.1
9.    Data Preparation                                              BL1,BL2,BL6      T1              C310.1

UNIT II - DATA ANALYSIS


S.No  Topic                                                Knowledge level  Books Referred  Course Outcomes
1.    Overview of Clustering - K-means                     BL2              T1              C310.2
2.    Use Cases - Overview of the Method                   BL1              T1              C310.2
3.    Determining the Number of Clusters                   BL1,BL4          T1              C310.2
4.    Classification: Decision Trees                       BL1              T1              C310.2
5.    Overview of a Decision Tree - The General Algorithm  BL4              T1              C310.2
6.    Decision Tree Algorithms                             BL2,BL5          T1              C310.2
7.    Evaluating a Decision Tree                           BL4              T1              C310.2
8.    Decision Trees in R - Naïve Bayes                    BL2              T1              C310.2
9.    Bayes Theorem - Naïve Bayes Classifier               BL2,BL6          T1              C310.2

UNIT III - BIG DATA FILE SYSTEM


S.No  Topic                                                            Knowledge level  Books Referred  Course Outcomes
1.    Google File System (GFS)                                         BL1              T1              C310.3
2.    Distributed File Systems - Large-Scale File System Organization  BL3              T1              C310.3
3.    Hadoop Ecosystem                                                 BL3,BL4          T1              C310.3
4.    Hadoop Distributed File System (HDFS) concepts                   BL1              T1              C310.3
5.    HDFS Architecture                                                BL5              T1              C310.3
6.    HDFS Commands - Hadoop MapReduce                                 BL2,BL4          T1              C310.3
7.    MapReduce Programming Model                                      BL4              T1              C310.3
8.    Hadoop YARN                                                      BL5              T1              C310.3
9.    Case Studies - Word count program                                BL1,BL5,BL6      T1              C310.3
UNIT IV - MINING DATA STREAMS
S.No  Topic                                                               Knowledge level  Books Referred  Course Outcomes
1.    Streams Concepts – Stream Data Model and Architecture               BL4              T1              C310.4
2.    Sampling Data in a Stream                                           BL3              T1              C310.4
3.    Filtering Streams                                                   BL1,BL4          T1              C310.4
4.    Counting Distinct Elements in a Stream                              BL3              T1              C310.4
5.    Estimating moments                                                  BL4              T1              C310.4
6.    Counting Ones in a Window                                           BL2,BL4          T1              C310.4
7.    Decaying Window – Real time Analytics Platform (RTAP) applications  BL5              T1              C310.4
8.    Case Studies - Real Time Sentiment Analysis                         BL3              T1              C310.4
9.    Stock Market Predictions                                            BL1,BL5,BL6      T1              C310.4

UNIT V - BIGDATA MODELS
S.No  Topic                                      Knowledge level  Books Referred  Course Outcomes
1.    Introduction to NoSQL                      BL2              T1              C310.5
2.    Aggregate Data Models                      BL3              T1              C310.5
3.    HBase                                      BL3,BL4          T1              C310.5
4.    Data Model and Implementations             BL5              T1              C310.5
5.    HBase Clients – Examples                   BL4              T1              C310.5
6.    Pig Data Model                             BL2,BL3          T1              C310.5
7.    Hive – Data Types and File Formats         BL4              T1              C310.5
8.    HiveQL Data Definition                     BL4              T1              C310.5
9.    HiveQL Data Manipulation – HiveQL Queries  BL4              T1              C310.5
BL1 – Remembering, BL2 – Understanding, BL3 – Applying, BL4 – Analyzing, BL5 – Evaluating,
BL6 – Creating

(A) PROGRAM OUTCOMES (POs)


Engineering graduates will be able to:
1. Engineering Knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate consideration
for the public health and safety, and the cultural, societal, and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with
an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts and demonstrate the knowledge of, and need for
sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive clear
instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
12. Life-Long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

(B) PROGRAM EDUCATIONAL OBJECTIVES (PEOs)


1. To Build next generation of highly skilled graduates with a strong knowledge in Artificial Intelligence
and Data Science to contribute and innovate new technologies for societal needs
2. To Create Engineers to promote collaborative learning and to exhibit their employability skills and
practise the ethics of their profession through innovation or entrepreneurship.
3. To Pursue graduate studies in the field of Data Science and to be committed in lifelong research
towards social, political and technical issues.
4. To Exhibit innovative thoughts in Engineering, Problem Solving and Critical Thinking skills to excel
in interdisciplinary domains.
(C) PROGRAM SPECIFIC OBJECTIVES (PSOs)
1. To understand, analyze and apply the AI based efficient domain specific processes for problem-
solving, inference, perception, knowledge representation and learning to design computer based systems
for varying complexity.
2. To implement search algorithms, neural networks, machine learning and data analytics to create
innovative solutions from idea to product for successful career and entrepreneurship.
3. To develop intelligent solutions and project development skills using Data Science technologies to
cater to the societal needs.
4. To provide a concrete foundation and enrich their abilities to qualify for Employment, Higher Studies
and Research in Artificial Intelligence and Data Science with ethical values.

UNIT I- INTRODUCTION TO BIG DATA


Defining Big Data – 5V’s of Big Data – Traditional Vs Big Data Systems – Big Data Applications – Risks of
Big Data – Structure of Big Data – Big Data Use Cases – Understanding Big Data Storage – Evolution of Big
Data – Big Data Technologies – Data Analytics Lifecycle – Data Analytics Lifecycle Overview – Discovery –
Data Preparation.
UNIT-I / PART-A CO BL
1. What is big data? What are the characteristics of big data? (NOV 2023) C310.1 BL1
2. How does big data analysis help businesses increase their revenue? Give an example. C310.1 BL2
3. What are the risks of big data? C310.1 BL1
4. What is the structure of big data? C310.1 BL1
5. How can big data be filtered effectively? C310.1 BL2
6. What does web data reveal, and what actions are performed on web data? C310.1 BL1
7. What is Data Discovery? C310.1 BL1
8. List the Data Discovery Use cases. C310.1 BL1
9. Define Data Profiling C310.1 BL1
10. Define Data Visualization. C310.1 BL1
11. List the categories of Big data applications. C310.1 BL1
12. Define Big data analytics. (MAY 2019) C310.1 BL1
13. List the categories of Big data applications. C310.1 BL1
14. What are the benefits offered by Big data to an organization in increasing its value? C310.1 BL1
15. Differentiate between Manual Data Discovery and Smart Data Discovery. C310.1 BL4
16. Define Data Scalability. C310.1 BL1
17. Define Data Integrity. C310.1 BL1
18. Explain the phases of the Data Analytics Lifecycle. (MAY 2024) C310.1 BL2
19. Define Data Modeling. C310.1 BL1
20. What is meant by Data Munging? C310.1 BL1
21. Compare Traditional Data and Big Data C310.1 BL4
22. What is big data storage? C310.1 BL1
23. How is Apache Spark used for big data? C310.1 BL2
24. What is Data lake? C310.1 BL1
25. Justify the need for DFS for big data analytics. (NOV 2023) C310.1 BL1
26. List the main characteristics of Big data. (MAY 2021) C310.1 BL1
UNIT-I / PART-B
1. Explain about the challenges in conventional systems. C310.1 BL2
2. List and explain some modern data analytic tools. C310.1 BL1
3. Explain the evolution of Analytics tools and methods. C310.1 BL2
4. Explain the evolution of the Analytics Process. C310.1 BL2
5. Discuss the techniques to handle the increase in data that cannot be handled by a traditional database system. C310.1 BL4
6. What are the 7 essential steps of data analysis? C310.1 BL1
7. What are the issues and challenges related to storage and transport in big data? C310.1 BL1
8. List the steps involved in data preparation and explain the use cases. C310.1 BL1
9. Discuss in detail the characteristics of Big data applications. (MAY 2021) Explain the role of big data analytics in the following: C310.1 BL2
   1. Credit card fraud detection.
   2. Clustering and data segmentation.
   3. Recommendation engines.
   4. Price modeling.
10. Explain some popular big data technologies. C310.1 BL2
11. Categorize Big data use cases with applications. C310.1 BL2
UNIT-I / PART-C
1. With the help of a neat diagram, explain the organization of resources in a Big data platform. C310.1 BL4
2. Illustrate why data discovery is important, and explain the steps for data discovery and the common challenges in data discovery. C310.1 BL3
3. Context: A small online retail business is starting to grow rapidly and expects to deal with increasing amounts of customer, inventory, and sales data. They are considering adopting Big Data solutions to improve their operations.
Question: How would you introduce Big Data to this business? Design an initial strategy that outlines the potential benefits and challenges of adopting Big Data analytics for inventory management, customer insights, and personalized marketing. C310.1 BL6
4. Context: A city is launching a Smart City initiative, collecting data from traffic sensors, environmental monitors, public transportation, and citizen feedback. The city aims to improve urban planning, traffic management, and public services by utilizing Big Data.
Question: How would you introduce Big Data technologies to support a Smart City initiative? Design a system that integrates data from various sources and explain how this data could be used to optimize city management, reduce congestion, and improve public health. C310.1 BL6
UNIT II – DATA ANALYSIS
Overview of Clustering - K-means - Use Cases - Overview of the Method - Determining the Number of
Clusters. - Classification: Decision Trees - Overview of a Decision Tree - The General Algorithm -
Decision Tree Algorithms - Evaluating a Decision Tree - Decision Trees in R - Naïve Bayes – Bayes
Theorem - Naïve Bayes Classifier.
UNIT-II / PART-A
1. Explain data clustering with real-time examples. C310.2 BL2
2. Describe the K-Means algorithm in brief. C310.2 BL1
3. What are the key advantages and limitations of K-means clustering? C310.2 BL1
4. Explain the elbow method with its key principle in determining optimal clusters. C310.2 BL2
5. List two validation metrics used to evaluate the quality of clustering with different k values. C310.2 BL4
6. Explain how to determine the optimal number of clusters in K-means clustering. C310.2 BL2
7. Differentiate Classification and Clustering. (Nov/Dec 2023) C310.2 BL1
8. What metrics are used to evaluate a Decision Tree model? C310.2 BL1
9. Explain the significance of entropy and information gain in Decision Trees. C310.2 BL2
10. What is classification in machine learning? Give two examples. C310.2 BL1
11. Differentiate between binary and multi-class classification with examples. C310.2
12. List and explain any two evaluation metrics used in classification. C310.2
13. What is a Decision Tree and how does it make predictions? C310.2 BL1
14. List any two popular decision tree algorithms and their key differences. C310.2 BL1
15. Explain how pruning helps in improving Decision Tree performance. C310.2 BL2
16. State how a Decision Tree handles missing values in the dataset. C310.2 BL2
17. What is Bayes' Theorem? Write its formula. (MAY 2021) (MAY 2024) C310.2 BL1
18. Differentiate between Bayes' Theorem and the Naive Bayes classifier. C310.2 BL4
19. What is the role of prior probability in Naive Bayes classification? C310.2 BL1
20. Define likelihood in the context of the Naive Bayes classifier. C310.2 BL1
21. What is the application of clustering in the medical domain? (MAY 2021) C310.2 BL4
22. Why is the assumption of feature independence important in the Naïve Bayes classifier? C310.2 BL1
23. State the advantages of using Naïve Bayes for classification. C310.2 BL2
24. Where is Naive Bayes classification commonly used? Give two applications. C310.2 BL1
25. What is silhouette analysis? C310.2 BL1
26. State common evaluation metrics for Classification. C310.2 BL2
27. State common evaluation metrics for Clustering. C310.2 BL2
28. State the difference between classification and clustering with real-time examples. C310.2 BL2
29. What role does Laplace smoothing play in Naive Bayes? C310.2 BL1
30. Define precision and recall in classification problems. C310.2 BL1
UNIT-II / PART-B
1. Explain the K-Means algorithm with an example. (Nov/Dec 2023) (MAY 2024) C310.2 BL2
2. Construct K-Means clusters for the given data points: (2,10), (2,6), (11,11), (6,9), (6,4), (1,2), (5,10), (4,9), (10,12), (7,5), (9,11), (4,6), (3,10), (3,8), (6,11). C310.2 BL5
3. Explain the working of the decision tree, the assumptions, splitting criteria and significance of pruning. (MAY 2021) (MAY 2024) (Nov/Dec 2023) C310.2 BL2
4. Compare and contrast classification and clustering algorithms with respect to data, model and evaluation criteria along with suitable examples. C310.2 BL4
5. Build a Decision Tree for the given data using the ID3 Algorithm. C310.2 BL5
DATA WEATHER:
D1   Sunny     Hot   High    Weak    No
D2   Sunny     Hot   High    Strong  No
D3   Overcast  Hot   High    Weak    Yes
D4   Rain      Mild  High    Weak    Yes
D5   Rain      Cool  Normal  Weak    Yes
D6   Rain      Cool  Normal  Strong  No
D7   Overcast  Cool  Normal  Strong  Yes
D8   Sunny     Mild  High    Weak    No
D9   Sunny     Cool  Normal  Weak    Yes
D10  Rain      Mild  Normal  Weak    Yes
D11  Sunny     Mild  Normal  Strong  Yes
D12  Overcast  Mild  High    Strong  Yes
D13  Overcast  Hot   Normal  Weak    Yes
D14  Rain      Mild  High    Strong  No
6. Build a Decision Tree for the above data (DATA WEATHER) using the CART Algorithm. C310.2 BL5
7. Examine the working of Naive Bayes classification, its assumptions, advantages, limitations and applications. (MAY 2024) C310.2 BL3
8. Explain Bayes Theorem and find P(Yes|Sunny) using the Naive Bayes classifier. C310.2 BL2
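Questions 5 and 8 above can be checked with a short script. The following is a minimal Python sketch, not a model answer: it assumes the five unlabeled columns of DATA WEATHER are Outlook, Temperature, Humidity, Wind and the class label (the names are an assumption, since the question lists only values). It computes the information gain ID3 would use to choose the root, and P(Yes|Sunny) directly from Bayes' theorem.

```python
from collections import Counter
from math import log2

# Column names are assumed; the question lists only the values.
COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind")

# Rows D1..D14 copied from DATA WEATHER; the last field is the class label.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr):
    """Entropy reduction obtained by splitting rows on column index attr."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row[-1] for row in rows if row[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def posterior(rows, attr, value, label):
    """P(label | attribute = value) = P(value | label) * P(label) / P(value)."""
    with_label = [row for row in rows if row[-1] == label]
    prior = len(with_label) / len(rows)
    likelihood = sum(row[attr] == value for row in with_label) / len(with_label)
    evidence = sum(row[attr] == value for row in rows) / len(rows)
    return likelihood * prior / evidence


GAINS = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
ROOT = max(GAINS, key=GAINS.get)  # attribute ID3 would split on first
P_YES_GIVEN_SUNNY = posterior(ROWS, 0, "Sunny", "Yes")
```

On this table the first column has the largest gain (about 0.247), and P(Yes|Sunny) works out to (2/9)(9/14)/(5/14) = 0.4.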
UNIT-II / PART-C
1. “Out of 500 bananas, 400 are long, 350 are sweet and 450 are yellow. Out of 300 oranges, none are long, 150 are sweet and 300 are yellow. Out of the remaining 200 fruit, 100 are long, 150 are sweet and 50 are yellow.” Determine the fruit with the highest probability of satisfying the parameters long, sweet and yellow, using the Naive Bayes classification algorithm. C310.2 BL4
2. Compare and contrast various evaluation metrics for Classification and Clustering algorithms. C310.2 BL4
3. Propose a data analysis strategy for optimizing traffic flow in a busy city. How would you analyze real-time and historical data to identify congestion points, and what predictive models would you use to forecast traffic conditions and recommend optimal routes for drivers? C310.2 BL6
4. Propose a data analysis approach to perform sentiment analysis on social media data. How would you preprocess the text data, what techniques would you use to extract insights from it, and how would you visualize the sentiment trends to guide the company's marketing and customer service teams? C310.2 BL6
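The fruit problem in question 1 of this part can be worked through numerically. The sketch below is one possible approach in Python, assuming plain Naive Bayes with no Laplace smoothing; the dictionary layout and the class name "other" for the remaining 200 fruit are illustrative choices, not part of the question.

```python
# Counts copied from the question: (total, long, sweet, yellow) per class.
# The label "other" for the remaining 200 fruit is an assumed name.
COUNTS = {
    "banana": (500, 400, 350, 450),
    "orange": (300, 0, 150, 300),
    "other": (200, 100, 150, 50),
}


def naive_bayes_scores(counts):
    """Return P(class) * product of P(feature | class) for each class.

    The shared denominator P(long, sweet, yellow) is omitted because it
    does not change which class scores highest.
    """
    grand_total = sum(row[0] for row in counts.values())
    scores = {}
    for fruit, (total, *feature_counts) in counts.items():
        score = total / grand_total          # prior P(class)
        for count in feature_counts:
            score *= count / total           # likelihood P(feature | class)
        scores[fruit] = score
    return scores


SCORES = naive_bayes_scores(COUNTS)
BEST = max(SCORES, key=SCORES.get)
```

Banana scores 0.5 × 0.8 × 0.7 × 0.9 = 0.252, the remaining fruit score 0.2 × 0.5 × 0.75 × 0.25 = 0.01875, and orange scores 0 because no orange is long. That zero is exactly the situation Laplace smoothing is meant to soften.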
UNIT III - BIG DATA FILE SYSTEM
Google File System (GFS) – Distributed File Systems – Large-Scale File System Organization – Hadoop
Ecosystem – Hadoop Distributed File System (HDFS) Concepts – HDFS Architecture – HDFS Commands –
Hadoop MapReduce – MapReduce Programming Model – Hadoop YARN – Case Studies – Word Count
Program.
UNIT-III / PART-A
1. What is the main purpose of the Google File System (GFS)? C310.3 BL1
2. How does Google File System handle fault tolerance? C310.3 BL1
3. What is DFS? C310.3 BL1
4. Name any two examples of Distributed File Systems. C310.3 BL1
5. What are Hadoop pipes? (Nov/Dec 2021) C310.3 BL1
6. List the modules of Hadoop. C310.3 BL2
7. State the advantages of Hadoop. C310.3 BL1
8. What is the Hadoop distributed file system? (Nov/Dec 2021) C310.3 BL1
9. Mention the contents of metadata maintained by the namenode. C310.3 BL1
10. List out the uses of HDFS. C310.3 BL2
11. What is the need for the MapReduce function? (Apr/May 2022) C310.3 BL1
12. List the limitations of the MapReduce model. C310.3 BL2
13. List the five basic operations of the MapReduce programming model. C310.3 BL2
14. Define Map function. C310.3 BL1
15. What is the need of the Reduce function? C310.3 BL1
16. How will the limitations of MapReduce be overcome in future versions of Hadoop? C310.3 BL1
17. Differentiate between JobTracker and TaskTracker. C310.3 BL2
18. Mention the general form of map and reduce functions in Hadoop MapReduce. C310.3 BL1
19. What is YARN? (Apr/May 2022) C310.3 BL1
20. What is a YARN scheduler? C310.3 BL1
21. List the major responsibilities of YARN. (Nov/Dec 2022) C310.3 BL2
22. What is the purpose of the scheduler in the resource manager of YARN architecture? C310.3 BL1
23. What are the 4 components of Hadoop architecture? C310.3 BL1
24. Differentiate HDFS and MapReduce. C310.3 BL2
25. What is a unit test in MapReduce? C310.3 BL1
26. List the mapper and reducer formulas for matrix multiplication. (Nov/Dec 2023) C310.3 BL2
27. Define MapReduce workflow in the context of data processing. (Nov/Dec 2023) C310.3 BL1
28. What is the primary role of YARN in a Hadoop ecosystem? (Nov/Dec 2023) C310.3 BL1
29. In the context of Hadoop, what is the purpose of Hadoop pipes? (Nov/Dec 2023) C310.3 BL1
30. Why is ensuring data integrity crucial in Hadoop distributed systems? (Nov/Dec 2023) C310.3 BL1
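For question 26, the usual one-pass formulation maps each element m_ij of M to the key (i, k) for every column k of N, maps each element n_jk of N to the key (i, k) for every row i of M, and has the reducer sum m_ij · n_jk over the shared index j. The Python sketch below simulates that in memory; the function names and the sparse dictionary representation are assumptions made for illustration, and no Hadoop API is involved.

```python
from collections import defaultdict


def map_matrix(name, entries, other_dim):
    """Emit ((i, k), (name, j, value)) pairs for one matrix.

    entries maps (row, col) -> value. For M, other_dim is the number of
    columns of N; for N, it is the number of rows of M.
    """
    for (row, col), value in entries.items():
        if name == "M":                      # m_ij goes to every cell (i, k)
            for k in range(other_dim):
                yield (row, k), ("M", col, value)
        else:                                # n_jk goes to every cell (i, k)
            for i in range(other_dim):
                yield (i, col), ("N", row, value)


def reduce_cell(values):
    """Sum m_ij * n_jk over the shared index j for one output cell."""
    m = {j: v for tag, j, v in values if tag == "M"}
    n = {j: v for tag, j, v in values if tag == "N"}
    return sum(m[j] * n[j] for j in m.keys() & n.keys())


def multiply(m_entries, n_entries, rows_m, cols_n):
    """Run map, shuffle (group by key) and reduce for P = M x N."""
    groups = defaultdict(list)
    pairs = list(map_matrix("M", m_entries, cols_n))
    pairs += list(map_matrix("N", n_entries, rows_m))
    for key, value in pairs:                 # shuffle: group values by (i, k)
        groups[key].append(value)
    return {key: reduce_cell(values) for key, values in groups.items()}


# Small 2x2 example with hypothetical values.
M = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
N = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
PRODUCT = multiply(M, N, rows_m=2, cols_n=2)
```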
UNIT III / PART B
1. Explain the architecture of the Hadoop ecosystem. (Nov/Dec 2021) C310.3 BL3
2. With the help of a neat sketch, explain in detail about Hadoop streaming. C310.3 BL3
3. Briefly explain about the Hadoop distributed file system with a neat diagram. (Nov/Dec 2023) C310.3 BL3
4. Elaborate the impact of seamless Hadoop integration on enhancing data processing and analytics. (Nov/Dec 2023) C310.3 BL3
5. Explain the components involved in the anatomy of a MapReduce job run. (Nov/Dec 2023) C310.3 BL3
6. Discuss a MapReduce program to find the number of words in a text file. C310.3 BL3
7. Explain the MapReduce algorithm and its workflow with a suitable example. (Nov/Dec 2023) C310.3 BL3
8. Explain in detail about YARN architecture. (Nov/Dec 2021) C310.3 BL3
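The word count program in question 6 follows the map, shuffle and reduce phases of the MapReduce programming model. The sketch below imitates those three phases in plain Python so the data flow is visible; it is an in-memory illustration under the assumption that words are separated by whitespace and compared case-insensitively, not Hadoop code.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the list of ones collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Chain the three phases over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical two-line input file.
COUNTS = word_count(["big data big", "Data stream"])
```

In a real Hadoop job the mapper and reducer run on different nodes and the shuffle is performed by the framework; here all three are ordinary function calls.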

UNIT-III / PART-C C310.3

1. Discuss the functions of the job tracker and task tracker with a real-time scenario. (Apr/May 2022) C310.3 BL3
2. Explain a) Data integrity in HDFS b) Hadoop local file system. (Apr/May 2022) C310.3 BL3
3. Propose a Big Data file system solution for the e-commerce platform that can efficiently handle large volumes of transactional and unstructured data. Discuss how Hadoop HDFS (Hadoop Distributed File System) or other distributed file systems can be used to store and process this data, and explain how fault tolerance, scalability, and high availability are managed. C310.3 BL6
4. Design a Big Data file system solution that can integrate data from multiple marketing platforms. How would you use a distributed file system like HDFS or cloud storage to store and process this data, ensuring it is easy to query and access for reporting? What steps would you take to ensure the system is scalable as the volume of data increases over time? C310.3 BL6

UNIT IV - MINING DATA STREAMS


Streams Concepts – Stream Data Model and Architecture – Sampling Data in a Stream – Filtering Streams –
Counting Distinct Elements in a Stream – Estimating Moments – Counting Ones in a Window –
Decaying Window – Real Time Analytics Platform (RTAP) Applications – Case Studies – Real Time
Sentiment Analysis, Stock Market Predictions.
UNIT IV /PART A
1. What is stream data? C310.4 BL1

2. List two characteristics of stream data. C310.4 BL2

3. Explain the difference between classical data mining and data stream mining. (May 2024) C310.4 BL3

4. Define windowing in the context of stream data. C310.4 BL1

5. Differentiate between stateful and stateless stream processing. C310.4 BL2

6. What is a sliding window in stream processing? C310.4 BL1

7. What is filtering in stream data processing? C310.4 BL1

8. Name two probabilistic algorithms used for counting distinct elements in streams. C310.4 BL1

9. What does counting ones in a window mean? C310.4 BL1

10. What is a decaying window? Why use a decaying window in data streams? (NOV 2023) (MAY 2021) C310.4 BL1
11. Why are moments estimated in data streams? C310.4 BL1

12. What is a Real-Time Analytics Platform (RTAP)? (MAY 2024) C310.4 BL1

13. Name two applications of Real-Time Analytics Platforms. C310.4 BL1

14. Why are Real-Time Analytics Platforms used in IoT applications? C310.4 BL1

15. How does RTAP benefit e-commerce? C310.4 BL1

16. Give an example of RTAP in financial services. C310.4 BL1

17. What is real-time stock market prediction? C310.4 BL1

18. Why is Real-Time Sentiment Analysis important for businesses? C310.4 BL1

19. How does Real-Time Sentiment Analysis work? C310.4 BL1

20. List one benefit of real-time stock market prediction. C310.4 BL1

21. What types of data are used in real-time stock market prediction? C310.4 BL1

22. Name one technique used in real-time stock market prediction. C310.4 BL1

23. What is sketching in data stream mining? C310.4 BL1

24. Give an example where counting distinct elements in a stream is useful. C310.4 BL1
25. Why is counting distinct elements challenging in streaming data? C310.4 BL1
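
The decaying window asked about in question 10 is commonly defined as an exponentially weighted sum: each time a new element arrives, the running total is multiplied by (1 − c) and the new element is added, so older elements fade without a hard window boundary. A minimal Python sketch, assuming a numeric stream and a small decay constant c chosen by the user:

```python
def decayed_sum(stream, c):
    """Exponentially decaying window over a numeric stream.

    After processing elements a_1..a_t the result equals
    sum(a_(t-i) * (1 - c) ** i for i in 0..t-1), computed with one
    multiplication and one addition per arriving element.
    """
    total = 0.0
    for value in stream:
        total = total * (1 - c) + value
    return total


# Hypothetical 0/1 stream; c = 0.5 is unrealistically large but easy to check.
RECENT = decayed_sum([1, 1, 1], c=0.5)   # 1 + 0.5 + 0.25
FADED = decayed_sum([1, 0, 0], c=0.5)    # a single old 1, decayed twice
```

Only one number is stored regardless of stream length, which is the main reason to prefer this over keeping an explicit sliding window.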

UNIT-IV / PART-B
1. Describe what a stream data model is and its primary characteristics. C310.4 BL2
2. Explain the stream data model and its architecture. Discuss the components, techniques, and challenges involved in stream data processing. C310.4 BL3
3. What are the main challenges in stream data processing, and how do different architectural components address these challenges? C310.4 BL2
4. Describe the role of data stream processing platforms in real-time analytics. Discuss the architecture, features, challenges, and applications with examples. C310.4 BL2
5. Discuss how counting ones in a window is implemented in stream processing systems. Include the role of windows, algorithms used, and challenges faced. C310.4 BL2
6. Explain the concept of "counting ones in a window" in data streams. Describe the types of windows, techniques, and challenges involved. Provide examples of its applications. (MAY 2024) (MAY 2021) C310.4 BL3
7. Explain the Alon-Matias-Szegedy (AMS) algorithm, including its purpose, key concepts, working, and applications in data stream processing. (MAY 2021) C310.4 BL3
8. Explain what real-time sentiment analysis is and why stream processing is essential for it. C310.4 BL3
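
The AMS algorithm in question 7 estimates the second moment of a stream (the sum of squared element counts) from a few sampled positions. For a sampled position, let c be the number of times the element at that position appears from there to the end of the stream; then n(2c − 1) is an unbiased estimate, and several such estimates are averaged. The Python sketch below keeps the whole stream in a list purely for clarity, which a real streaming implementation would not do; the function names and the example stream are illustrative assumptions.

```python
import random
from collections import Counter


def ams_estimate(stream, positions):
    """Average the AMS estimates n * (2c - 1) over the given positions."""
    n = len(stream)
    estimates = []
    for p in positions:
        element = stream[p]
        # c = occurrences of this element from position p to the end.
        c = sum(1 for x in stream[p:] if x == element)
        estimates.append(n * (2 * c - 1))
    return sum(estimates) / len(estimates)


def ams_second_moment(stream, num_variables, seed=0):
    """Estimate the second moment from randomly chosen positions."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(stream)) for _ in range(num_variables)]
    return ams_estimate(stream, positions)


def exact_second_moment(stream):
    """Ground truth: sum of squared frequencies of each distinct element."""
    return sum(count * count for count in Counter(stream).values())


# Hypothetical stream of 15 symbols: a x5, b x4, c x3, d x3.
STREAM = list("abcbdacdabdcaab")
EXACT = exact_second_moment(STREAM)                      # 25 + 16 + 9 + 9
ALL_POSITIONS = ams_estimate(STREAM, range(len(STREAM)))
```

Averaging over every position reproduces the exact value, which is a quick way to see why the estimator is unbiased; with fewer random positions the estimate varies around it.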

UNIT-IV / PART-C

1. Explain nodes, edges, and types of graphs (directed, undirected, weighted). C310.4 BL3

2. Discuss key graph algorithms, such as PageRank, community detection, and shortest path. (NOV 2023) (MAY 2021) C310.4 BL2

3. Design a data mining strategy for real-time fraud detection in banking transactions. How would you mine the continuous stream of transaction data to identify unusual patterns or behaviors indicative of fraud? Which algorithms would you use for classification or anomaly detection, and how would you ensure minimal false positives in a high-volume, real-time environment? C310.4 BL6

4. Design a real-time data mining strategy to predict customer churn for a subscription service. How would you mine user behavior data streams to identify patterns that indicate an increased likelihood of churn? Which data mining algorithms (e.g., decision trees, clustering, association rule mining) would you use, and how would you address the challenges of handling dynamic, continuous data? C310.4 BL6

UNIT V - BIGDATA MODELS


Introduction to NoSQL – Aggregate Data Models – HBase: Data Model and Implementations – HBase
Clients – Examples – Pig Data Model – Hive – Data Types and File Formats – HiveQL Data Definition –
HiveQL Data Manipulation – HiveQL Queries
UNIT-V / PART-A
1. Define Pig. C310.5 BL1
2. Define Hive. (NOV 2023) C310.5 BL2
3. Why NoSQL? C310.5 BL3
4. Define CAP theorem. C310.5 BL1
5. Define Consistency Property. C310.5 BL2
6. What is atomicity property? C310.5 BL1
7. What is commit and abort? C310.5 BL1
8. Define Consistency. C310.5 BL1
9. Define Isolation. C310.5 BL1
10. Define Durability. C310.5 BL1
11. Define Availability. C310.5 BL1
12. Define Partition Tolerance. C310.5 BL1
13. What kinds of NoSQL are available? C310.5 BL1
14. What are the advantages of NoSQL? C310.5 BL1
15. What is HBase? C310.5 BL1
16. What are the benefits of HBase? C310.5 BL1
17. Why should we choose HBase? C310.5 BL1
18. Define the data model of HBase. C310.5 BL1
19. Draw the architectural diagram of HBase. C310.5 BL1
20. Define Key value stores. (NOV 2023) (MAY 2021) C310.5 BL1
21. Define the data model of Hive. C310.5 BL1
22. Differentiate SQL and NoSQL. (NOV/DEC-2022) C310.5 BL1
23. What is Sharding? (NOV/DEC-2022) C310.5 BL1
24. Why were schema-less models developed? (April/May-2019) C310.5 BL1
25. How do you select distinct values from a column in Hive? C310.5 BL1
UNIT-V / PART-B
1. Explain NoSQL and its data model. (MAY 2024) C310.5 BL2
2. Explain the HBase data model architecture and its implementation. (April/May-2019) (NOV 2023) C310.5 BL3
3. Explain the Pig data model with examples. (NOV 2023) C310.5 BL2
4. Explain the Hive data model architecture with examples. (MAY 2024) C310.5 BL2
5. Explain HiveQL data definition and manipulation. C310.5 BL2
6. Explain the process of creating a table in HiveQL. Include details on specifying columns, data types, table properties, and different storage formats like TEXTFILE, ORC, and PARQUET. Discuss how external tables differ from managed tables in Hive. C310.5 BL3
7. Describe the process of loading data into a Hive table from both local and HDFS file systems. How do you use LOAD DATA with partitioned tables? Explain the concept of dynamic partitioning in Hive. C310.5 BL3
8. Write a HiveQL query to demonstrate the use of GROUP BY, HAVING, and aggregate functions like COUNT, AVG, and SUM. Explain how GROUP BY works and when to use HAVING instead of WHERE. C310.5 BL3
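The distinction that question 8 asks about is one of ordering: WHERE filters individual rows before they are grouped, while HAVING filters whole groups after the aggregates have been computed. The Python sketch below imitates that evaluation order on an in-memory list; it is an illustration of the semantics only, not HiveQL, and the `sales(region, amount)` rows and thresholds are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table sales(region, amount).
SALES = [
    ("north", 100),
    ("north", 300),
    ("south", 50),
    ("south", 70),
    ("east", 500),
]


def group_by_having(rows, where, having):
    """Imitate WHERE -> GROUP BY -> aggregates -> HAVING, in that order."""
    groups = defaultdict(list)
    for region, amount in rows:
        if where(region, amount):            # WHERE: row-level filter
            groups[region].append(amount)    # GROUP BY region
    result = {}
    for region, amounts in groups.items():
        aggregates = {
            "count": len(amounts),                 # COUNT(*)
            "sum": sum(amounts),                   # SUM(amount)
            "avg": sum(amounts) / len(amounts),    # AVG(amount)
        }
        if having(aggregates):               # HAVING: group-level filter
            result[region] = aggregates
    return result


# Roughly: WHERE amount >= 60 GROUP BY region HAVING SUM(amount) > 200
RESULT = group_by_having(
    SALES,
    where=lambda region, amount: amount >= 60,
    having=lambda agg: agg["sum"] > 200,
)
```

The row ("south", 50) is dropped by the row-level filter, and the remaining south group (sum 70) is then dropped by the group-level filter, which could not have been expressed as a condition on single rows.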
UNIT-V / PART-C

1. Discuss the key features of Hive as a data warehousing solution on Hadoop. Explain its architecture and how it enables querying large datasets using HiveQL. Compare Hive with traditional RDBMS. C310.5 BL3
2. Explain the data model used in Apache Pig. Discuss the different data types supported in Pig (e.g., scalar types, tuples, bags, and maps). Provide examples of how these data types are used in Pig scripts. C310.5 BL2
3. Create a Big Data model to monitor patient health in real-time using wearable devices. Which predictive modeling techniques would you use to detect anomalies or early signs of health risks? How would you ensure the model is accurate, scalable, and privacy-compliant (e.g., HIPAA)? C310.5 BL6
4. Design a Big Data model to detect fake news in real-time news streams. Which natural language processing (NLP) techniques and machine learning models would you use to classify news articles as genuine or fake? How would you deal with the large volume and complexity of data from different sources in real-time? C310.5 BL6
