QB - Updated 1
St. Joseph’s Institute of Technology
IT4651 – Big Data Analytics Department of CSE 2024-2025
COURSE OUTCOMES
On successful completion of this course, the student will be able to:
C310.1 Know the fundamental concepts of big data and analytics.
C310.2 Explore the tools of big data and analytics.
C310.3 Know the practices for working with big data.
C310.4 Learn about stream computing.
C310.5 Know about research that requires the integration of large amounts of data.
4. Classification: Decision Trees BL1 T1
5. Overview of a Decision Tree - The General Algorithm BL4 T1
6. Decision Tree Algorithms BL2,BL5 T1
7. Evaluating a Decision Tree BL4 T1
8. Decision Trees in R - Naïve Bayes BL2 T1
9. Bayes Theorem - Naïve Bayes Classifier BL2,BL6 T1
3. Filtering Streams BL1,BL4 T1
4. Counting Distinct Elements in a Stream BL3 T1
5. Estimating Moments BL4 T1
C310.4
6. Counting oneness in a Window BL2,BL4 T1
UNIT V - BIG DATA MODELS
S.No Topic Knowledge level Books Referred Course Outcomes
1. Introduction to NoSQL BL2 T1
2. Aggregate Data Models BL3 T1
3. HBase BL3,BL4 T1
4. Data Model and Implementations BL5 T1
5. HBase Clients – Examples BL4 T1
C310.5
6. Pig Data Model BL2,BL3 T1
7. Hive – Data Types and File Formats BL4 T1
8. HiveQL Data Definition BL4 T1
9. HiveQL Data Manipulation – HiveQL Queries BL4 T1
BL1 – Remembering, BL2 – Understanding, BL3 – Applying, BL4 – Analyzing, BL5 – Evaluating, BL6 – Creating
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive clear
instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and leader
in a team, to manage projects and in multidisciplinary environments.
12. Life-Long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.
11. What is the need for the MapReduce function? (Apr/May 2022) C310.3 BL1
12. List the limitations of the MapReduce model. C310.3 BL2
13. List the five basic operations of the MapReduce programming model. C310.3 BL2
14. Define Map function. C310.3 BL1
15. What is the need of the Reduce function? C310.3 BL1
16. How will the limitations of MapReduce be overcome in future versions of Hadoop? C310.3 BL1
17. Differentiate between JobTracker and TaskTracker. C310.3 BL2
18. Mention the general form of map and reduce functions in Hadoop MapReduce. C310.3 BL1
19. What is YARN? (Apr/May 2022) C310.3 BL1
20. What is a YARN scheduler? C310.3 BL1
21. List the major responsibilities of YARN. (Nov/Dec 2022) C310.3 BL2
22. What is the purpose of the scheduler in the resource manager of the YARN architecture? C310.3 BL1
23. What are the 4 components of Hadoop architecture? C310.3 BL1
24. Differentiate HDFS and MapReduce. C310.3 BL2
25. What is a unit test in MapReduce? C310.3 BL1
26. List the mapper and reducer formulas for matrix multiplication. (Nov/Dec 2023) C310.3 BL2
27. Define MapReduce workflow in the context of data processing. (Nov/Dec 2023) C310.3 BL1
28. What is the primary role of YARN in a Hadoop ecosystem? (Nov/Dec 2023) C310.3 BL1
29. In the context of Hadoop, what is the purpose of Hadoop Pipes? (Nov/Dec 2023) C310.3 BL1
30. Why is ensuring data integrity crucial in Hadoop distributed systems? (Nov/Dec 2023) C310.3 BL1
UNIT-III / PART-B
1. Explain the architecture of the Hadoop ecosystem. (Nov/Dec 2021) C310.3 BL3
2. With the help of a neat sketch, explain Hadoop streaming in detail. C310.3 BL3
3. Briefly explain the Hadoop Distributed File System with a neat diagram. (Nov/Dec 2023) C310.3 BL3
4. Elaborate on the impact of seamless Hadoop integration on enhancing data processing and analytics. (Nov/Dec 2023) C310.3 BL3
5. Explain the components involved in the anatomy of a MapReduce job run. (Nov/Dec 2023) C310.3 BL3
6. Discuss a MapReduce program to find the number of words in a text file. C310.3 BL3
7. Explain the MapReduce algorithm and its workflow with a suitable example. (Nov/Dec 2023) C310.3 BL3
8. Explain the YARN architecture in detail. (Nov/Dec 2021) C310.3 BL3
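As a starting point for the word-count question in this part, the following is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It does not use the Hadoop API; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample input are illustrative assumptions only, and a real job would run the same logic through Hadoop's Mapper and Reducer interfaces or Hadoop streaming.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the list of counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())


# Hypothetical two-line input file.
result = word_count(["big data big analytics", "data streams"])
print(result)
```

The sketch keeps the three phases as separate functions so that each one can be mapped onto the corresponding stage of a MapReduce job run.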
1. Discuss the functions of the JobTracker and TaskTracker with a real-time scenario. (Apr/May 2022) C310.3 BL3
2. Explain a) data integrity in HDFS b) the Hadoop local file system. (Apr/May 2022) C310.3 BL3
3. Propose a Big Data file system solution for an e-commerce platform that can efficiently handle large volumes of transactional and unstructured data. Discuss how Hadoop HDFS (Hadoop Distributed File System) or other distributed file systems can be used to store and process this data, and explain how fault tolerance, scalability, and high availability are managed. C310.3 BL6
4. Design a Big Data file system solution that can integrate data from multiple marketing platforms. How would you use a distributed file system like HDFS or cloud storage to store and process this data, ensuring it is easy to query and access for reporting? What steps would you take to ensure the system is scalable as the volume of data increases over time? C310.3 BL6
3. Explain the difference between classical data mining and data stream mining. (May 2024) C310.4 BL3
8. Name two probabilistic algorithms used for counting distinct elements in streams. C310.4 BL1
9. What does counting "oneness" in a window mean? C310.4 BL1
10. What is a decaying window? Why use a decaying window in data streams? (Nov 2023) (May 2021) C310.4 BL1
11. Why are moments estimated in data streams? C310.4 BL1
14. Why are Real-Time Analytics Platforms used in IoT applications? C310.4 BL1
18. Why is Real-Time Sentiment Analysis important for businesses? C310.4 BL1
20. List one benefit of real-time stock market prediction. C310.4 BL1
21. What types of data are used in real-time stock market prediction? C310.4 BL1
22. Name one technique used in real-time stock market prediction. C310.4 BL1
24. Give an example where counting distinct elements in a stream is useful. C310.4 BL1
25. Why is counting distinct elements challenging in streaming data? C310.4 BL1
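For the questions on counting distinct elements, the sketch below illustrates one well-known probabilistic approach, a single-hash Flajolet-Martin style estimate, under simplifying assumptions: one hash function only (practical versions combine many), MD5 truncated to 32 bits as the hash, and hypothetical function names. It is meant to show why duplicates do not change the estimate, not to serve as a tuned implementation.

```python
import hashlib


def trailing_zeros(value):
    """Number of trailing zero bits in a non-negative integer (0 for value 0)."""
    if value == 0:
        return 0
    count = 0
    while value & 1 == 0:
        value >>= 1
        count += 1
    return count


def fm_estimate(stream):
    """Estimate the number of distinct items as 2**R, where R is the largest
    number of trailing zero bits seen in any item's hash value."""
    max_zeros = 0
    for item in stream:
        digest = hashlib.md5(str(item).encode("utf-8")).digest()
        hashed = int.from_bytes(digest[:4], "big")  # assumed 32-bit hash
        max_zeros = max(max_zeros, trailing_zeros(hashed))
    return 2 ** max_zeros


# Repeated items hash to the same value, so they cannot raise the estimate.
print(fm_estimate(["u1", "u2", "u1", "u3", "u2"]))
```

Only the running maximum is stored, which is why the memory cost stays constant however long the stream is.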
UNIT-IV / PART-B
1. Describe what a stream data model is and its primary characteristics. C310.4 BL2
2. Explain the stream data model and its architecture. Discuss the components, techniques, and challenges involved in stream data processing. C310.4 BL3
3. What are the main challenges in stream data processing, and how do different architectural components address these challenges? C310.4 BL2
4. Describe the role of data stream processing platforms in real-time analytics. Discuss the architecture, features, challenges, and applications with examples. C310.4 BL2
5. Discuss how counting oneness in a window is implemented in stream processing systems. Include the role of windows, algorithms used, and challenges faced. C310.4 BL2
6. Explain the concept of "counting oneness in a window" in data streams. Describe the types of windows, techniques, and challenges involved. Provide examples of its applications. (May 2024) (May 2021) C310.4 BL3
7. Explain the Alon-Matias-Szegedy (AMS) algorithm, including its purpose, key concepts, working, and applications in data stream processing. (May 2021) C310.4 BL3
8. Explain what real-time sentiment analysis is and why stream processing is essential for it. C310.4 BL3
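For the AMS question in this part, the following is a minimal offline sketch of the second-moment estimate. It assumes the whole stream is held in a list and its length n is known, which a true one-pass version would avoid by sampling positions at random and maintaining counters as items arrive. The function names and the sample stream are illustrative assumptions.

```python
from collections import Counter


def ams_estimate(stream, position):
    """AMS estimate of the second moment from a single sampled position:
    n * (2c - 1), where c counts the sampled element from that position on."""
    n = len(stream)
    element = stream[position]
    c = sum(1 for x in stream[position:] if x == element)
    return n * (2 * c - 1)


def second_moment_exact(stream):
    """Exact second moment: the sum of squared frequencies of each element."""
    return sum(m * m for m in Counter(stream).values())


# Sample stream of 15 items: a occurs 5 times, b 4 times, c 3 times, d 3 times.
stream = list("abcbdacdabdcaab")
estimates = [ams_estimate(stream, i) for i in range(len(stream))]
mean_estimate = sum(estimates) / len(estimates)

print(second_moment_exact(stream), mean_estimate)
```

Averaging the estimate over every possible position reproduces the exact second moment, which is the sense in which a single sampled position gives an unbiased estimate.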
UNIT-IV / PART-C
1. Explain nodes, edges, and types of graphs (directed, undirected, weighted). C310.4 BL3
3. Design a data mining strategy for real-time fraud detection in banking transactions. How would you mine the continuous stream of transaction data to identify unusual patterns or behaviors indicative of fraud? Which algorithms would you use for classification or anomaly detection, and how would you ensure minimal false positives in a high-volume, real-time environment? C310.4 BL6
4. Design a real-time data mining strategy to predict customer churn for a subscription service. How would you mine user behavior data streams to identify patterns that indicate an increased likelihood of churn? Which data mining algorithms (e.g., decision trees, clustering, association rule mining) would you use, and how would you address the challenges of handling dynamic, continuous data? C310.4 BL6
24. Why were schema-less models developed? (Apr/May 2019) C310.5 BL1
25. How do you select distinct values from a column in Hive? C310.5 BL1
UNIT-V / PART-B
1. Explain NoSQL and its data model. (May 2024) C310.5 BL2
2. Explain the HBase data model architecture and its implementation. (Apr/May 2019) (Nov 2023) C310.5 BL3
3. Explain the Pig data model with examples. (Nov 2023) C310.5 BL2
4. Explain the Hive data model architecture with examples. (May 2024) C310.5 BL2
5. Explain HiveQL data definition and manipulation. C310.5 BL2
6. Explain the process of creating a table in HiveQL. Include details on specifying columns, data types, table properties, and different storage formats like TEXTFILE, ORC, and PARQUET. Discuss how external tables differ from managed tables in Hive. C310.5 BL3
7. Describe the process of loading data into a Hive table from both local and HDFS file systems. How do you use LOAD DATA with partitioned tables? Explain the concept of dynamic partitioning in Hive. C310.5 BL3
8. Write a HiveQL query to demonstrate the use of GROUP BY, HAVING, and aggregate functions like COUNT, AVG, and SUM. Explain how GROUP BY works and when to use HAVING instead of WHERE. C310.5 BL3
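For the GROUP BY and HAVING question in this part, the sketch below runs a comparable query through Python's built-in `sqlite3` module. SQLite is used only as a stand-in so that the example is runnable without a Hive installation; the `sales` table, its columns, and the sample rows are hypothetical. The SELECT statement uses only standard clauses, but it is an assumption that it carries over to HiveQL unchanged.

```python
import sqlite3

# In-memory database standing in for a Hive table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100), ("north", 300), ("south", 50), ("east", 200), ("east", 400)],
)

query = """
    SELECT region, COUNT(*) AS n, SUM(amount) AS total, AVG(amount) AS mean
    FROM sales
    WHERE amount > 60          -- row filter, applied before grouping
    GROUP BY region
    HAVING SUM(amount) > 500   -- group filter, applied after aggregation
    ORDER BY region
"""
rows = conn.execute(query).fetchall()
print(rows)
```

WHERE removes individual rows before the groups are formed, while HAVING discards whole groups based on an aggregate, which is why a condition on SUM(amount) cannot be written in the WHERE clause.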
UNIT-V / PART-C
1. Discuss the key features of Hive as a data warehousing solution on Hadoop. Explain its architecture and how it enables querying large datasets using HiveQL. Compare Hive with traditional RDBMS. C310.5 BL3
2. Explain the data model used in Apache Pig. Discuss the different data types supported in Pig (e.g., scalar types, tuples, bags, and maps). Provide examples of how these data types are used in Pig scripts. C310.5 BL2
3. Create a Big Data model to monitor patient health in real-time using wearable devices. Which predictive modeling techniques would you use to detect anomalies or early signs of health risks? How would you ensure the model is accurate, scalable, and privacy-compliant (e.g., HIPAA)? C310.5 BL6
4. Design a Big Data model to detect fake news in real-time news streams. Which natural language processing (NLP) techniques and machine learning models would you use to classify news articles as genuine or fake? How would you deal with the large volume and complexity of data from different sources in real-time? C310.5 BL6