Shivaji University, Kolhapur
Question Bank for Mar 2022 (Summer) Examination
Subject Code: 84719, Subject Name: Big Data Analytics
Sr. No. | Question | A | B | C | D
1 | What are the main components of Hadoop? | MapReduce | HDFS | YARN | All of the above
2 | Big data analytics works on unstructured data, where no specific pattern of the data is defined. | True | False | Can’t Say | None of the above
3 | Identify the incorrect big data technology. | Apache Kafka | Apache Spark | Apache Pytorch | Apache Hadoop
4 | Identify which of the options below is a general-purpose computing model and runtime system for distributed data analytics. | HDFS | MapReduce | Oozie | All of the above
5 | Big data analysis does all of the following except: | Spreads data | Analyzes data | Organizes data | Collects data
6 | What is NOT a characteristic of big data? | Volume | Variety | Vision | Velocity
7 | Pig is a Hadoop-based open-source platform for analyzing large-scale datasets via its own SQL-like language, _______ | Pig Latin | Pig German | Pig Roman | Pig Italian
8 | The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces in the network, the operation will run in ______ keys and lists of data. | series on different | parallel on different | parallel on same | series on same
9 | In Hadoop MapReduce, _____ is a Java class that comes with several methods to retrieve keys and values by iterating over them among the data splits. | Mapper | RecordReader | Reporter | RecordCollector
10 | Which of the following scenarios makes HDFS unavailable? | TaskTracker failure | JobTracker failure | NameNode failure | DataNode failure
11 | Hadoop MapReduce is a popular _____ for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes). | Spring framework | Java framework | Django framework | Web framework
12 | Which is not a way to link R and Hadoop? | RHIPE | RHadoop | RHDFS | Hadoop streaming
13 | The RHIPE package uses the ________ technique to perform data analytics over Big Data. | Divide and recombine | Divide and conquer | Integrate and recombine | None of the above
14 | The _____ phase of the data analytics lifecycle usually takes the longest time. | Phase 2: Data Preparation | Phase 3: Model Planning | Phase 4: Model Building | Phase 5: Communicate Results
15 | The data analytics project life cycle stages in correct sequence are __________ | Identifying the problem > designing the requirements > performing analytics over data > pre-processing data > visualizing data | Identifying the problem > performing analytics over data > designing the requirements > pre-processing data > visualizing data | Identifying the problem > designing the requirements > pre-processing data > performing analytics over data > visualizing data | Identifying the problem > visualizing data > designing the requirements > pre-processing data > performing analytics over data
16 | Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods? (1) Both methods can be used for classification tasks. (2) Random Forest is used for classification whereas Gradient Boosting is used for regression tasks. (3) Random Forest is used for regression whereas Gradient Boosting is used for classification tasks. (4) Both methods can be used for regression tasks. | 1 | 2 | 2 and 3 | 1 and 4
17 | In Random Forest you can generate hundreds of trees (say T1, T2, ..., Tn) and then aggregate the results of these trees. Which of the following is true about an individual tree (Tk) in Random Forest? (1) An individual tree is built on a subset of the features. (2) An individual tree is built on all the features. (3) An individual tree is built on a subset of the observations. (4) An individual tree is built on the full set of observations. | 1 and 3 | 1 and 4 | 2 and 3 | 2 and 4
18 | The primary Machine Learning API for Spark is now the _____-based API. | DataFrame | Dataset | RDD | All of the above
19 | Which of the following is a module for structured data processing? | GraphX | MLlib | SparkSQL | SparkR
20 | SparkSQL translates commands into code. This code is processed by ________ | Driver nodes | Executor nodes | Cluster manager | None of the above
21 | SparkSQL plays the main role in the optimization of queries. | True | False | Can’t Say | None is correct
22 | Which of the following is not a SparkSQL query execution phase? | Analysis | Logical Optimization | Physical Planning | Execution
23 | What is an action in Spark RDD? | The ways to send results from executors to the driver | Takes an RDD as input and produces one or more RDDs as output | Creates one or many new RDDs | All of the above
24 | Which of the following is true about narrow transformation? | The data required to compute resides on multiple partitions. | The data required to compute resides on the single partition. | Both | None of the above
25 | __________ is a distributed machine learning framework on top of Spark. | MLlib | GraphX | RDDs | Spark Streaming
26 | Which of the following components of the Spark runtime architecture provides resources to execute a task? | Cluster manager | Worker nodes | Driver program | Spark context
27 | Among the following options, identify the one which is not a type of learning. | Unsupervised learning | Reinforcement learning | Supervised learning | Semi unsupervised learning
28 | Identify the type of learning in which labeled training data is used. | Unsupervised learning | Reinforcement learning | Supervised learning | Semi unsupervised learning
29 | Machine learning is a subset of which of the following? | Artificial Intelligence | Deep Learning | Data Learning | None of the above
30 | Which of the following machine learning techniques helps in detecting the outliers in data? | Classification | Clustering | Anomaly Detection | All of the above
31 | Which of the following are common classes of problems in machine learning? | Regression | Classification | Clustering | All of the above
32 | What is a content-based recommendation system? | Tries to recommend items based on profile built from their preferences | Similarity among items | Similarity among users buying, watching, or enjoying something | All of the above
33 | Machine Learning is a field of AI consisting of learning algorithms that __________ | At executing some task | Over time with experience | Improve their performance | All of the above
34 | Which of the following machine learning algorithms is based upon the idea of bagging? | Decision tree | Random forest | Classification | Regression
35 | Among the following options, identify the one which is false regarding regression. | It is used for the prediction | It is used for interpretation | It relates inputs to outputs | It discovers causal relationships
UNIT – I
1. Define Big Data. Explain the characteristics (V’s) of Big Data.
2. Write a note on: Drivers for Big Data.
3. Explain different applications of Big Data.
4. Write a note on: Data Privacy Protection.
5. With a neat diagram, depict the Product Knowledge Hub in Big Data.
6. Write a short note on Location Based Services in Big Data.
7. Explain the architectural components of Big Data.
8. Explain the Real-time Adaptive Analytics and Decision Engine.
9. Explain Massively Parallel Processing (MPP) platforms.
10. Explain Unstructured Data Analytics and Reporting.
UNIT – II
1. Explain the features of the R language.
2. Explain the different phases of MapReduce with an example.
3. What is HDFS? Explain the features of HDFS.
4. Explain the HDFS and MapReduce architecture.
5. List and explain different components of Hadoop.
6. Explain in detail the stages of Hadoop MapReduce data processing.
7. Explain in detail the dataflow of MapReduce with a diagram.
8. Explain the limitations of MapReduce.
9. Explain the data mining techniques which are used to perform data modeling in R.
10. Mention the different Hadoop installation modes and explain each of them.
UNIT – III
1. Explain the architecture of RHIPE.
2. Explain RHadoop in detail.
3. Explain the architecture of RHadoop.
4. Explain the working of RHadoop with an example.
5. Explain the hsTableReader function for Hadoop streaming.
6. Explain the hsKeyValReader function for Hadoop streaming.
7. Explain the Hadoop streaming components.
8. Explain the format of a Hadoop streaming command, line by line.
UNIT – IV
1. Explain Data Analytics project life cycle stages.
2. Explain how the data analytics problem of calculating the frequency of stock market
changes can be solved using MapReduce.
3. Write a case study for predicting the auction sales price of heavy equipment to
create a blue book for bulldozers.
4. Explain the Poisson-approximation resampling technique used in the Map phase of a
MapReduce task.
5. How can data analytics help to identify the category of a web page of a website,
where pages are categorized by popularity as high, medium, or low (regular) based on
their visit counts?
6. Write the steps to build and run the MapReduce algorithm with R and Hadoop
integration for the web page categorization problem.
7. Explain pre-processing data and performing analytics over data.
8. Explain how the MapReduce problem is designed for computing the frequency of
stock market changes.
UNIT – V
1. What is a Resilient Distributed Dataset (RDD)? Explain transformations and actions
in RDD. Explain RDD operations in brief.
2. Why is Spark preferred over Hadoop? Explain the limitations of Hadoop.
3. Explain how Spark overcomes the limitations of Hadoop.
4. Briefly explain the core components in Spark.
5. Explain the architecture of Spark.
6. What is SparkContext in Apache Spark?
7. What is a directed acyclic graph (DAG) in Spark, and how does it work?
8. What are Spark DataFrames? Why do we use them in Spark?
9. Explain Apache Spark RDD Operations in detail.
10. What are the different types of RDD transformations? Explain the functions used in
RDD transformations.
11. What are RDD actions? When are they used? Explain Spark actions.
12. What are the deployment modes in Spark? What is the difference between client and
cluster mode deployment?
13. What are the components of Spark architecture?
14. What is Spark Core? What are the various functions of Spark Core? Which
component is built on top of Spark Core?
15. What are the components of Spark Streaming? What is Spark Streaming used for?
UNIT – VI
1. What is machine learning? Explain types of machine-learning algorithms.
2. Explain supervised machine learning algorithms.
3. Explain how linear regression is performed using R and Hadoop.
4. Explain how logistic regression is performed using R and Hadoop.
5. Explain unsupervised machine learning algorithms.
6. Explain the steps to perform clustering with R and Hadoop.
7. Explain the steps to generate recommendations in R.
8. What is a recommendation algorithm? Explain two different types of
recommendation algorithms.
9. How do you create a recommendation algorithm with R and Hadoop?
10. How can one use R and Hadoop together to generate recommendations from big
datasets?