2022 Assignment Answers
___________________________________________________________________________
Q. 1 True or False ?
Big Data is a collection of data that is huge in volume and growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.
• True
• False
Answer: True
Explanation: Velocity is the speed at which data is processed. This includes input such as processing
of social media posts and output such as the processing required to produce a report or execute a process.
______________ refers to the accuracy and correctness of the data relative to a particular use.
• Value
• Veracity
• Velocity
• Validity
Answer: Validity
Explanation: Validity refers to the accuracy and correctness of the data relative to a particular use.
Statement 1: Viscosity refers to the data velocity relative to the timescale of the event being studied.
Statement 2: Volatility refers to the rate of data loss and the stable lifetime of data.
• HDFS
• YARN
• Map Reduce
• PIG
Answer: Map Reduce
Explanation: Map Reduce is a programming model and an associated implementation for processing
and generating large data sets.
Q. 6 _______________is an open source software framework for big data. It has two basic parts: HDFS
and Map Reduce.
• Spark
• HBASE
• HIVE
• Apache Hadoop
Answer: Apache Hadoop
Explanation: Apache Hadoop is an open source software framework for big data.
a) Hadoop Distributed File System (HDFS) is the storage system of Hadoop, which splits big data and distributes it across many nodes in a cluster.
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Answer: Hadoop YARN
Explanation:
Hadoop Common: It contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS): It is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the entire cluster.
Hadoop YARN: It is a resource management platform responsible for managing compute resources in the cluster and using them to schedule users' applications. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes.
Hadoop MapReduce: It is a programming model for processing and generating large data sets in parallel across the cluster.
Q. 8 ____________is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leadership election, work queues, etc.
• Apache Sqoop
• Mahout
• Flume
• ZooKeeper
Answer: ZooKeeper
Explanation: ZooKeeper is a central key-value store that distributed systems can use to coordinate. Since it needs to be able to handle the load, ZooKeeper itself runs on many machines.
• Hive
• Cassandra
• Apache Kafka
• RDDs
Answer: Apache Kafka
Explanation: Apache Kafka is an open source stream processing software platform developed by the
Apache Software Foundation written in Scala and Java.
Q. 10 True or False ?
NoSQL databases are non-tabular databases and store data differently than relational tables. NoSQL
databases come in a variety of types based on their data model. The main types are document, key-
value, wide-column, and graph. They provide flexible schemas and scale easily with large amounts of
data and high user loads.
• True
• False
Answer: True
Explanation: While traditional SQL can be effectively used to handle large amounts of structured data, we need NoSQL (Not Only SQL) to handle unstructured data. NoSQL databases store unstructured data with no particular schema.
___________________________________________________________________________
Quiz Assignment-II Solutions: Big Data Computing (Week-2)
___________________________________________________________________________
A. Data Node
B. Name Node
C. Data block
D. Replication
Explanation: Name Node works as a master server that manages the file system namespace
and basically regulates access to these files from clients; it also keeps track of where the data is on the Data Nodes and where the blocks are distributed. On the other hand, the Data Node is the slave/worker node and holds the user data in the form of Data Blocks.
Q. 2 - When a client contacts the name node for accessing a file, the name node responds
with
Answer: D. Block ID and hostname of all the data nodes containing that block.
Explanation: A name node is a master server that manages the file system namespace and basically regulates access to these files from clients; it also keeps track of where the data is on the DataNodes and where the blocks are distributed.
Q. 3 The namenode knows that the datanode is active using a mechanism known as
A. datapulse
B. h-signal
C. heartbeats
D. Active-pulse
Answer: C. heartbeats
Explanation: In Hadoop, the Name node and the data nodes communicate using Heartbeats. A Heartbeat is the signal sent by a datanode to the namenode at regular intervals of time to indicate its presence, i.e. to indicate that it is alive.
A. NameNode
B. Checkpoint Node
C. DataNode
D. None of the mentioned
Answer: A. NameNode
Explanation: To read/write a file in HDFS, a client needs to interact with master i.e.
namenode (master).
Q. 5 True or False ?
A. True
B. False
Answer: A) True
Explanation: Once data is written in HDFS, it is immediately replicated across the cluster, so that different copies of the data are stored on different data nodes. The replication factor is normally 3, which keeps the data neither over-replicated nor under-replicated.
Statement 1: Task Tracker is hosted inside the master and it receives the job execution
request from the client.
Statement 2: Job tracker is the MapReduce component on the slave machine as there are
multiple slave machines.
Explanation:
The Job Tracker is hosted inside the master and it receives the job execution request from the client.
The Task Tracker is the MapReduce component on the slave machine, as there are multiple slave machines.
Statement 2: Users specify a map function that processes a key/value pair to generate a set
of intermediate key/value pairs, and a reduce function that merges all intermediate values
associated with the same intermediate key.
A. YARN extends the power of Hadoop to incumbent and new technologies found
within the data center
B. YARN is highly scalable.
C. YARN enhances a Hadoop compute cluster in many ways
D. All of the mentioned
A. Only map()
B. Only reduce()
C. map() and reduce()
D. The code does not have to be changed
Explanation:
File contents: this is a sample file fit for nothing
Length of words: 4 + 2 + 1 + 6 + 4 + 3 + 3 + 7 = 30
Average length of all the words = total length of all the words / total number of words = 30/8 = 3.75
Step 3: Output the number 1 as key and each word as value. The sample output of the mapper would look like:
KEY: 1, VALUE: this
Step 4: Repeat the above steps for all the words in the line.
KEY VALUE
1 this
1 is
1 a
1 sample
1 file
1 fit
1 for
1 nothing
All the keys are 1 here, so after shuffling the reducer receives a single key with all the words as its list of values:
KEY: 1, VALUE: [this, is, a, sample, file, fit, for, nothing]
All the words in the file would indeed be in the list of values passed on to the reducer.
Finally, compute the average length of the words, since we now know both the total length of all the words and the total count of words.
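The whole map/reduce flow above can be sketched in plain Python (the file contents are taken from the mapper output listed in the explanation):

```python
# Sketch of the map and reduce steps described in the explanation.
words = "this is a sample file fit for nothing".split()

# Map step: emit (1, word) for every word, as in Steps 3-4
mapped = [(1, w) for w in words]

# Reduce step: every pair shares the key 1, so the reducer sees the full
# list of words and can compute total length / word count
values = [w for _, w in mapped]
average = sum(len(w) for w in values) / len(values)  # 30 / 8 = 3.75
```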
___________________________________________________________________________
Quiz Assignment-III Solutions: Big Data Computing (Week-3)
___________________________________________________________________________
Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python and
also provides an interactive shell.
Q. 2 True or False ?
A. True
B. False
Answer: True
A. Spark Streaming
B. FlatMap
C. Resilient Distributed Dataset (RDD)
D. Driver
Q. 4 Given the following definition about the join transformation in Apache Spark:
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
Where join operation is used for joining two datasets. When it is called on datasets of
type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of
elements for each key.
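As an illustration of these join semantics, here is a plain-Python sketch with hypothetical keys and values (not the data used in the question):

```python
# Hypothetical (K, V) and (K, W) datasets; names and values are illustrative only.
rdd1 = [("spark", 1), ("hadoop", 4)]             # type (K, V)
rdd2 = [("spark", "fast"), ("hadoop", "batch")]  # type (K, W)

# join: all (K, (V, W)) pairs whose keys match
joined = [(k1, (v, w)) for k1, v in rdd1 for k2, w in rdd2 if k1 == k2]
# [('spark', (1, 'fast')), ('hadoop', (4, 'batch'))]
```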
Output the result of joinrdd, when the following code is run.
Q. 5 True or False ?
Apache Spark can potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.
A. True
B. False
Answer: A) True
Explanation: The biggest claim from Spark regarding speed is that it is able to "run
programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk."
Spark can make this claim because it does the processing in the main memory of the worker nodes and avoids unnecessary I/O operations with the disks. The other advantage Spark offers is the ability to chain tasks at the application programming level without writing to the disks at all, or while minimizing the number of such writes.
A. MLlib
B. GraphX
C. RDDs
D. Spark Streaming
A. GraphX
B. MLlib
C. Spark streaming
D. All of the mentioned
Answer: A) GraphX
Explanation: GraphX is Apache Spark's API for graphs and graph-parallel computation. It is
a distributed graph processing framework on top of Spark.
Q. 8 Which of the following are the simplest NoSQL databases ?
A. Wide-column
B. Key-value
C. Document
D. All of the mentioned
Answer: B) Key-value
Explanation: Every single item in the database is stored as an attribute name (or “key”),
together with its value in Key-value stores.
Statement 1: Scale out means grow your cluster capacity by replacing with more powerful
machines.
Statement 2: Scale up means incrementally grow your cluster capacity by adding more
COTS machines (Components Off the Shelf).
Explanation:
Scale out = incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).
Scale up = grow your cluster capacity by replacing existing machines with more powerful ones.
__________________________________________________________________________
Quiz Assignment-IV Solutions: Big Data Computing (Week-4)
___________________________________________________________________________
A. Simple strategy
B. Network topology strategy
C. Quorum strategy
D. None of the mentioned
Answer: A) Simple strategy
Explanation: Simple strategy treats the entire cluster as a single data center. It is suitable for a single data center and one rack, and is also called the Rack-Unaware Strategy. Simple Strategy uses a partitioner, of which there are two kinds: the Random Partitioner and the Byte Ordered Partitioner.
A. Simple strategy
B. Network topology strategy
C. Quorum strategy
D. None of the mentioned
Answer: B) Network topology strategy
Explanation: Network topology strategy is used to specify data centers and the number of replicas to place within each data center. It attempts to place replicas on distinct racks to avoid node failures and to ensure data availability. In the network topology strategy, the two most common ways to configure multiple data center clusters are: two replicas in each data center, and three replicas in each data center.
Q. 3 True or False ?
A Snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra
about the network topology so that requests are routed efficiently and allows Cassandra to
distribute replicas by grouping machines into data centers and racks.
A. True
B. False
Answer: True
Explanation: A Snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. Specifically, the replication strategy places the replicas based on the information provided by the snitch. All nodes in a cluster should use the same snitch configuration. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical location).
P: All nodes see same data at any time, or reads return latest written value by any client
Q: The system allows operations all the time, and operations return quickly
R: The system continues to work in spite of network partitions
Explanation:
CAP Theorem states following properties:
Consistency: All nodes see same data at any time, or reads return latest written value by any
client.
Availability: The system allows operations all the time, and operations return quickly.
Partition-tolerance: The system continues to work in spite of network partitions.
Statement 1: In Cassandra, during a write operation, when hinted handoff is enabled and any replica is down, the coordinator writes to all the other replicas and keeps the write locally until the down replica comes back up.
A. Key-value
B. Memtable
C. Gossip
D. Heartbeat
Answer: C) Gossip
Explanation: Cassandra uses a protocol called gossip to discover location and state
information about the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer
communication protocol in which nodes periodically exchange state information about
themselves and about other nodes they know about.
Answer: D) If writes stop, all reads will return the same value after a while
Explanation: Cassandra offers Eventual Consistency. It says that if writes to a key stop, all replicas of the key will converge automatically.
Statement 1: When two processes are competing with each other causing data corruption, it is
called deadlock
Statement 2: When two processes are waiting for each other directly or indirectly, it is called
race condition
Explanation:
Statement 1: When two processes are competing with each other causing data corruption, it is called a Race Condition.
Statement 2: When two processes are waiting for each other directly or indirectly, it is called a deadlock.
Q. 9 Which of the following is incorrect statement ?
Answer: D) The ZooKeeper framework was originally built at "Google" for accessing their
applications in an easy and robust manner
Explanation: The ZooKeeper framework was originally built at "Yahoo!" for accessing their
applications in an easy and robust manner
Q. 10 In Zookeeper, when a _______ is triggered the client receives a packet saying that the
znode has changed.
A. Event
B. Row
C. Watch
D. Value
Answer: C) Watch
Explanation: ZooKeeper supports the concept of watches. Clients can set a watch on a znode.
A. Chunks
B. Ensemble
C. Subdomains
D. None of the mentioned
Answer: B) Ensemble
daynum  year  month  date  max. temperature
1       1943  10     1     14.1
2       1943  10     2     16.4
...
21223   2001  11     7     16
The same maximum temperature occurs at different hours of the same day. Choose the correct
CQL query to:
Alter table temperature_details to add a new column called “seasons” using map of type
<varint, text> represented as <month, season>. Season can have the following values
season={spring, summer, autumn, winter}.
Update table temperature_details where columns daynum, year, month, date contain the
following values- 4317,1955,7,26 respectively.
Use the select statement to output the row after the update.
Note: A map relates one item to another with a key-value pair. For each key, only one value
may exist, and duplicates cannot be stored. Both the key and the value are designated with a
data type.
A)
cqlsh:day3> alter table temperature_details add hours1 set<varint>;
cqlsh:day3> update temperature_details set hours1={1,5,9,13,5,9} where daynum=4317;
cqlsh:day3> select * from temperature_details where daynum=4317;
B)
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where
daynum=4317 and year =1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and
month=7 and date=26;
C)
cqlsh:day3>alter table temperature_details add hours1 list<varint>;
cqlsh:day3> update temperature_details set hours1=[1,5,9,13,5,9] where daynum=4317 and
year = 1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and
month=7 and date=26;
Answer: B)
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where
daynum=4317 and year =1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and
month=7 and date=26;
Explanation: Option B correctly adds the new column "seasons" as a map of type <varint, text>, updates it with the month-to-season pair {7:'spring'} for the row identified by daynum=4317, year=1955, month=7 and date=26, and then selects that row. Options A and C add a set and a list respectively, not a map.
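As an aside, the map semantics described in the note (one value per key, duplicates overwritten) behave like a Python dict:

```python
# A Python dict mirrors the CQL map type: for each key only one value may exist.
seasons = {}
seasons[7] = "spring"
seasons[7] = "summer"  # same key: the earlier value is replaced, not duplicated
```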
___________________________________________________________________________
Quiz Assignment-V Solutions: Big Data Computing (Week-5)
___________________________________________________________________________
Q. 1 True or False ?
A. True
B. False
Answer: A) True
Explanation: Apache HBase is a column-oriented NoSQL database that runs on top of the Hadoop Distributed File System, a main component of Apache Hadoop.
Q. 2 A small chunk of data residing in one machine which is part of a cluster of machines
holding one HBase table is known as__________________
A. Rowarea
B. Tablearea
C. Region
D. Split
Answer : C) Region
Explanation: In HBase, tables are split into regions and served by region servers.
A. 1
B. 2
C. 3
Answer : A) 1
A. Stores
B. HMaster
C. Region Server
D. Cell
Answer: D) Cell
Explanation: Data is stored in the cells of HBase tables. A cell is a combination of row, column family and column qualifier, and contains a value and a timestamp.
2. Region Server: HBase tables are divided horizontally by row key range into Regions. Regions are the basic building elements of an HBase cluster; they hold the distributed portions of the tables and are comprised of column families. A Region Server runs on an HDFS DataNode present in the Hadoop cluster.
Q. 6 True or False ?
Kafka is a high performance, real time messaging system. It is an open source tool and is a
part of Apache projects.
A. True
B. False
Answer: True
A. Chunks
B. Domains
C. Messages
D. Topics
Answer: D) Topics
Explanation: A topic is a category or feed name to which messages are published. For each
topic, the Kafka cluster maintains a partitioned log
Q. 8 True or False ?
Statement 1: Batch Processing provides ability to process and analyze data at-rest (stored
data)
Statement 2: Stream Processing provides ability to ingest, process and analyze data in-
motion in real or near-real-time.
A. Kafka Core
B. Kafka Connect
C. Kafka Streams
D. None of the mentioned
Answer: B) Kafka Connect
Explanation:
Kafka Connect is a framework to import event streams from other source data systems into Kafka and to export event streams from Kafka to destination data systems.
Q. 11 ________________is a central hub to transport and store event streams in real time.
A. Kafka Core
B. Kafka Connect
C. Kafka Streams
D. None of the mentioned
Answer: A) Kafka Core
___________________________________________________________________________
Quiz Assignment-VI Solutions: Big Data Computing (Week-6)
___________________________________________________________________________
Explanation: Credit card transactions can be clustered into fraud transactions using
unsupervised learning.
Explanation: When your model has to predict a numeric value instead of a category, then the
task becomes a regression problem. An example of regression is to predict the price of a
stock. The stock price is a numeric value, not a category. So this is a regression task instead
of a classification task.
Q. 3 ___________ refers to a model that can neither model the training data nor generalize to
new data.
A. Good fitting
B. Overfitting
C. Underfitting
D. All of the mentioned
Answer: C) Underfitting
Explanation: An underfit machine learning model is not a suitable model, and this is obvious from its poor performance on the training data. Usually, a model that is underfit will have both high training error and high testing error.
Q. 4 Which of the following is required by K-means clustering ?
Q. 5 Imagine you are working on a project which is a binary classification problem. You
trained a model on training dataset and get the below confusion matrix on validation dataset.
Based on the above confusion matrix, choose which option(s) below will give you correct
predictions ?
1. Accuracy is ~0.91
2. Misclassification rate is ~ 0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
A. 1 and 3
B. 1 and 4
C. 2 and 4
D. 2 and 3
Answer: B) 1 and 4
Explanation:
The True Positive Rate measures how often the positive class is predicted correctly, so here the true positive rate is 100/105 = 0.95; it is also known as "Sensitivity" or "Recall".
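The confusion matrix itself is not reproduced here; assuming the commonly used values TP = 100, FN = 5, FP = 10, TN = 50 (an assumption consistent with the ratios quoted above), the metrics can be checked in a few lines of Python:

```python
# Hypothetical confusion-matrix counts (assumed, not reproduced from the question).
TP, FN, FP, TN = 100, 5, 10, 50

accuracy = (TP + TN) / (TP + TN + FP + FN)  # 150/165, about 0.91
tpr = TP / (TP + FN)                        # 100/105, about 0.95 (sensitivity/recall)
misclassification = 1 - accuracy            # about 0.09, not 0.91
fpr = FP / (FP + TN)                        # 10/60, about 0.17, not 0.95
```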
Q. 6 Identify the correct method for choosing the value of ‘k’ in k-means algorithm ?
A. Dimensionality reduction
B. Elbow method
C. Both Dimensionality reduction and Elbow method
D. Data partitioning
Answer: B) Elbow method
Q. 7 True or False ?
If your model has very low training error but high generalization error, then it is overfitting.
A. True
B. False
Answer: A) True
Explanation: A related concept to generalization is overfitting. If your model has very low
training error but high generalization error, then it is overfitting. This means that the model
has learned to model the noise in the training data, instead of learning the underlying
structure of the data.
Statement I: The idea of Post-pruning is to grow a tree to its maximum size and then remove
the nodes using a top-bottom approach.
Statement II: The idea of Pre-pruning is to stop tree induction before a fully grown tree is
built, that perfectly fits the training data.
Explanation:
In post-pruning, the tree is grown to its maximum size, then the tree is pruned by removing
nodes using a bottom up approach.
With pre-pruning, the idea is to stop tree induction before a fully grown tree is built that
perfectly fits the training data.
Q. 9 Which of the following options is/are true for K-fold cross-validation ?
1. Increase in K will result in higher time required to cross validate the result.
2. Higher values of K will result in higher confidence on the cross-validation result as
compared to lower value of K.
3. If K=N, then it is called Leave one out cross validation, where N is the number of
observations.
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2 and 3
Answer: D) 1, 2 and 3
Explanation: A larger K means less bias towards overestimating the true expected error (as the training folds will be closer to the total dataset) and a higher running time (as you get closer to the limit case of Leave-One-Out CV). We also need to consider the variance between the accuracies of the K folds when selecting K.
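A minimal Python sketch of K-fold index splitting, showing that K = N reduces to leave-one-out:

```python
def kfold_indices(n, k):
    """Split range(n) into k folds; returns a list of (train, test) index lists."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx, start, folds = list(range(n)), 0, []
    for size in sizes:
        test = idx[start:start + size]          # held-out fold
        train = idx[:start] + idx[start + size:]  # everything else
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(6, 3)  # 3 folds, 2 test samples each
loo = kfold_indices(4, 4)    # K = N: leave-one-out, each test fold has 1 sample
```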
Statement I: In supervised approaches, the target that the model is predicting is unknown or
unavailable. This means that you have unlabeled data.
Statement II: In unsupervised approaches the target, which is what the model is predicting, is
provided. This is referred to as having labeled data because the target is labeled for every
sample that you have in your data set.
___________________________________________________________________________
Quiz Assignment-VII Solutions: Big Data Computing (Week-7)
___________________________________________________________________________
Q. 1 True or False ?
The bootstrap sampling method is a resampling method that uses random sampling with
replacement.
A. True
B. False
Answer: A) True
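A short Python sketch of bootstrap sampling (random sampling with replacement); the data values are illustrative only:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible
data = [2, 4, 6, 8, 10]

# A bootstrap sample has the same size as the original dataset but is drawn
# WITH replacement, so some values may repeat and others may be left out.
bootstrap_sample = [random.choice(data) for _ in range(len(data))]
```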
Q. 2 True or False ?
Statement 1: Maximum likelihood estimation is a method that determines values for the
parameters of a model. The parameter values are found such that they maximize the
likelihood that the process described by the model produced the data that were actually
observed.
Statement 2: Bagging provides an averaging over a set of possible datasets, removing noisy
and non-stable parts of models.
Explanation: For high-cardinality problems, gain ratio is preferred over the Information Gain technique.
Q. 4 Given an attribute table shown below, which stores the basic information of attribute a,
including the row identifier of instance row_id , values of attribute values (a) and class labels
of instances c.
Attribute Table
A. Humidity
B. Outlook
C. Wind
D. None of the mentioned
Answer: B) Outlook
Entropy(X) = - Σ (i = 1 to n) pi * log(pi)
where X is the resulting split, n is the number of different target values in the subset, and pi is the proportion of the ith target value in the subset.
For example, the entropy of the Sunny subset is the following (the log here is base 10, which is what the figures below use):
Entropy (Sunny) = -2/5 * log(2/5) - 3/5 * log(3/5) = 0.159 + 0.133 = 0.292 (impure subset)
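The arithmetic can be verified with a short Python sketch; note that the 0.159 + 0.133 = 0.292 figure only comes out with base-10 logarithms:

```python
import math

def entropy(probs, base=10):
    # H = -sum(p_i * log(p_i)); zero-probability terms contribute nothing
    return -sum(p * math.log(p, base) for p in probs if p > 0)

h_sunny = entropy([2/5, 3/5])  # about 0.292 with base-10 logs
```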
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
Answer: A) 1 and 3
Explanation: Random forest is based on the bagging concept, which considers a fraction of the samples and a fraction of the features for building the individual trees.
Q. 6 Which of the following is/are true about Random Forest and Gradient Boosting
ensemble methods?
A. 1 and 2
B. 2 and 3
C. 1 and 4
D. 2 and 4
Answer: C) 1 and 4
Explanation: Both algorithms are designed for classification as well as regression tasks.
Q. 7 Boosting any algorithm takes into consideration the weak learners. Which of the
following is the main reason behind using weak learners?
A. Reason I
B. Reason II
C. Both the Reasons
D. None of the Reasons
Answer: A) Reason I
Explanation: To prevent overfitting, since the complexity of the overall learner increases at
each step. Starting with weak learners implies the final classifier will be less likely to overfit.
Q. 8 To apply bagging to regression trees which of the following is/are true in such case?
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2 and 3
Answer: D) 1, 2 and 3
___________________________________________________________________________
Quiz Assignment-VIII Solutions: Big Data Computing (Week-8)
___________________________________________________________________________
Q. 1 Which of the following statement(s) is/are true in the context of Apache Spark GraphX
operators ?
S1: Structural operators operate on the structure of an input graph and produce a new graph.
S2: Property operators modify the vertex or edge properties using a user defined map
function and produce a new graph.
S3: Join operators add data to graphs and produce a new graph.
Q. 2 GraphX provides an API for expressing graph computation that can model the
__________ abstraction.
A. GaAdt
B. Pregel
C. Spark Core
D. None of the mentioned
Answer: B) Pregel
A. A:ii, B: i, C: iii
B. A:iii, B: i, C: ii
C. A:ii, B: iii, C: i
D. A:iii, B: ii, C: i
Answer: B) A:iii, B: i, C: ii
Explanation:
First, dataflow systems such as Hadoop and Spark: data is a set of independent records pushed through a processing pipeline.
Second, graph systems such as GraphLab: the problem is modeled as a graph, and each node communicates with its neighbors.
Third, distributed shared-memory systems such as Bosen: the model is globally accessible and changed by external workers.
Each of these types of systems offers a different abstraction.
Q. 4 What is the PageRank score of vertex B after the second iteration? (Without damping
factor)
A. 1/6
B. 1.5/12
C. 2.5/12
D. 1/3
Answer: A) 1/6
Explanation: The PageRank score of each vertex is computed by summing, at every iteration, the rank contributions it receives from the vertices that link to it.
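The graph for this question is not reproduced here, but one PageRank iteration without damping can be sketched on a hypothetical 3-vertex graph:

```python
# Hypothetical graph (A -> B, A -> C, B -> C, C -> A); illustrative only,
# not the graph from the question.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = {v: 1 / 3 for v in out_links}  # uniform initial ranks

new_rank = {v: 0.0 for v in out_links}
for v, targets in out_links.items():
    share = rank[v] / len(targets)  # each vertex splits its rank among its out-links
    for t in targets:
        new_rank[t] += share
# B receives half of A's rank (1/6); C receives 1/6 + 1/3 = 1/2; A receives 1/3
```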
Statement 1: SSP interpolates between BSP (Bulk synchronous parallel) and Asynchronous
and subsumes both.
Q. 7 Which of the following are provided by spark API for graph parallel computations:
i. joinVertices
ii. subgraph
iii. aggregateMessages
A. Only (i)
B. Only (i) and (ii)
C. Only (ii) and (iii)
D. All of the mentioned
Answer: D) All of the mentioned
Explanation: joinVertices, subgraph and aggregateMessages are all operators provided by the Spark GraphX API.
S1: Apache Spark GraphX provides the following property operators - mapVertices(),
mapEdges(), mapTriplets()
S2: The RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persistent RDD is lost, the lost data can be recovered using the lineage graph information.
A. Only S1 is true
B. Only S2 is true
C. Both S1 and S2 are true
D. None of the mentioned
Answer: C) Both S1 and S2 are true